Coronavirus disease 2019 (COVID-19) is a highly contagious viral infection that continues to spread aggressively and has become a serious global health crisis [1-3]. So far, the disease has affected almost all countries, with more than 2.5 million deaths worldwide. Its clinical manifestations range from asymptomatic or mild flu-like symptoms to severe complications, including respiratory insufficiency requiring intensive care unit (ICU) hospitalization, where patients may be intubated for mechanical ventilation, and ultimately death [5, 6].
The high transmission rate of COVID-19, its unknown clinical patterns, and the lack of approved drug therapies or vaccines, coupled with a long incubation period, have put great pressure on healthcare organizations by increasing the demand for medical services and causing a surge in hospitalization volumes [7, 8]. In response to the pandemic, overwhelmed hospitals around the world have endeavored to curb the outbreak by leveraging predictive models to support decision-making. COVID-19 has exposed health systems to a serious scarcity of hospital resources (e.g., beds, oxygen generators, and personnel) and to exhaustion among care workers, which calls for accurate prediction models that can triage patients efficiently in advance and make the best use of limited resources [9, 10].
This highlights the need for objective, evidence-based solutions for the effective use of the medical resources available in hospitals (e.g., hospital beds, personnel, and respiratory ventilators) to prevent hospitals from being overwhelmed [11-13]. To reduce the pressure on hospitals and provide the best care for patients, especially in overwhelmed hospitals, it is necessary to predict the Length of Stay (LOS) effectively. An accurate estimate of patients' LOS would therefore be of substantial value for managing both medical resources and the distribution of caregivers [7, 14, 15]. Decreasing the LOS is effective in managing patient flow, enhancing resource utilization, improving patient safety, and reducing healthcare costs [16-20].
To this end, early and accurate estimation of hospital LOS would allow for optimal management of limited medical resources, hospital staffing, better patient scheduling, and an effective bed turnover process. In addition, it could help healthcare organizations design well-organized clinical pathways and recognize bottlenecks, so as to improve resource utilization, allocate resources proactively, and manage the healthcare supply chain better [16, 17]. It is important for hospital administrators, clinicians, and health policymakers to make sound decisions when allocating limited medical resources, particularly during the current COVID-19 pandemic [17, 21]. Given this context, the ability to identify patients who are at risk of a prolonged LOS during their hospitalization episode may be useful for identifying and prioritizing patient requirements, care planning, and optimizing service delivery. Many healthcare systems across the world strive to predict prolonged LOS by leveraging Machine Learning (ML) models to support decision-making [23-27]. ML, a sub-branch of Artificial Intelligence (AI), is widely regarded as an efficient and promising analytical technique for supporting decision-making in healthcare by automatically extracting practical patterns from large structured datasets [28, 29]. These techniques are well-known tools for developing predictive models and can implicitly extract useful information from raw datasets. In previous studies, a large number of ML algorithms were trained to forecast and classify hospital LOS, especially for cardiovascular conditions [11, 31-33], malignancies [34-36], and orthopedic conditions [37-39].
To this end, we developed and compared several ML algorithms to predict COVID-19 LOS from routine clinical data available at admission. More precisely, the study addressed the following questions: Which prediction model performs best? Which prediction models are most efficient? And which models achieve the highest accuracy?
Material and methods
This was a retrospective, single-center, cross-sectional study conducted in 2021 to predict the LOS of COVID-19 patients using selected data-driven ML techniques. It was conducted in six stages: (1) data set description and participants, (2) data preprocessing, (3) feature reduction, (4) model development, (5) experiment evaluation, and (6) ethical considerations.
Data set description and participants
In this study, a hospital-based COVID-19 registry database from Imam Khomeini Hospital, Ilam city, western Iran, was retrospectively reviewed. Only COVID-19 patients with a positive real-time reverse-transcriptase PCR (RT-PCR) test who were admitted between January 9, 2020, and January 20, 2021, met the inclusion criteria. During this period, a total of 12885 suspected COVID-19 cases were referred to the ambulatory and Emergency Departments (EDs) of Imam Khomeini Hospital. Of those, 3350 cases were confirmed as COVID-19 by RT-PCR. After the inclusion and exclusion criteria were applied, 1225 records were included in the study (Figure 1). To protect the privacy and confidentiality of patients, we concealed the unique identifying information of all patients during data collection and presentation.
Figure 1: Flow chart describing patient selection
The exclusion criteria for patient selection were: 1) non-COVID-19 cases, non-hospitalized COVID-19 cases, or patients with unknown disposition; 2) patients younger than 18 years of age; 3) incomplete case records (more than 70% missing data); and 4) admission before January 9, 2020, or after January 20, 2021.
The included cases were described by 53 risk factors in five main classes: patient demographics (five variables), clinical manifestations (14 variables), comorbidities (seven variables), laboratory results (26 variables), and treatment (one variable) (see Table 2).
Incomplete case records with more than 70% missing data were excluded from the analysis. The remaining missing values were imputed with the mean or mode of each variable. Noisy and abnormal values, errors, duplicates, and meaningless data were checked by the researchers in collaboration with two infectious disease specialists and hematologists. Where the interpretation of the data was unclear during preprocessing, we consulted the corresponding physicians.
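The preprocessing rules described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names, the record layout (dictionaries with None for missing entries), and the toy values are assumptions made only for illustration.

```python
from statistics import mean, mode

def drop_sparse_records(records, max_missing=0.70):
    """Keep records whose share of missing (None) fields is at most max_missing."""
    kept = []
    for rec in records:
        missing_share = sum(v is None for v in rec.values()) / len(rec)
        if missing_share <= max_missing:
            kept.append(rec)
    return kept

def impute_column(values, kind):
    """Fill None entries with the mean (numeric) or the mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "numeric" else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns, not taken from the study dataset
creatinine = [0.9, None, 1.3, 1.1]
cough = ["yes", "no", None, "yes"]

creatinine_filled = impute_column(creatinine, "numeric")
cough_filled = impute_column(cough, "categorical")
```

In this sketch the mean is used for numeric variables and the mode for categorical ones, which is one common reading of "mean or mode of each variable".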
Feature selection (or variable selection) is an effective technique for determining the most meaningful variables, reducing the dimensionality of the dataset, and improving the efficiency of ML algorithms. In this study, variables whose correlation coefficient with LOS had a P-value below 0.2 were identified as effective risk factors for predicting the LOS of COVID-19 patients and were included in the ML models.
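A correlation-based filter of this kind might be sketched as follows. The sketch assumes a Pearson correlation and uses scipy.stats.pearsonr; the type of correlation coefficient is not specified here, and the helper name select_features and the toy data are hypothetical.

```python
from scipy.stats import pearsonr

def select_features(columns, target, alpha=0.2):
    """Return the names of columns whose correlation with target has p < alpha."""
    selected = []
    for name, values in columns.items():
        # pearsonr returns the correlation coefficient and its two-sided p-value
        r, p_value = pearsonr(values, target)
        if p_value < alpha:
            selected.append(name)
    return selected

# Hypothetical toy data: LOS in days and two candidate predictors
los_days = [3, 5, 4, 8, 10, 7, 12, 9]
columns = {
    "age": [30, 45, 40, 60, 75, 55, 80, 70],    # rises with LOS
    "unrelated_flag": [1, 0, 0, 1, 1, 0, 0, 1], # nearly uncorrelated with LOS
}

selected = select_features(columns, los_days)
```

With these toy values only the age column passes the P < 0.2 threshold.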
To compare the performance of the selected ML algorithms, namely Artificial Neural Network (ANN), Radial Basis Function (RBF), Support Vector Machine (SVM), Feedforward Neural Network (FNN), Probabilistic Neural Network (PNN), pattern recognition network, and Decision Tree (DT), we carried out an experiment that evaluated both the effectiveness and the efficiency of the models. The model parameters are shown in Table 1.
All experiments were implemented and tested using the Python programming language and the scikit-learn library. Scikit-learn provides a set of ML algorithms for classification and prediction. ML techniques developed in Python are applied to a variety of real-world problems, and the library offers a well-defined framework for experimenters and developers to build and evaluate their models.
In the present study, 10-fold cross-validation was applied to obtain an unbiased estimate of each prediction algorithm's performance. To compare the performance of the different algorithms in predicting LOS, several evaluation metrics were calculated, including accuracy, sensitivity, specificity, and mean Area Under the Curve (AUC). During the evaluation process, a confusion matrix was produced for the two classes.
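The 10-fold cross-validation procedure can be sketched with scikit-learn as follows. This is an illustration only: the data are synthetic, the SVM uses default settings rather than the parameters in Table 1, and the stratified, shuffled folds and the feature scaling step are assumptions rather than details reported in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for 20 predictors and a binary LOS label
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Scaling inside the pipeline is refit on each training fold, avoiding leakage
model = make_pipeline(StandardScaler(), SVC())

# Ten folds; stratification keeps the class ratio similar in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "roc_auc"])

mean_accuracy = scores["test_accuracy"].mean()
mean_auc = scores["test_roc_auc"].mean()
```

Each entry of scores["test_accuracy"] is the accuracy on one held-out fold, so the mean over the ten entries is the cross-validated estimate.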
A confusion matrix is a table that provides a convenient way to assess the performance of a classification model (or "classifier"). Each row in a confusion matrix shows an actual class, while each column represents a predicted class (Table 2).
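For a two-class problem, accuracy, sensitivity, and specificity follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions; the function names, the coding of the positive class as 1, and the toy labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

def classification_metrics(tp, fn, fp, tn):
    """Standard metrics derived from the confusion matrix cells."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
```

Here three of the four positive cases and five of the six negative cases are classified correctly, giving a sensitivity of 0.75 and an accuracy of 0.8.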
The study was approved by the ethics committee of Ilam University of Medical Sciences (ethics code: IR.MEDILAM.REC.1399.294).
Results and Discussion
Demographic and clinical characteristics
After the exclusion criteria were applied, a total of 1225 patients met the eligibility criteria (Fig. 1). Of the 1225 hospitalized COVID-19 patients, 664 (54.20%) were male and 561 (45.80%) were female, and the median age of participants was 57.25 years (interquartile 18-100). A total of 170 patients (13.87%) were hospitalized in the ICU and 1055 (86.13%) in general wards. Of these, 1136 (92.75%) recovered and 89 (7.25%) died. Descriptive statistics for the 1225 records in this dataset are shown in Table 3.
Variables included in the ML models
The results of feature selection, which identified the most important diagnostic criteria affecting COVID-19 hospital LOS based on the correlation coefficient at P<0.2, are shown in Table 4.
After feature selection, 20 diagnostic criteria were retained based on the correlation coefficient at P<0.2. These variables, namely age, creatinine, white-cell count, lymphocyte/neutrophil count, BUN, ASP, ALT, LDH, activated partial thromboplastin time, cough, hypertension, cardiovascular disorders, diabetes, dyspnea, oxygen therapy, pneumonia, GI complications, ESR, and C-reactive protein, were identified as the most significant features (predictors) of hospital LOS.
Performance evaluation of models
The 10-fold cross-validation method was applied to run and evaluate the models. After preprocessing, we analyzed model performance using evaluation criteria including accuracy, sensitivity, specificity, and AUC-ROC. In this section, we assess the effectiveness and efficiency of all classifiers with respect to running time and the numbers of correctly and incorrectly classified cases (Table 5).
To better measure the actual performance of the classifiers, each model was run in 10 independent iterations. We then evaluated the performance of each classifier in terms of mean accuracy, standard deviation of accuracy, mean specificity, and mean sensitivity (Table 6). Once a classifier has been run, its performance can be examined. On this basis, we compared the predictive performance of the ANN, RBF, SVM, FNN, PNN, pattern recognition network, and DT techniques.
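Summarizing repeated runs by their mean and standard deviation can be sketched as follows. The helper repeat_runs and the placeholder toy_run are hypothetical: toy_run returns made-up accuracies and stands in for training and evaluating a real classifier with a given random seed. The sample standard deviation is used here, which is an assumption.

```python
import random
from statistics import mean, stdev

def repeat_runs(run_once, n_runs=10):
    """Call run_once(seed) for seeds 0..n_runs-1 and summarize the accuracies."""
    accuracies = [run_once(seed) for seed in range(n_runs)]
    return {
        "runs": accuracies,
        "mean_accuracy": mean(accuracies),
        "std_accuracy": stdev(accuracies),  # sample standard deviation
    }

def toy_run(seed):
    """Placeholder for one train-and-evaluate cycle; the values are made up."""
    rng = random.Random(seed)
    return 0.9 + 0.1 * rng.random()

summary = repeat_runs(toy_run, n_runs=10)
```

In practice run_once would train the model with the given seed and return its measured accuracy, and the same summary could be computed for sensitivity and specificity.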
Figure 2: Confusion matrix calculation for the selected ML algorithms
Figure 3: ROC curves of all ML algorithms
The confusion matrix metrics and AUC-ROC of the different classifiers are compared in Figures 2 and 3.
In the present study, the prediction model with the highest performance on the evaluation criteria (accuracy, specificity, sensitivity, and running time) was chosen as the best algorithm and implemented in a Clinical Decision Support System (CDSS). Table 6 presents the performance measures for the ML models. According to the experimental results over 10 iterations, the SVM algorithm, with a mean accuracy of 99.5%, mean specificity of 99.7%, mean sensitivity of 99.4%, and a standard deviation of 1.2, outperformed the other techniques. The AUC-ROC for SVM was 99.8%. In addition, the SVM algorithm was the fastest, taking up to 0.07 seconds to build its model, whereas ANN was the slowest, taking about 1530 seconds.
This study aimed to compare the accuracy and efficiency of selected ML techniques for predicting COVID-19 in-hospital LOS. The need for this research arose from the increasing demand for ML capabilities as a scientific and objective means of predicting LOS. During the pandemic, most healthcare settings have experienced reduced capacity due to high referral volumes and bed occupancy, as well as prolonged hospitalization. Accurate prediction of LOS can support bed management and the projection of future requirements for optimal allocation of medical resources [17, 21]. Predicting the demand for hospital beds (and associated medical resources) offers key evidence for hospital staffing and resource planning decisions. It is important for clinicians and health policymakers to make sound decisions when allocating limited resources [21, 42]. ML-based prediction models (intelligent systems) have been shown to be useful for LOS estimation. They reduce uncertainty and ambiguity by offering a systematic, evidence-based approach to hospital resource utilization and care planning [11, 43].
For this purpose, several ML methods, including ANN, RBF, SVM, FNN, PNN, pattern recognition network, and DT, were trained on the selected predictor variables. Feature selection is an important step in preparing and customizing the data before feeding it to ML classifiers. In this study, the 53 primary features were reduced to 20 using the correlation coefficient at P<0.2. In the literature, some studies have sought to identify the key risk factors for COVID-19 hospital LOS [17, 21, 42]. The top clinical variables associated with longer LOS in the reviewed studies included age (basic data), cardiovascular diseases and hypertension (underlying diseases), fever and low oxygen saturation (manifestations), leukocytosis (immunological), pulmonary lesions (radiological), mechanical ventilation (oxygen therapy), and increased BUN. In general, the classification and prioritization of variables in the reviewed studies agreed closely with the most common variables in the current study (Table 5).
So far, multiple studies have applied ML techniques to predict LOS in hospitalized patients [11, 32, 34, 45-47]. ML has been shown to have many applications for hospital LOS during the COVID-19 pandemic. However, to the best of our knowledge, only a limited number of studies have used ML techniques to predict COVID-19 LOS. For example, Dan et al. (2020), in a retrospective study, developed an SVM-based model to predict the length of ICU stay for COVID-19 patients; the results showed good performance, with an AUC-ROC of 91% and a Mean Absolute Error (MAE) of 0.723. Pei et al. assessed the performance of selected ML algorithms, including K-Nearest Neighbors (K-NN), Logistic Regression (LR), and Random Forest (RF), for predicting patients' hospital LOS during the COVID-19 pandemic, with accuracies of 0.3442, 0.3524, and 0.3541, respectively. Kabir and Hijjry (2020) developed a prediction model to anticipate LOS, and their results showed that a Back Propagation (BP) Neural Network, with an accuracy of 92.58%, outperformed all other ML models examined [44, 48]. Mahboub utilized a DT classifier to predict the hospital LOS of COVID-19 patients; the experimental results showed that this algorithm performed very well, with a sensitivity of 96.5%, specificity of 87.8%, and accuracy of 96%. Kulkarin (2021) designed a Neural Network-Multi Layered Perceptron (ANN-MLP) based model for predicting prolonged LOS, with an accuracy of 90.87%. Sinha et al. (2021) also showed that ANN achieved the best predictive performance for length of ICU stay in hospitalized COVID-19 patients, with a Root Mean Square Error (RMSE) and MAE of 5.9451 and 4.6354, respectively. East’s results showed that the model developed with ANN yielded the best performance for predicting long LOS (AUC of 0.9760%). Chiari et al. compared the performance of two regression algorithms, RF and Extra Trees, for COVID-19 LOS prediction; the experimental results showed good performance, with accuracies of 98% and 95%, respectively.
In this study, multiple ML-based prediction models, including ANN, RBF, SVM, FNN, PNN, pattern recognition network, and DT, were trained and evaluated to determine the best algorithm for predicting COVID-19 LOS. Unlike previous studies, in which ANN techniques performed best for forecasting LOS, the results of 10 iterations of the selected ML algorithms in the present work showed that the SVM classifier, with a mean accuracy of 99.5% and mean specificity and sensitivity of 99.7% and 99.4%, respectively, had greater predictive capability than the other ML methods. The model suggested in this study can estimate the LOS of patients with high performance. It supports better planning by hospital administrators, policymakers, and clinicians in order to improve patient outcomes and quality of care, especially in organizations facing resource challenges. It reduces ambiguity by offering a scientific, evidence-based model for resource utilization and episode-of-care planning. Although the model takes only a small number of input features (20), it provides a precise estimate of LOS. Moreover, the proposed model is simple and accurate, and can be readily implemented in clinical practice.
This study had some limitations that need to be acknowledged. First, we used a retrospective dataset that may suffer from class imbalance and from noisy, duplicate, and meaningless values, which may cause prediction bias. Second, the study was conducted at a single center and was based on only 1225 records, which limits the generalizability of the predictive model and may have affected the performance metrics of the proposed models. Moreover, we used only seven ML algorithms for the prediction analyses. Finally, the selected dataset lacked some important clinical variables, such as radiological indicators. In the future, the performance of our computational model may be improved by testing more ML techniques on larger, multicenter, prospective datasets with higher-quality data on more diverse variables.
Estimating the LOS of hospitalized COVID-19 patients with an objective, evidence-based approach is crucial for effective bed management, better patient scheduling, proper staffing, and tailored resource allocation. In this study, a statistical feature selection method was first applied to identify predictors of COVID-19 in-hospital LOS. We then developed and evaluated several ML models to predict the LOS of COVID-19 patients using a routine clinical data set. The evaluation demonstrated the suitability of these models, in particular the SVM model, for predicting in-hospital LOS. This model has the potential to support informed decisions for the effective management of COVID-19 patients. In addition, it can support the allocation of limited hospital resources and enhance the quality of health care.
This article is extracted from a research project supported by the Ilam University of Medical Sciences (IR.MEDILAM.REC.1399.294). We also thank the Research Deputy of the Ilam University of Medical Sciences for financially supporting this project.
This research received financial support from Ilam University of Medical Sciences.
All authors contributed to study design, data mining, statistical analysis, and reporting of the results, and agreed to be responsible for all aspects of this work.
Conflict of Interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
HOW TO CITE THIS ARTICLE
Mohammad Reza Afrash, Hadi Kazemi-Arpanahi, Parvaneh Ranjbar, Raoof Nopour, Morteza Amraei, Mojgan Saki, Mostafa Shanbehzadeh. Predictive Modeling of Hospital Length of Stay in COVID-19 Patients Using Machine Learning Algorithms, J. Med. Chem. Sci., 2021, 4(5) 525-537