Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has affected millions of people worldwide. The disease has a diverse, multidimensional clinical picture; its incubation period is up to 14 days, and symptoms appear on average within 4-5 days after exposure [1-3]. Its manifestations range from asymptomatic infection to severe pneumonia, acute respiratory distress syndrome (ARDS), and even death. Reportedly, approximately 80% of patients with COVID-19 are asymptomatic or have mild to moderate symptoms, and about 15% present with severe symptoms and are referred to critical care units [5, 6]. In about 5% of cases, the patient's condition becomes critical and may require admission to an intensive care unit (ICU). Despite widespread vaccination, the number of new cases caused by much more contagious variants is still increasing. Older age, male gender, obesity (BMI > 40), underlying diseases, and hypoxemia are reported to be important factors that considerably worsen the condition. The severe or acute stage of the disease is characterized by complications such as ARDS, cytokine storm syndrome, and multiple organ failure (MOF) [5, 6]. Meanwhile, many hospitalized patients at advanced stages need drug therapy to prevent deterioration and reduce respiratory complications. So far, several drugs have been proposed to avoid severe complications and mortality caused by COVID-19. However, no approved drug therapy has yet been found to treat COVID-19. Many drugs have lacked significant effects because of the complex, poorly understood, and mutable nature of the disease. In addition, physicians have reported difficulty in predicting the adverse effects of COVID-19 drugs.
The need for such prediction is more pronounced given the increasing number of adverse drug reactions and the current unpredictability of the disease's behavior and course [11,12]. Therefore, predictive models of the likelihood of adverse drug effects in COVID-19 can improve the quality of drug therapy by reducing severe drug allergies and interactions.
To address this problem, clinical decision support systems (CDSS) based on artificial intelligence (AI) are of great importance [13-16]. Machine learning (ML) algorithms, a subset of AI, are extensively used for the screening, diagnosis, prognosis, and prediction of COVID-19 outcomes [17, 18]. These mathematical models can quickly combine and analyze large volumes of data. In addition, ML algorithms are applied to generate predictive models that support and improve clinical decision-making for a wide range of outcomes [19-22]. In previous studies, many ML-based models were developed to estimate COVID-19 severity, patient deterioration [17,23], ICU admission [23-27], and mortality [24,25,28-34]. However, few studies have applied ML techniques to predict adverse drug effects among hospitalized COVID-19 patients.
Therefore, the present study was performed to develop and compare several ML-based predictive models for estimating adverse drug reactions among hospitalized COVID-19 patients.
Materials and Methods
This study aimed to predict adverse drug effects among hospitalized COVID-19 patients and was carried out in the four stages described below.
In this study, a hospital-based COVID-19 Electronic Medical Record (EMR) database from Mostafa Khomeini Medical Center, affiliated with Ilam University of Medical Sciences, in the west of Iran, was investigated retrospectively for the period from February 9, 2020, to July 20, 2021. During this period, a total of 2854 suspected COVID-19 cases were referred to this center, of whom 853 tested positive for COVID-19, and their clinical data were recorded in the EMR database as 60 features in seven categories. These features were categorized as laboratory (25 variables), demographic (five variables), clinical manifestations (18 variables), prescription (four variables), history of diseases (four variables), epidemiologic (two variables), and hospitalization (two variables), as shown in Table 1. The output variable was classified into two groups: hospitalized COVID-19 patients with adverse drug effects (code 1) and those without them (code 0).
Dataset Preparation and Analysis
First, two experienced health information managers (M.SH and R.N) reviewed all the case records, consulting two infectious disease and internal medicine specialists about the quantitative and qualitative attributes of the medical documentation and their suitability for statistical analysis. Samples with more than 70% missing values, which could not contribute meaningfully to the statistical analysis, were excluded from the study. For the cases with less than 70% missing values, missing quantitative values were imputed with the mean and missing qualitative values were imputed with the K-nearest neighbors (KNN) method, using specific values of K, in RapidMiner 7.1. Finally, feature selection was used to obtain the most relevant variables for predicting adverse drug effects and to reduce the dimensionality of the dataset. Because databases contain large amounts of data with many unrelated attributes, this process is essential in data science and data mining applications: it retains pertinent features and eliminates useless data elements [35-37]. Its advantages include (1) removing irrelevant attributes, (2) clustering the dataset using more related features, (3) improving algorithm performance, (4) reducing training time and making data mining results more understandable, and (5) preventing overfitting [38-40]. In this study, the Chi-square test of independence (χ2) was used to determine the features most strongly associated with the dependent variable (adverse drug effect). P<0.05 was considered statistically significant.
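The Chi-square filter described above can be illustrated with a minimal Python sketch. It assumes a binary feature cross-tabulated against the binary outcome (a 2x2 table, one degree of freedom) and applies no continuity correction; the study itself ran this step in its own software, and the function names, feature names, and counts below are hypothetical.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table.

    table[i][j] = number of patients with feature level i and outcome j.
    Assumes every row and column total is non-zero; no continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def select_features(tables, alpha=0.05):
    """Keep the features whose association with the outcome has p < alpha."""
    return [name for name, t in tables.items() if chi_square_2x2(t)[1] < alpha]

# Hypothetical counts, NOT taken from the study.
# Rows: feature absent / present. Columns: no adverse effect / adverse effect.
tables = {
    "feature_a": [[200, 50], [106, 126]],
    "feature_b": [[150, 90], [156, 86]],
}
selected = select_features(tables)
```

In this toy input only `feature_a` passes the P<0.05 threshold. Features with more than two levels would need the general Chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom, which this sketch does not cover.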
Decision tree models
To build the predictive models for adverse drug effects among hospitalized COVID-19 patients, we used four decision tree algorithms, selected because they are widely used in recent articles and have performed well:
J-48: This algorithm, also known as C4.5, is an extension of the ID3 decision tree with additional capabilities such as handling missing values when classifying samples, adjusting the tree size by pruning with a confidence factor, extracting rules, and handling attributes with continuous numeric ranges. Through pruning, it also strikes a good balance between accuracy and generalization: the tree is first grown until the samples are fully classified and is then pruned. In other words, overfitting is limited, and the rules are generated from knowledge contained in the dataset itself. The J-48 algorithm builds the tree using the concept of entropy. Suppose that the training dataset consists of samples S = {S1, S2, S3, ..., Sn} and that every sample Si is described by a p-dimensional vector (X1,i, X2,i, X3,i, ..., Xp,i), where Xj,i is the value of the j-th feature for sample Si. The algorithm splits the tree on the attributes that yield larger entropy reductions than the others. It thus separates the samples so that the subtrees are as distinct as possible, with each leaf node dominated by samples of a single output class [41-43].
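As a rough illustration of the entropy-based splitting described above, the following Python sketch computes the entropy of a set of class labels and the entropy reduction (information gain) of a threshold split on one numeric feature. C4.5 additionally normalizes this quantity into a gain ratio, which is omitted here; the function names and the sample values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting a numeric feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

# Hypothetical feature values and outcomes, NOT taken from the study.
aptt = [28, 30, 31, 33, 35, 40]
outcome = [1, 1, 1, 0, 0, 0]
gain_at_31 = information_gain(aptt, outcome, 31)  # perfect split of this toy data
gain_at_29 = information_gain(aptt, outcome, 29)  # weaker split
```

A tree builder would evaluate such a score for every candidate attribute and threshold and split on the one with the largest value.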
Random Forest: The random forest is an ensemble decision tree algorithm made up of multiple subtrees that act as classifiers, each with a specified depth and number of nodes. It builds the trees flexibly by using randomly chosen features to split them. The accuracy of the random forest depends on the accuracy of each subtree in predicting the classes. Its prediction is determined by the votes of the subtrees; in other words, the output of the model is the class chosen by the majority of the subtrees. This algorithm is widely valued for classifying samples with high performance, tolerating outliers and noisy feature values, and limiting overfitting [44-47].
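The three ingredients mentioned above (resampled training data, randomly chosen split features, and majority voting) can be sketched in Python as follows. This is only an outline of the mechanics, not a full random forest: the tree-growing step is left out, and the helper names and the choice of four candidate features out of 18 are assumptions made for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, one such sample per subtree."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate feature indices for one split."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Return the class predicted by most subtrees (ties: first seen wins)."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)                        # fixed seed for repeatability
rows = list(range(10))                        # stand-in for 10 patient records
sample = bootstrap_sample(rows, rng)          # training set for one subtree
candidates = random_feature_subset(18, 4, rng)
forest_output = majority_vote([1, 0, 1])      # votes of three hypothetical subtrees
```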
Decision stump: The decision stump has only one level, in which the root node is connected directly to the leaf nodes, in contrast with other decision tree algorithms, which have three kinds of nodes (root, internal, and leaf); the splitting process stops after the first split. This classifier is most commonly applied to mining high-dimensional datasets, but it can also be used on lower-dimensional datasets with binary splitting. It is also known as a single-rule classifier because it predicts the output class from the values of just one predictor variable [48-50].
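A single-split classifier of this kind can be sketched in a few lines of Python. The version below assumes one numeric feature and binary labels, and it picks the threshold with the fewest misclassified samples; implementations of the decision stump may instead use an entropy-based criterion. The function names and the data are hypothetical.

```python
def fit_stump(values, labels):
    """Find the single threshold that misclassifies the fewest samples.

    Returns (threshold, label_if_le, label_if_gt, errors) for labels 0/1,
    or None when the feature has only one distinct value.
    """
    best = None
    for threshold in sorted(set(values))[:-1]:
        for le_label in (0, 1):
            gt_label = 1 - le_label
            errors = sum(
                (le_label if x <= threshold else gt_label) != y
                for x, y in zip(values, labels)
            )
            if best is None or errors < best[3]:
                best = (threshold, le_label, gt_label, errors)
    return best

def predict_stump(stump, x):
    """Classify one value with the root split only."""
    threshold, le_label, gt_label, _ = stump
    return le_label if x <= threshold else gt_label

# Hypothetical lengths of stay (days) and outcomes, NOT taken from the study.
stay = [3, 4, 5, 8, 9, 12]
outcome = [0, 0, 0, 1, 1, 1]
stump = fit_stump(stay, outcome)
```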
Hoeffding tree: This tree is an incremental decision tree algorithm commonly used for very large datasets. It is a basic decision tree algorithm with limited flexibility when the dimensions of a massive dataset change. Its main advantage is that it can select the most discriminating attribute from a finite number of samples by using the Hoeffding bound. This bound states that, given enough observations of an attribute, the true value of a random variable lies within a predictable margin of its observed average. In this algorithm, a split decision can therefore be made after a specified number of samples with a predetermined confidence [51-53].
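The Hoeffding bound can be written as epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the variable, n the number of observations, and 1 - delta the desired confidence. The Python sketch below applies it to the usual split test, in which a node is split once the gap between the two best attributes exceeds epsilon. The function names, the default delta, and the gain values are assumptions chosen for illustration.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Margin epsilon: with probability 1 - delta, the true mean is no more
    than epsilon below the mean observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def can_split(best_gain, second_gain, n, delta=1e-7, n_classes=2):
    """Split when the observed gap between the two best attributes exceeds epsilon."""
    value_range = math.log2(n_classes)  # range of information gain, in bits
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)

# Hypothetical gains for the two best attributes at one leaf.
early = can_split(0.30, 0.20, n=200)    # margin still too wide to decide
later = can_split(0.30, 0.20, n=2000)   # margin has shrunk below the gap
```

Because epsilon shrinks as n grows, the same observed gap that is inconclusive after 200 samples becomes sufficient after 2000.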
Analyzing the decision tree algorithms' performance
After developing the selected decision tree algorithms with their most common technical parameters, we compared and evaluated their performance to identify the best algorithm for predicting adverse drug reactions among hospitalized COVID-19 patients. First, the confusion matrix (Table 2) was used to compare the algorithms' performance and their ability to classify samples. True Positive (TP) and True Negative (TN) denote hospitalized COVID-19 patients with (P) and without (N) drug side effects, respectively, who were correctly classified by the algorithms. False Positive (FP) and False Negative (FN) denote the cases incorrectly classified as positive and negative, respectively. Based on the confusion matrix, the TP-Rate, FP-Rate, Precision, Recall, F-Measure, and area under the receiver operating characteristic (ROC) curve (AUC) of each selected decision tree algorithm were calculated to measure and evaluate its performance. Ten-fold cross-validation was used to reduce error in measuring the algorithms' performance. Finally, the best decision tree algorithm for predicting adverse drug effects was identified using these evaluation criteria, and its tree was drawn. Afterward, the most important IF-THEN rules, those covering the largest number of classified samples, were extracted from the tree and interpreted as essential clinical knowledge for predicting adverse drug effects among hospitalized COVID-19 patients.
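The confusion-matrix measures listed above can be computed as in the following Python sketch. It assumes that the summary figures are class-weighted averages of the per-class rates (each class weighted by its number of samples), which is an assumption about how the reported values were aggregated; the function names are hypothetical. The example uses the J-48 counts given in the Results (TP=156, FN=20, TN=300, FP=6).

```python
def per_class_metrics(tp, fp, fn, tn):
    """Rates for the class treated as positive in a 2x2 confusion matrix."""
    recall = tp / (tp + fn)              # identical to the TP-Rate
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"tp_rate": recall, "fp_rate": fp_rate,
            "precision": precision, "f_measure": f_measure}

def weighted_metrics(tp, fp, fn, tn):
    """Average the per-class rates, weighting each class by its sample count."""
    pos = per_class_metrics(tp, fp, fn, tn)
    # For the negative class the roles of the four cells are swapped.
    neg = per_class_metrics(tp=tn, fp=fn, fn=fp, tn=tp)
    n_pos, n_neg = tp + fn, tn + fp
    total = n_pos + n_neg
    return {k: (n_pos * pos[k] + n_neg * neg[k]) / total for k in pos}

# J-48 counts as reported in the Results section.
j48 = weighted_metrics(tp=156, fp=6, fn=20, tn=300)
```

Under this weighting assumption, the counts give a TP-Rate of about 94.6%, an FP-Rate of about 7.9%, a Precision of about 94.7%, and an F-Measure of about 94.6%, which agrees with the values reported for J-48.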
Results and Discussion
After excluding the samples with 70% or more missing, noisy, or abnormal attribute values and applying exclusion criteria such as records of patients younger than 18 years or of patients discharged from or deceased in the emergency department, 371 records were excluded from the study. Therefore, 482 records remained for statistical analysis. Of these, 176 (36.5%) belonged to hospitalized COVID-19 patients with drug side effects, and 306 (63.5%) belonged to patients without them. In total, 227 (47.1%) records belonged to men, with a mean age of 50 ± 12.5 years, and 255 (52.9%) belonged to women, with a mean age of 52 ± 11.7 years. Table 3 presents the results of the Chi-square test of independence (χ2) used to determine the most important factors for predicting drug side/adverse effects among hospitalized COVID-19 patients at P<0.05.
Based on the information in Table 3, 18 variables were significant at P<0.05 and were retained as the final predictors. Of these, 15 variables were also significant at P<0.01. The length of hospitalization (χ2=89.758, P<0.01), white blood cell count (χ2=53.154, P<0.01), and activated partial thromboplastin time (χ2=118.196, P<0.01) were the strongest factors and were considered the best predictors of adverse drug effects among hospitalized COVID-19 patients. The results of classifying the samples with the selected decision tree algorithms, based on the confusion matrix, are presented in Table 4.
Based on the information in Table 4, the Decision stump, with TN=306 and FP=0, correctly classified all the negative samples (non-affected cases) and thus performed better on this class than the other decision tree algorithms. The Random Forest and J-48 algorithms, with TN=300 and FP=6, performed nearly as well, with only a slight difference from the Decision stump classifier. In contrast, the J-48 algorithm, with TP=156 and FN=20, classified the positive cases (those with adverse drug effects) best and performed considerably better than the other algorithms. The results of comparing the performance of the decision tree algorithms using different evaluation criteria are shown in Figure 1.
Figure 1: Different indicators of algorithms performance evaluation
The comparison in Figure 1 showed that the J-48 decision tree algorithm, with TP-Rate=94.6%, FP-Rate=7.9%, Precision=94.7%, Recall=94.6%, and F-Measure=94.6%, achieved the best overall performance. The Hoeffding tree algorithm, with TP-Rate=74.9%, Precision=75.1%, Recall=74.9%, and F-Measure=73%, had the worst performance. Figure 2 depicts the ROC curves of all the selected decision tree algorithms (the vertical and horizontal axes show the TPR and FPR, respectively).
Figure 2: The ROC of all selected decision tree algorithms
Comparing the selected decision tree algorithms using the ROC curves showed that the J-48 algorithm, with TP-Rate=94.6%, FP-Rate=7.9%, Precision=94.7%, Recall=94.6%, F-Measure=94.6%, and AUC=0.957, outperformed the others and was considered the best model for predicting adverse drug effects among hospitalized COVID-19 patients. Finally, the J-48 decision tree was drawn, and the essential clinical knowledge was extracted as IF-THEN rules and interpreted in more detail. To reduce the size of the tree and make the clinical knowledge easier to understand and extract, we pruned it by lowering the confidence factor to 0.1 (Figure 3). The most important technical settings for building the decision tree were batch size=100, binary splits=false, collapse tree=true, confidence factor=0.1, minimum number of instances per leaf=2, number of folds=3, seed=1, and use Laplace=false.
Some of the most important clinical rules, those covering the most classified samples, are listed below:
- IF (Activated partial thromboplastin time <=31) THEN (drug side/adverse effect =1),
- IF (Activated partial thromboplastin time >31) AND (Length of hospitalization <=6) AND (Loss of taste =No) THEN (drug side/adverse effect =0),
- IF (Activated partial thromboplastin time >31) AND (Length of hospitalization >6) AND (White cell count >8700) THEN (drug side/adverse effect =1).
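The three rules listed above can be written as a small Python function, shown here only to make their logic explicit. The function and parameter names are hypothetical, and because these rules cover only part of the pruned tree, the sketch returns None for any combination that none of them addresses.

```python
def predict_adverse_effect(aptt_seconds, stay_days, loss_of_taste, white_cell_count):
    """Apply the three listed rules; return 1, 0, or None when none applies."""
    if aptt_seconds <= 31:
        return 1  # Rule 1: APTT at or below 31 seconds
    if stay_days <= 6 and not loss_of_taste:
        return 0  # Rule 2: short stay without loss of taste
    if stay_days > 6 and white_cell_count > 8700:
        return 1  # Rule 3: longer stay with a high white cell count
    return None   # branch of the tree not covered by the listed rules

# Hypothetical patients, NOT taken from the study.
rule1 = predict_adverse_effect(30, 10, False, 5000)
rule2 = predict_adverse_effect(35, 5, False, 5000)
rule3 = predict_adverse_effect(35, 8, False, 9000)
uncovered = predict_adverse_effect(35, 8, False, 8000)
```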
In the J-48 decision tree, activated partial thromboplastin time had the highest information gain and was therefore considered the best variable for predicting drug side/adverse effects and placed at the root node. Rule 1 states that hospitalized COVID-19 patients with an activated partial thromboplastin time of 31 seconds or less presumably had adverse drug effects; 64 samples in the study followed this pattern. According to Rule 2, hospitalized COVID-19 patients with an activated partial thromboplastin time of more than 31 seconds who were hospitalized for six days or fewer and did not lose their sense of taste had no adverse drug effects with a probability of 87%. According to Rule 3, adverse drug effects were present in 39 hospitalized COVID-19 patients who had an activated partial thromboplastin time of more than 31 seconds, more than six days of hospitalization, and a white cell count above 8700.
Given the wide range of clinical manifestations of COVID-19, it is crucial to develop intelligent models that use ML techniques to predict the likelihood of adverse drug effects. Therefore, we examined four decision tree ML models built on the important parameters identified by the Chi-square test of independence. The decision tree models were J-48, Random Forest, Hoeffding tree, and Decision stump, applied to 482 patients with RT-PCR-confirmed COVID-19. Our results showed that the J-48 classifier performed better than the other selected ML algorithms, with an F-score of 94.6% and an AUC of 0.957. Treating patients with COVID-19 requires informed, evidence-based drug prescription, especially when hospitals face an increasing number of patients and a shortage of care facilities [55, 56]. In this regard, physicians state that they have difficulty predicting the likelihood of adverse drug effects.
Figure 3: The pruned J-48 decision tree algorithm
To address this problem, the design and implementation of AI-based CDSS will be valuable for optimal drug prescription and clinical decision support [17, 58]. For instance, ML-equipped CDSSs could assist clinicians in making clinical decisions by alerting caregivers and recommending interventions based on objective and generalizable empirical data. This study showed that ML algorithms, especially the J-48 classifier, may predict drug side and adverse effects in patients hospitalized with COVID-19.
To date, some studies have evaluated the application of ML techniques in predicting poor and adverse outcomes of drug prescription among COVID-19 patients. For example, Ganguli et al. (2021) developed an intelligent system based on ML algorithms to predict medication errors in hospitalized patients with COVID-19 using data from 1023 patients. They reported the best performance for the J-48 algorithm, with AUC = 0.84. Behery (2021) analyzed the data of 5643 COVID-19-negative and COVID-19-positive samples to predict drug allergy using selected ML models. The results showed that the J-48 algorithm had acceptable detection power, with 86% accuracy. Similarly, Lv et al. (2021) evaluated the performance of four ML algorithms for predicting adverse drug effects using information gathered from 3841 COVID-19 cases, and the J-48 model, with AUC = 0.92, was reported as the most suitable algorithm. Siqueira et al. (2021) evaluated four ML algorithms for predicting the likelihood of deterioration in patients with COVID-19 after drug therapy; the J-48 model, with the best AUC of 0.92, was reported as the superior algorithm. In the present study, the J-48 decision tree algorithm, with F-Score = 94.6% and AUC = 0.957, had the best capacity for early prediction of drug side and adverse effects in hospitalized patients with COVID-19. The high predictive indices of our J-48 model suggest that this algorithm can differentiate between high-risk and low-risk patients.
The main advantage of the present study is that we predicted the possibility of prescription side effects based on the most appropriate variables identified by the Chi-square test of independence. Nevertheless, the study also had some limitations. First, we analyzed a retrospective, single-center dataset with a limited sample size. Second, continuous changes in some crucial variables should be monitored closely to identify patients at higher risk of poor outcomes accurately and in time. Finally, the selected dataset lacked clinically essential variables such as radiological indicators. In the future, developing more ML techniques on a larger, multicenter, prospective dataset with more quantitative and reliable data should increase the accuracy and generalizability of the models.
In this study, the data recorded in the selected hospital database were analyzed. Then, ML models were developed and tested to predict possible drug side and adverse effects using 18 clinical features. The results revealed the acceptable performance of the J-48 decision tree model. Therefore, the developed predictive models can be useful for improving the quality of care, reducing the workload of the care team, minimizing prescription errors, and delivering patient-centered treatment.
This article is extracted from a research project supported by the Ilam University of Medical Sciences (Ethical code: IR.MEDILAM.REC.1399.294). We also thank the Ilam University of Medical Sciences Research Deputy for financially supporting this project.
This research received financial support from Ilam University of Medical Sciences.
All authors contributed to the study design, data mining, statistical analysis, and reporting of the results and agreed to be responsible for all aspects of this work.
Conflict of Interest
The author(s) declared no potential conflicts of interest concerning the research, authorship, and/or publication of this article.
HOW TO CITE THIS ARTICLE
Raoof Nopour, Mehrnaz Mashoufi, Morteza Amraei, Nahid Mehrabi, Alireza Mohammadnia, Abdollah Mahdavi, Nader Mirani, Mojgan Saki, Mostafa Shanbehzadeh. Performance Analysis of Selected Decision Tree Algorithms for Predicting Drug Adverse Reaction Among COVID-19 Hospitalized Patients, J. Med. Chem. Sci., 2022, 5(4) 537-549