Prediction of adverse drug reaction using machine learning and deep learning based on an imbalanced electronic medical records dataset
Early prediction of adverse drug reactions (ADRs) is crucial in clinical research. The growth of electronic medical records (EMRs) provides an excellent resource for retrospective studies, from which samples can be extracted and models built to predict clinical deterioration. However, classical statistical models such as multivariate logistic regression (LR) may give unreliable predictions on imbalanced datasets. To develop a trustworthy model on imbalanced ADR data, we first transformed the EMR data, including medical notes, into numeric variables. We then introduced support vector machine (SVM), random forest (RF), AdaBoost, XGBoost, and artificial neural network (ANN) models to deal with the challenge of high dimensionality. Furthermore, we used an ensembling approach to tackle data imbalance. Finally, we analyzed potential model mechanisms to provide interpretability and compared the methods by elapsed running time. The results showed that ensembling considerably improved the predictive ability of the various models. RF, AdaBoost, and XGBoost outperformed the baseline, and an ANN without fine-tuning performed comparably. These results demonstrate the great potential of machine learning models in the medical domain.
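The abstract does not say which ensembling scheme was used to handle imbalance. As a minimal sketch of one common option, and not necessarily the authors' method, the Python code below trains several base models, each on all minority-class samples plus an equally sized random subset of the majority class, and combines them by majority vote. The nearest-centroid base learner, the synthetic two-feature data, and all function names (make_toy_data, fit_centroid, fit_balanced_ensemble, predict_ensemble) are illustrative assumptions; in practice the base learner would be one of the models named above.

```python
import random
from statistics import mean


def make_toy_data(n_neg=950, n_pos=50, seed=0):
    """Hypothetical imbalanced data: many negatives near (0, 0), few positives near (3, 3)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_neg)]
    X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(n_pos)]
    y = [0] * n_neg + [1] * n_pos
    return X, y


def fit_centroid(X, y):
    """Stand-in base learner (assumption): store the mean feature vector of each class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict_centroid(model, x):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return 1 if dist2(model[1]) < dist2(model[0]) else 0


def fit_balanced_ensemble(X, y, n_models=10, seed=0):
    """Train n_models learners, each on a class-balanced under-sample of the data."""
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    models = []
    for _ in range(n_models):
        # Keep every minority sample; draw an equal number of majority samples.
        sampled = rng.sample(majority, len(minority))
        idx = minority + sampled
        models.append(fit_centroid([X[i] for i in idx], [y[i] for i in idx]))
    return models


def predict_ensemble(models, x):
    """Majority vote over the base learners (ties count as the positive class)."""
    votes = sum(predict_centroid(m, x) for m in models)
    return 1 if votes * 2 >= len(models) else 0


X, y = make_toy_data()
models = fit_balanced_ensemble(X, y)
print(predict_ensemble(models, [3.0, 3.0]), predict_ensemble(models, [0.0, 0.0]))
```

Because each base learner sees a balanced subset, no single model is dominated by the majority class, while the ensemble as a whole still draws on many different majority samples.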