Optimizing Ensemble Learning Models with SMOTE-ENN for Early Stroke Detection in Imbalanced Clinical Datasets
Abstract
Stroke remains a leading cause of mortality and long-term disability worldwide, including in Indonesia, which makes early risk identification an urgent need. Machine learning models for stroke prediction often suffer from severe class imbalance: stroke cases constitute only 4.9% of the clinical dataset used here, which biases predictions toward the majority class. This study evaluates three ensemble and kernel-based algorithms (Random Forest, XGBoost, and Support Vector Machine) combined with two resampling strategies (SMOTE and SMOTE-ENN) on the Healthcare Stroke Dataset (5,110 records, 11 clinical attributes). To prevent data leakage, resampling was applied strictly within each training fold of 5-fold stratified cross-validation, while all evaluations were conducted on the original imbalanced test set. The results show that XGBoost integrated with SMOTE-ENN achieved the highest minority-class sensitivity, improving PR-AUC by 23.5% in relative terms (0.1537 vs. 0.1244 with SMOTE alone) while detecting 24% of stroke cases (12 out of 50) in the test set. Although cross-validation results indicate strong class discrimination, with AUC-ROC values above 0.98, the low PR-AUC reflects the operational challenge of extreme class imbalance and the unavoidable trade-off between recall and precision, which results in more false positives. Consequently, the proposed model is best positioned as a first-tier population screening tool that flags high-risk individuals for confirmatory clinical diagnostics, rather than as a standalone diagnostic system. The approach remains computationally efficient (training time < 0.12 seconds) and substantially improves model stability, as evidenced by a 73% reduction in cross-validation variance. These findings support the integration of hybrid resampling techniques with ensemble learning as a practical and scalable framework for early stroke risk screening in resource-constrained primary healthcare settings.
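The fold-wise protocol described in the abstract (resample only the training portion of each stratified fold, evaluate on untouched data) can be illustrated with a minimal standard-library sketch. This is not the authors' code: the labels are a toy vector sized to match the reported 5,110 records and roughly 4.9% prevalence, the helper names `stratified_folds` and `oversample_minority` are hypothetical, and plain duplication of minority rows stands in for SMOTE-ENN, which would instead synthesize new minority samples and then remove noisy ones.

```python
import random
from collections import Counter


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds that each keep roughly the same class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def oversample_minority(indices, labels, rng):
    """Stand-in resampler (NOT SMOTE-ENN): duplicate minority rows until balanced."""
    counts = Counter(labels[i] for i in indices)
    minority = min(counts, key=counts.get)
    majority_n = max(counts.values())
    pool = [i for i in indices if labels[i] == minority]
    extra = [rng.choice(pool) for _ in range(majority_n - counts[minority])]
    return indices + extra


# Toy labels: 5,110 records with about 4.9% positives (assumed count, for illustration).
n_records = 5110
n_pos = round(0.049 * n_records)
labels = [1] * n_pos + [0] * (n_records - n_pos)

rng = random.Random(42)
folds = stratified_folds(labels, k=5)

report = []
for f, test_idx in enumerate(folds):
    # Training indices are every fold except the held-out one.
    train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
    # Resampling sees only training indices; the test fold is never modified.
    train_resampled = oversample_minority(train_idx, labels, rng)
    # A classifier would be fitted on train_resampled and scored on test_idx here.
    train_rate = sum(labels[i] for i in train_resampled) / len(train_resampled)
    test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
    overlap = len(set(train_resampled) & set(test_idx))
    report.append((train_rate, test_rate, overlap))
    print(f"fold {f}: train positive rate {train_rate:.3f}, "
          f"test positive rate {test_rate:.3f}, overlap {overlap}")
```

In each fold the resampled training set is balanced while the held-out fold keeps the original prevalence and shares no rows with the training data, which is the property that prevents leakage. With real data, the `Pipeline` class in imbalanced-learn together with its `SMOTEENN` sampler is one way to get the same behavior, since samplers in that pipeline are applied only during fitting.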
Pages: 2300-2311
Copyright (c) 2026 Dina Nurmala, Angga Bayu Santoso

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under Creative Commons Attribution 4.0 International License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (Refer to The Effect of Open Access).





















