Diabetes Mellitus Prediction with Ensemble Gradient Boosting and Advanced Feature Engineering
Abstract
Diabetes mellitus is a metabolic disease and a global health challenge whose prevalence continues to rise. Early detection through automated prediction systems can help reduce complications and treatment costs. This study develops a diabetes mellitus prediction system using an ensemble gradient boosting approach combined with advanced feature engineering. The research dataset combines 768 samples from the Pima Indians dataset with 5,000 samples from a diabetes prediction dataset, giving 5,768 data points in total, which were then balanced using the ADASYN oversampling technique. The feature engineering process transforms the 8 original features into 25 predictive features, including diabetes risk scores, BMI categories, age groups, and glucose categories. Three gradient boosting algorithms (XGBoost, LightGBM, and CatBoost) and an ensemble voting classifier were tuned with the Optuna framework using the Tree-structured Parzen Estimator. Models were evaluated on accuracy, precision, recall, F1-score, and ROC-AUC using 5-fold cross-validation. LightGBM achieved the best performance, with 97.14% accuracy and a ROC-AUC of 0.9976, followed by CatBoost (97.14%, 0.9973), which matched its accuracy but had a slightly lower ROC-AUC, and XGBoost (96.45%, 0.9971). Feature importance analysis identified DiabetesPedigreeFunction, Pregnancies, and SmokingHistory as key predictors. The developed model can be implemented as a diabetes screening system in primary healthcare facilities.
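The categorical part of the feature engineering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the cut-offs or the names of the derived features, so everything below is an assumption for illustration: the BMI edges follow the common WHO adult classes, the glucose edges follow the usual 2-hour oral glucose tolerance test ranges in mg/dL, the age bands are hypothetical, and the output names `BMICategory`, `GlucoseCategory`, and `AgeGroup` are invented. The input keys `Glucose`, `BMI`, and `Age` follow the Pima Indians column naming.

```python
from bisect import bisect_right

# Assumed cut-offs; the abstract does not state the ones actually used.
BMI_EDGES = [18.5, 25.0, 30.0]      # WHO adult BMI classes (kg/m^2)
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]
GLUCOSE_EDGES = [140.0, 200.0]      # 2-hour OGTT plasma glucose (mg/dL)
GLUCOSE_LABELS = ["normal", "impaired", "diabetic_range"]
AGE_EDGES = [30, 45, 60]            # hypothetical age bands (years)
AGE_LABELS = ["<30", "30-44", "45-59", "60+"]


def bin_value(value, edges, labels):
    """Return the label of the bin that contains value.

    Bins are closed on the left: a value equal to an edge falls into
    the bin that starts at that edge.
    """
    return labels[bisect_right(edges, value)]


def engineer_features(record):
    """Return a copy of record with three derived categorical features."""
    out = dict(record)  # leave the caller's record unchanged
    out["BMICategory"] = bin_value(record["BMI"], BMI_EDGES, BMI_LABELS)
    out["GlucoseCategory"] = bin_value(
        record["Glucose"], GLUCOSE_EDGES, GLUCOSE_LABELS
    )
    out["AgeGroup"] = bin_value(record["Age"], AGE_EDGES, AGE_LABELS)
    return out


# Usage on a single hypothetical patient record.
sample = {"Glucose": 148, "BMI": 33.6, "Age": 50}
engineered = engineer_features(sample)
print(engineered)
```

Keeping the edges in plain lists makes it easy to swap in whichever thresholds a study actually adopts. The resulting categories would still need encoding (for example one-hot or ordinal) before being passed to a gradient boosting model that does not handle strings natively.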
Pages: 1222-1233
Copyright (c) 2025 Daniswara Tegar Ramadhan, Feri Agustina

This work is licensed under a Creative Commons Attribution 4.0 International License.