Diabetes Mellitus Prediction with Ensemble Gradient Boosting and Advanced Feature Engineering


  • Daniswara Tegar Ramadhan (*), Universitas Dian Nuswantoro, Semarang, Indonesia
  • Feri Agustina, Universitas Dian Nuswantoro, Semarang, Indonesia
  • (*) Corresponding Author
Keywords: Diabetes Prediction; Gradient Boosting; Feature Engineering; Hyperparameter Optimization; Machine Learning

Abstract

Diabetes mellitus is a metabolic disease and a global health challenge whose prevalence continues to rise. Early detection through automated prediction systems can help reduce complications and treatment costs. This study develops a diabetes mellitus prediction system using an ensemble gradient boosting approach combined with advanced feature engineering. The research dataset combines 768 Pima Indians samples with 5,000 samples from a diabetes prediction dataset, giving 5,768 data points in total, which were then balanced using the ADASYN technique. The feature engineering process transforms the 8 original features into 25 predictive features, including diabetes risk scores, BMI categories, age groups, and glucose categories. Three gradient boosting algorithms (XGBoost, LightGBM, CatBoost) and an ensemble voting classifier were optimized using the Optuna framework with the Tree-structured Parzen Estimator. Evaluation used accuracy, precision, recall, F1-score, and ROC-AUC under 5-fold cross-validation. LightGBM achieved the best performance, with 97.14% accuracy and a ROC-AUC of 0.9976, followed by CatBoost (97.14%, 0.9973) and XGBoost (96.45%, 0.9971). Feature importance analysis identified DiabetesPedigreeFunction, Pregnancies, and SmokingHistory as key predictors. The developed model can be implemented as a diabetes screening system in primary healthcare facilities.
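
The categorical features mentioned in the abstract (BMI categories, age groups, glucose categories) can be illustrated with a minimal sketch. The abstract does not state the actual bin edges or feature names, so every cutoff, label, and output key below (for example `BMICategory`) is a hypothetical choice made for illustration and not the authors' method. The BMI edges follow the commonly used adult classes (18.5, 25, 30); the glucose and age edges are assumed.

```python
from bisect import bisect_right

# Illustrative cutoffs only: the abstract does not give the paper's real bins.
BMI_EDGES = [18.5, 25.0, 30.0]
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]
GLUCOSE_EDGES = [140.0, 200.0]          # assumed mg/dL thresholds
GLUCOSE_LABELS = ["normal", "elevated", "high"]
AGE_EDGES = [30, 45, 60]                # assumed age-group boundaries
AGE_LABELS = ["under_30", "30_44", "45_59", "60_plus"]


def bin_value(value, edges, labels):
    """Map a number to the label of its half-open bin [lower, upper)."""
    # bisect_right counts how many edges are <= value, which is the bin index.
    return labels[bisect_right(edges, value)]


def engineer_features(record):
    """Return a copy of the record extended with derived categorical features."""
    out = dict(record)
    out["BMICategory"] = bin_value(record["BMI"], BMI_EDGES, BMI_LABELS)
    out["GlucoseCategory"] = bin_value(record["Glucose"], GLUCOSE_EDGES, GLUCOSE_LABELS)
    out["AgeGroup"] = bin_value(record["Age"], AGE_EDGES, AGE_LABELS)
    return out


# Hypothetical patient record using Pima-style column names.
sample = {"Glucose": 148, "BMI": 33.6, "Age": 50}
features = engineer_features(sample)
print(features)
```

Derived categories like these would normally be encoded (for example one-hot) before being passed to a gradient boosting model; the original numeric columns are kept alongside them so no information is lost.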



Article History
Submitted: 2025-07-14
Published: 2025-09-04
How to Cite
Ramadhan, D., & Agustina, F. (2025). Prediksi Diabetes Mellitus dengan Ensemble Gradient Boosting dan Advanced Feature Engineering. Building of Informatics, Technology and Science (BITS), 7(2), 1222-1233. https://doi.org/10.47065/bits.v7i2.8011
Section
Articles