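Drinking Water Potability Classification Combining the Random Forest Algorithm with Bayesian Optimization (Klasifikasi Kelayakan Air Minum Mengkombinasikan Algoritma Random Forest dengan Teknik Optimasi Bayesian)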


  • Aditya Aqil Darmawan (*), Universitas Dian Nuswantoro, Semarang, Indonesia
  • Ishak Bintang D, Universitas Dian Nuswantoro, Semarang, Indonesia
  • Yani Parti Astuti, Universitas Dian Nuswantoro, Semarang, Indonesia
  • Agus Winarno, Universitas Dian Nuswantoro, Semarang, Indonesia
  • (*) Corresponding Author
Keywords: Water Quality; Machine Learning; Random Forest; XGBoost; SMOTE; Bayesian Search CV

Abstract

Clean, safe drinking water is crucial for public health; however, pollution from industrial waste, domestic waste, and urbanization has significantly degraded water quality. Manual methods for water quality analysis, such as the Water Quality Index (WQI) and STORET, are limited in efficiency and accuracy. This study therefore proposes a machine learning-based classification system to determine drinking water potability more accurately and efficiently. It uses the Water Potability dataset from Kaggle, which consists of 3,276 samples with nine key parameters. Initial analysis showed that most features were approximately normally distributed, although some variables, such as Solids and Conductivity, were right-skewed due to extreme values. Correlation analysis revealed no significant linear relationships between the water quality parameters. Preprocessing comprised mean imputation of missing values, normalization, feature engineering, and oversampling with SMOTE to address class imbalance. Four models were evaluated: LightGBM, Random Forest, XGBoost, and CatBoost. Model optimization with Bayesian Search CV improved performance, particularly for Random Forest. The optimized Random Forest model achieved the best performance, with an accuracy of 85.38%, precision of 85.86%, recall of 85.38%, and an F1-score of 85.37%. Some misclassifications remained, especially in detecting potable water samples. Overall, the results indicate that ensemble learning methods can be used effectively to evaluate drinking water potability.
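
The abstract reports identical values for accuracy and recall (85.38%). One possible explanation, which the abstract does not confirm, is that precision, recall, and F1 were averaged across the two classes with weights proportional to class support; under that averaging, recall always equals accuracy. The sketch below illustrates this using only the Python standard library. The function name `weighted_metrics` and the label lists are hypothetical and are not taken from the study.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1.

    Assumption: per-class scores are averaged with weights equal to each
    class's share of the true labels (often called "weighted" averaging).
    """
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    labels = sorted(set(y_true) | set(y_pred))

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]

        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0

        weight = actual / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels (1 = potable, 0 = not potable); not the study's data.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

scores = weighted_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In this toy example the potable class has a lower per-class recall than the non-potable class, yet the weighted recall still matches accuracy exactly. A single averaged figure can therefore hide the kind of potable-sample misclassification the abstract mentions, which is why per-class scores or a confusion matrix are worth inspecting alongside it.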


Article History
Submitted: 2025-02-24
Published: 2025-03-24
How to Cite
Darmawan, A., D, I., Astuti, Y., & Winarno, A. (2025). Klasifikasi Kelayakan Air Minum Mengkombinasikan Algoritma Random Forest dengan Teknik Optimasi Bayesian. Building of Informatics, Technology and Science (BITS), 6(4), 2647-2658. https://doi.org/10.47065/bits.v6i4.7055