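Klasifikasi Kelayakan Air Minum Mengkombinasikan Algoritma Random Forest dengan Teknik Optimasi Bayesian (Drinking Water Potability Classification Combining the Random Forest Algorithm with Bayesian Optimization)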
Abstract
Clean, safe drinking water is crucial for public health, yet pollution from industrial waste, domestic waste, and urbanization has significantly degraded water quality. Manual methods of water quality analysis, such as the Water Quality Index (WQI) and STORET, are limited in efficiency and accuracy. This study therefore proposes a machine learning-based classification system to determine the potability of drinking water more accurately and efficiently. The research used the Water Potability dataset from Kaggle, which consists of 3,276 samples with nine key parameters. Initial analysis showed that most features were approximately normally distributed, although some variables, such as Solids and Conductivity, were right-skewed because of extreme values. Correlation analysis revealed no significant linear relationships between the water quality parameters. Preprocessing comprised mean imputation of missing values, normalization, feature engineering, and oversampling with SMOTE to address class imbalance. Four models were evaluated: LightGBM, Random Forest, XGBoost, and CatBoost. Hyperparameters were tuned with Bayesian Search CV, which improved performance, particularly for Random Forest. The optimized Random Forest model performed best, with an accuracy of 85.38%, precision of 85.86%, recall of 85.38%, and an F1-score of 85.37%. Some misclassifications remained, especially in detecting potable water samples. Overall, the results indicate that ensemble learning methods can be used effectively to evaluate drinking water potability.
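The abstract reports accuracy and recall as the same figure (85.38%). This is what support-weighted averaging produces, because weighted recall is always equal to accuracy. The abstract does not state which averaging method was used, so the sketch below is an assumption for illustration only. The function name `weighted_metrics` and the toy labels are hypothetical and not taken from the study.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1.

    Assumption: per-class scores are averaged with weights proportional to
    each class's share of the true labels ("weighted" averaging).
    """
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    labels = sorted(set(y_true) | set(y_pred))

    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0

    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]

        # Per-class scores, guarding against division by zero.
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0

        weight = actual / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up for illustration (1 = potable, 0 = not potable).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

scores = weighted_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

On these toy labels the weighted recall matches the accuracy while precision and F1 differ slightly, the same pattern as the reported figures. The per-class recall for the potable class is noticeably lower than for the other class here, which is the kind of detail a single weighted number can hide.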
Pages: 2647-2658
Copyright (c) 2025 Aditya Aqil Darmawan, Ishak Bintang D, Yani Parti Astuti, Agus Winarno

This work is licensed under a Creative Commons Attribution 4.0 International License.