The Application of The Neighborhood Cleaning Rule in Conjunction with Random Forest, K-Fold Cross-Validation, and Grid Search for Addressing Imbalanced Datasets

Laila Qadrini; Muh Hijrah; Laelatul Hikmah; Handayani Handayani

doi:10.47065/tin.v3i8.4124

Laila Qadrini * Universitas Sulawesi Barat, Sulawesi Barat, Indonesia
Muh Hijrah Universitas Sulawesi Barat, Sulawesi Barat, Indonesia
Laelatul Hikmah Institut Teknologi Statistika dan Bisnis Muhammadiyah Semarang, Semarang, Indonesia
Handayani Handayani Universitas Sulawesi Barat, Sulawesi Barat, Indonesia

(*) Corresponding Author

Keywords: NCL; BBLR; Random Forest; K-Fold; Grid Search Tuning

Abstract

Classification in data mining is the process of finding a model that describes and separates data classes so that the class of an unlabeled item can be predicted. Because classification applies to a wide range of problems, numerous methods have been developed. A common difficulty, however, is class imbalance: real-world classification datasets rarely contain an equal number of examples in each class. Imbalance is not a problem when the class distributions differ only slightly, but when the difference is large, prediction models tend to be biased toward the majority class, the minority class contributes little to the model, and predictive performance suffers. Resampling is a frequently used strategy for addressing class imbalance. The objective of this study is to apply a resampling algorithm, the Neighborhood Cleaning Rule (NCL), in combination with Random Forest, K-Fold cross-validation, and grid-search tuning. The approach was applied to a dataset of Low Birth Weight Infant (BBLR) cases in Majene Regency and to a breast cancer diagnosis dataset posted on the UCI website. NCL is a data-cleaning method that removes noisy observations from the datasets used for modeling or analysis. The resulting models achieved good F1-Score, G-Mean, Accuracy, and Sensitivity values.
