Handling Imbalanced Data Sets Using SMOTE and ADASYN to Improve Classification Performance of Ecoli Data Sets


  • Anthony Mas Halim* Telkom University, Bandung, Indonesia
  • Mahendra Dwifebri Telkom University, Bandung, Indonesia
  • Fhira Nhita Telkom University, Bandung, Indonesia
  • (*) Corresponding Author
Keywords: Imbalanced data; Random Forest; SMOTE; ADASYN

Abstract

In this digital era, machine learning is in demand by both organizations and individuals, and the ability to process data efficiently is essential. As the amount of data grows, various problems arise in machine learning; one of them is class imbalance, which is encountered increasingly often. Class imbalance is a condition in which one class dominates another, for example when the positive class has fewer instances than the negative class. The class with fewer instances is categorized as the minority class, while the class that dominates the dataset is called the majority class. Class imbalance can degrade classification performance, so handling imbalanced classes is needed to improve classification results. In this study, imbalanced data are classified with Random Forest, which yields satisfactory results, combined with SMOTE and ADASYN as oversampling methods because they are highly popular and easy to implement. The best model produced with SMOTE oversampling was obtained on the dataset with a 10% imbalance ratio (IR), reaching a balanced accuracy of 98.75%, while the best result with ADASYN oversampling was obtained on the dataset with a 13% IR, with a balanced accuracy of 99.03%.
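The oversampling idea described in the abstract can be illustrated with a minimal sketch of SMOTE's core step: each synthetic minority sample is an interpolation between a real minority sample and one of its k nearest minority neighbours. This is not the authors' code; the function name `smote_sketch` and all parameters are illustrative assumptions, and a production pipeline would typically use the `imbalanced-learn` library's `SMOTE` and `ADASYN` classes instead.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples (minimal SMOTE sketch).

    X_min : array of shape (n_minority, n_features), the minority class only.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances among minority samples; a sample is never its own neighbour.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    # Indices of the k nearest minority neighbours of each minority sample.
    nn = np.argsort(d, axis=1)[:, :k]
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                      # pick a minority sample
        j = nn[i, rng.integers(min(k, len(X_min) - 1))]   # pick one of its neighbours
        gap = rng.random()                                # interpolation factor in [0, 1]
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(new)
```

ADASYN differs mainly in how `i` is chosen: instead of sampling minority points uniformly, it weights them by how many majority-class neighbours they have, so more synthetic data is generated near the harder-to-learn decision boundary.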




Article History
Submitted: 2023-06-14
Published: 2023-06-29
How to Cite
Halim, A., Dwifebri, M., & Nhita, F. (2023). Handling Imbalanced Data Sets Using SMOTE and ADASYN to Improve Classification Performance of Ecoli Data Sets. Building of Informatics, Technology and Science (BITS), 5(1), 246−253. https://doi.org/10.47065/bits.v5i1.3647