SMOTE-Based Oversampling for Imbalanced Digital Fraud Risk Classification

Ika Nur Laily Fitriana; Fonda Leviany; Kurnia Sari Kasmiarno; Mohammad Okky Mabruri

doi:10.47065/tin.v6i11.9589

Ika Nur Laily Fitriana * Universitas Terbuka, Tangerang Selatan, Indonesia
Fonda Leviany Universitas Terbuka, Tangerang Selatan, Indonesia
Kurnia Sari Kasmiarno Universitas Terbuka, Tangerang Selatan, Indonesia
Mohammad Okky Mabruri Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia

(*) Corresponding Author

Keywords: Digital Fraud Risk; Imbalanced Classification; SMOTE; Survey-Based Prediction; Machine Learning

Abstract

Digital fraud risk among university students is an important issue, yet classifying it from survey-based indicators is complicated by class imbalance. This study examined whether the Synthetic Minority Over-sampling Technique (SMOTE) improves digital fraud risk classification among Universitas Terbuka students. It used primary survey data from 498 respondents and five predictors: financial literacy, digital financial literacy, monthly gross income, age, and job tenure. The evaluated models were Gaussian Naive Bayes, Random Forest, calibrated linear Support Vector Machine (SVM), Radial Basis Function (RBF) SVM, and XGBoost. Model performance was evaluated using the confusion matrix, accuracy, balanced accuracy, precision, recall, F1 score, ROC-AUC, PR-AUC, Matthews correlation coefficient (MCC), and Cohen's kappa. Without oversampling, some models showed higher nominal accuracy but zero recall for the High-risk class, which indicates that accuracy alone is insufficient for model selection under imbalance. In contrast, SMOTE increased recall for the High-risk class across all models and improved PR-AUC in several cases. The SMOTE-based Random Forest achieved the highest test PR-AUC (0.415), whereas the SMOTE-based RBF SVM achieved the highest recall (0.659). Diagnostic analyses of the selected SMOTE-based Random Forest provided evidence of a non-random predictive signal, although overall discriminative performance remained moderate.
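
The two ideas in the abstract, SMOTE interpolation and the gap between accuracy and recall under imbalance, can be illustrated with a minimal standard-library sketch. This is not the authors' pipeline: the function names `smote` and `recall_and_accuracy`, the toy minority points, and the 90/10 label split are all hypothetical and chosen only for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base point, excluding the point itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def recall_and_accuracy(y_true, y_pred, positive=1):
    """Return (recall for the positive class, overall accuracy)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), correct / len(y_true)


# Toy minority-class points (hypothetical, two features each).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
synthetic = smote(minority, n_new=6, k=3)

# Toy labels: 90 low-risk (0), 10 high-risk (1); the "model" always predicts 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
recall, accuracy = recall_and_accuracy(y_true, y_pred)
print(recall, accuracy)  # recall 0.0, accuracy 0.9
```

The always-majority predictor reaches 0.9 accuracy while detecting no high-risk cases, which is the failure mode the abstract describes. In practice, oversampling is usually applied to the training split only, so that test metrics reflect the original class distribution.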
