SMOTE and BERT Approaches for Handling Class Imbalance in Sentiment Analysis of the CoreTax Application on Big Data
Abstract
Coretax is a tax information system developed by the Directorate General of Taxes (DJP) to support digital, integrated tax administration, covering everything from taxpayer registration to reporting and auditing. Although it was designed to improve efficiency, transparency, and accuracy in tax management, its implementation has drawn mixed public reactions due to various technical challenges and the complexity of the annual tax reporting process. This situation highlights the need for sentiment analysis that can objectively capture public perceptions of the system’s performance. In this study, Natural Language Processing (NLP) and Machine Learning techniques were applied to analyze 3,000 tweets from Twitter (X) related to Coretax. A main issue identified in the dataset is class imbalance: positive sentiments significantly outnumber negative and neutral ones, which biases classification results. To address this issue, the Synthetic Minority Over-sampling Technique (SMOTE) was used to balance the dataset by generating synthetic samples for the minority classes. The BERT model was then employed for sentiment classification because of its strong ability to capture contextual meaning through its transformer-based architecture. Experimental results show that the BERT model achieved 77% accuracy before SMOTE was applied and 80% after, along with improvements in precision, recall, and F1-score, particularly for the minority classes. These findings demonstrate that the combination of SMOTE and BERT significantly enhances the performance of sentiment analysis in understanding public responses to Coretax. This approach can serve as a valuable reference for evaluating and improving tax digitalization policies, ensuring they are more effective, inclusive, and responsive to public needs.
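The abstract does not include the study's code. As an illustration of the core resampling idea only, the sketch below implements SMOTE's interpolation step from scratch with NumPy on toy 2-D features; all names and values are hypothetical. In a pipeline like the one described, the same interpolation would typically be applied to vectorized minority-class text features (e.g. TF-IDF vectors or BERT embeddings) before classification.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # Euclidean distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbors)
        lam = rng.random()                   # interpolation factor in [0, 1)
        # new point lies on the segment between X_min[i] and X_min[j]
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# toy minority-class feature vectors (stand-ins for document embeddings)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
X_new = smote_oversample(X_min, n_new=4, k=3)
print(X_new.shape)  # (4, 2)
```

Because each synthetic point is a convex combination of two existing minority samples, the oversampled class stays inside the region the minority data already occupies, which is what distinguishes SMOTE from simple duplication. Production code would normally use `imblearn.over_sampling.SMOTE` rather than a hand-rolled version.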
Pages: 1467-1476
Copyright (c) 2025 Meiliyani Br Ginting, Asprina Br Surbakti, Safarul Ilham, Dito Putro Utomo, Raheliya Br Ginting

This work is licensed under a Creative Commons Attribution 4.0 International License.