SMOTE and BERT Approaches for Handling Class Imbalance in Sentiment Analysis of the CoreTax Application on Big Data


  • Meiliyani Br Ginting * Institute of Technology and Business Indonesia, Medan, Indonesia
  • Asprina Br Surbakti Institute of Technology and Business Indonesia, Kabanjahe, Indonesia
  • Safarul Ilham Institute of Technology and Business Indonesia, Medan, Indonesia
  • Dito Putro Utomo Medan State Polytechnic, Medan, Indonesia
  • Raheliya Br Ginting Institute of Technology and Business Indonesia, Medan, Indonesia
  • (*) Corresponding Author
Keywords: SMOTE; BERT; Class Imbalance; Sentiment Analysis; Coretax Application

Abstract

Coretax is a tax information system developed by the Directorate General of Taxes (DJP) to support digital and integrated tax administration, from taxpayer registration through reporting and auditing. Although it was designed to improve efficiency, transparency, and accuracy in tax management, its rollout has drawn mixed public reactions because of technical problems and the complexity of the annual tax reporting process. This situation calls for a sentiment analysis that can capture public perceptions of the system's performance objectively. In this study, Natural Language Processing (NLP) and Machine Learning techniques were applied to 3,000 tweets from Twitter (X) related to Coretax. A main issue in the dataset is class imbalance: positive sentiments far outnumber negative and neutral ones, which biases classification toward the majority class. To address this, the Synthetic Minority Over-sampling Technique (SMOTE) was used to balance the dataset by generating synthetic samples for the minority classes. The BERT model was then used for sentiment classification because its transformer-based architecture captures contextual meaning well. Experimental results show that BERT reached an accuracy of 77% before SMOTE and 80% after SMOTE was applied, with accompanying gains in precision, recall, and F1-score, particularly for the minority classes. These findings indicate that combining SMOTE with BERT improves sentiment classification of public responses to Coretax. The approach can serve as a reference for evaluating and improving tax digitalization policies so that they are more effective, inclusive, and responsive to public needs.
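The oversampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which representation of the tweets was oversampled or which library was used. The sketch assumes each tweet has already been turned into a numeric vector (for example a sentence embedding) and uses hypothetical 2-D toy values. The function name `smote_oversample` and its parameters are illustrative. It follows the core SMOTE idea: pick a minority sample, pick one of its k nearest same-class neighbours, and create a new point somewhere on the segment between them.

```python
import math
import random
from collections import Counter


def smote_oversample(X, y, k=3, seed=0):
    """Grow every minority class to the majority size by interpolation."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the largest class
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if n == target or len(members) < 2:
            continue  # majority class, or too few points to interpolate
        for _ in range(target - n):
            base = rng.choice(members)
            # k nearest same-class neighbours of the base point (itself excluded)
            others = [m for m in members if m is not base]
            neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
            mate = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> mate
            X_out.append([b + gap * (m - b) for b, m in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy "embeddings": 6 positive, 3 negative, 2 neutral.
X = [[0.9, 0.8], [1.0, 0.7], [0.8, 0.9], [1.1, 0.9], [0.95, 0.75], [0.85, 0.85],
     [-0.9, -0.8], [-1.0, -0.7], [-0.8, -0.9],
     [0.0, 0.1], [0.1, -0.1]]
y = ["positive"] * 6 + ["negative"] * 3 + ["neutral"] * 2

X_bal, y_bal = smote_oversample(X, y, k=2, seed=42)
print(Counter(y_bal))  # every class now has as many samples as the majority
```

In a real pipeline the oversampling would normally be applied to the training split only, so that synthetic points never leak into the data used to report accuracy, precision, recall, and F1-score.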

Article History
Submitted: 2025-08-30
Published: 2025-09-30
How to Cite
Ginting, M., Surbakti, A., Ilham, S., Utomo, D., & Ginting, R. (2025). SMOTE and BERT Approaches for Handling Class Imbalance in Sentiment Analysis of the CoreTax Application on Big Data. Building of Informatics, Technology and Science (BITS), 7(2), 1467-1476. https://doi.org/10.47065/bits.v7i2.8310
Section
Articles