Sentiment Classification Using the Passive Aggressive Method with the BERT Language Model on a Small Dataset


  • Yazid Abdullah Subhi, Universitas Islam Negeri Sultan Syarif Kasim Riau, Pekanbaru, Indonesia
  • Surya Agustian (*), Universitas Islam Negeri Sultan Syarif Kasim Riau, Pekanbaru, Indonesia
  • Muhammad Irsyad, Universitas Islam Negeri Sultan Syarif Kasim Riau, Pekanbaru, Indonesia
  • Fitri Insani, Universitas Islam Negeri Sultan Syarif Kasim Riau, Pekanbaru, Indonesia
  • (*) Corresponding Author
Keywords: Classification; Small Dataset; BERT; Passive Aggressive

Abstract

Text classification is one of the most common tasks in natural language processing, and sentiment classification is among its most widely studied forms. Insufficient training data is a significant challenge in many text classification studies. This research aims to optimize classification performance with the Passive Aggressive (PA) algorithm when only limited training data is available. It compares a conventional text representation, TF-IDF, with modern representations: FastText word embeddings and the BERT language model. The primary dataset covers sentiment toward Kaesang Pangarep's appointment as chairman of PSI; it was gathered by crawling Twitter and labeled as positive, negative, or neutral. Two versions of the training data were used, each containing only 300 tweets, balanced across the positive, negative, and neutral classes. The data was split 80% for training and 20% for validation to search for an optimal model. External data on different issues, with pre-existing sentiment labels, was used to augment the training data. Experiments showed that BERT features, used as input to the Passive Aggressive method with hyperparameter tuning, outperformed TF-IDF features. On the test data, BERT features with Passive Aggressive achieved an F1-score of 0.52, compared with 0.42 for the conventional TF-IDF representation. Using the BERT language model therefore contributed significantly to text classification performance, particularly for the Passive Aggressive method.
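
To make the Passive Aggressive update and the role of its hyperparameter concrete, the following is a minimal sketch, not the authors' implementation. It assumes the PA-I variant, a binary label set (+1 / -1) instead of the three classes used in the study, and plain bag-of-words counts as a stand-in for TF-IDF or BERT features. The example texts, the candidate values for the aggressiveness parameter `C`, and all function names are hypothetical.

```python
def build_vocab(texts):
    """Map each distinct lowercase token to a feature index."""
    tokens = sorted({tok for t in texts for tok in t.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def featurize(text, vocab):
    """Bag-of-words counts (toy stand-in for TF-IDF or BERT features)."""
    vec = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pa(X, y, C=1.0, epochs=1):
    """Binary PA-I training; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            loss = max(0.0, 1.0 - label * dot(w, x))  # hinge loss
            norm_sq = dot(x, x)
            if loss > 0.0 and norm_sq > 0.0:
                # Passive when the margin is met, aggressive otherwise;
                # C caps the step size (PA-I).
                tau = min(C, loss / norm_sq)
                w = [wi + tau * label * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if dot(w, x) >= 0.0 else -1

def accuracy(w, X, y):
    correct = sum(1 for x, label in zip(X, y) if predict(w, x) == label)
    return correct / len(y)

# Hypothetical toy data, not taken from the study's dataset.
train = [
    ("good great leader", 1), ("great choice", 1),
    ("good move", 1), ("love this", 1),
    ("bad decision", -1), ("terrible choice", -1),
    ("bad move", -1), ("hate this", -1),
]
val = [("great move", 1), ("terrible decision", -1)]

vocab = build_vocab([t for t, _ in train])
X_train = [featurize(t, vocab) for t, _ in train]
y_train = [label for _, label in train]
X_val = [featurize(t, vocab) for t, _ in val]
y_val = [label for _, label in val]

# Small grid over C, scored on the held-out validation examples.
best_c, best_acc = None, -1.0
for c in (0.01, 0.1, 1.0):
    acc = accuracy(train_pa(X_train, y_train, C=c), X_val, y_val)
    if acc > best_acc:
        best_c, best_acc = c, acc

w = train_pa(X_train, y_train, C=1.0)
print("best C:", best_c, "validation accuracy:", best_acc)
```

A three-class setting such as the one in this study could be handled by training one such binary model per class (one-vs-rest) or by a multiclass variant of the update, and a macro-averaged F1-score would be a more informative selection metric than accuracy; both are left out here for brevity.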


Article History
Submitted: 2024-12-03
Published: 2024-12-25
How to Cite
Subhi, Y., Agustian, S., Irsyad, M., & Insani, F. (2024). Klasifikasi Sentimen Menggunakan Metode Passive Aggressive dengan Menggunakan Model Bahasa BERT pada Dataset Kecil. Building of Informatics, Technology and Science (BITS), 6(3), 1838-1847. https://doi.org/10.47065/bits.v6i3.6389