Indonesian Spam Classification with IndoBERT and XLM-RoBERTa: An Evaluation of Pooling, Stride, and Late Fusion


  • Darmono Darmono * Universitas Amikom Purwokerto, Banyumas, Indonesia
  • Rujianto Eko Saputro Universitas Amikom Purwokerto, Banyumas, Indonesia
  • Azhari Shouni Barkah Universitas Amikom Purwokerto, Banyumas, Indonesia
  • (*) Corresponding Author
Keywords: Spam Detection; IndoBERT; XLM-RoBERTa; Indonesian; Truncation; Chunking; Mean Pooling

Abstract

Spam detection for Indonesian short messages such as SMS and email remains challenging due to lexical variation, character obfuscation, and class imbalance. This study systematically evaluates configurations for Indonesian spam filtering to identify the best balance between accuracy and efficiency. We compare two pretrained backbones (IndoBERT and XLM-RoBERTa) along with representation strategies (truncation versus chunking), summarization schemes (pooling), and feature-fusion approaches. The system follows a feature-based design with an emphasis on simplicity, and is assessed using macro-F1, spam-class recall, AUPRC (Area Under the Precision-Recall Curve), and efficiency metrics covering embedding build time and training latency. Results indicate that IndoBERT achieves superior binary classification performance with high efficiency, while XLM-RoBERTa slightly outperforms on AUPRC, making it more suitable for risk-ranking scenarios. Truncation combined with mean pooling consistently yields stable results. Although late fusion provides only marginal improvements, it remains relevant because it highlights the potential of domain-specific signals to improve robustness under heavy obfuscation. The final recommendation for production is IndoBERT with truncation, mean pooling, and embeddings only. Limitations include the focus on short messages and the lack of evaluation under extreme obfuscation. Future work should explore character-level augmentation, cross-domain evaluation, and cost-sensitive threshold tuning.
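The recommended configuration (truncation, mask-aware mean pooling, with late fusion as an optional extension) can be sketched in a backbone-agnostic way. The snippet below is a minimal illustration in NumPy: in practice `token_embeddings` would be the last hidden state produced by IndoBERT or XLM-RoBERTa after tokenizing with truncation, and `obfuscation_features` is a hypothetical example of the kind of character-level signal the paper suggests fusing; neither is taken from the authors' implementation.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mask-aware mean pooling: average only real (non-padding) token vectors."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid div by zero
    return summed / counts

def obfuscation_features(text: str) -> np.ndarray:
    """Hypothetical character-level signals for late fusion: digit and symbol ratios."""
    n = max(len(text), 1)
    digit_ratio = sum(c.isdigit() for c in text) / n
    symbol_ratio = sum((not c.isalnum()) and (not c.isspace()) for c in text) / n
    return np.array([digit_ratio, symbol_ratio])

# Toy batch: 1 message, 4 token positions (last 2 are padding), hidden size 3.
emb = np.array([[[1., 2., 3.],
                 [3., 4., 5.],
                 [0., 0., 0.],
                 [0., 0., 0.]]])
mask = np.array([[1, 1, 0, 0]])
pooled = mean_pool(emb, mask)  # -> [[2. 3. 4.]], padding excluded from the average

# Late fusion: concatenate the pooled embedding with hand-crafted features
# before passing the result to a lightweight classifier.
fused = np.concatenate([pooled[0], obfuscation_features("GRATIS!! pulsa 50rb, klik b1t.ly")])
print(pooled.shape, fused.shape)
```

The mask-aware average matters because padded positions would otherwise drag the pooled vector toward zero, which is one reason mean pooling over truncated inputs tends to be stable.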




Article History
Submitted: 2025-07-18
Published: 2025-09-30
How to Cite
Darmono, D., Saputro, R. E., & Barkah, A. S. (2025). Klasifikasi Spam Bahasa Indonesia dengan IndoBERT dan XLM-RoBERTa: Evaluasi Pooling, Stride, dan Late-Fusion. Building of Informatics, Technology and Science (BITS), 7(2), 1456-1466. https://doi.org/10.47065/bits.v7i2.8034
Section
Articles