Implementation of TF-IDF N-Gram and the Nearest Centroid Algorithm for Thesis Topic Classification


  • Rohima Choirul Hana Universitas Dian Nuswantoro, Semarang, Indonesia
  • Defri Kurniawan * Universitas Dian Nuswantoro, Semarang, Indonesia
  • (*) Corresponding Author
Keywords: TF-IDF; Nearest Centroid; Thesis Title Classification; Cosine Similarity; Streamlit

Abstract

This study presents a lightweight, explainable workflow for curating undergraduate thesis titles in the Informatics Engineering Study Program. It combines TF-IDF n-gram features (unigrams and bigrams) with a cosine-based Nearest Centroid classifier. Titles are grouped into three internal research-area classes (RPLD, SC, and SKKKD) to support topic grouping and supervisor assignment. The approach is implemented as a Streamlit web application that supports Excel upload with preview and persistent saving, column standardization, text normalization, rejection of duplicates based on normalized titles, rapid training on labeled data, topic prediction for new titles, and retrieval of the most similar titles to assist curation. A key operational contribution is the direct link from each predicted class to the program-maintained list of lecturers for that area, which lets students identify suitable supervisors and helps coordinators run a consistent, auditable workflow. On a multi-semester corpus of 1,057 titles, stratified 5-fold cross-validation achieved an average accuracy of 92.43%, a Macro F1 of 0.875, a Micro F1 of 0.924, and a Weighted F1 of 0.925, indicating a balance between accuracy, efficiency, and interpretability for short text. Decisions can be inspected through class-specific top terms and nearest-neighbor title lists. The main limitations stem from the minority class; future work will therefore expand the labeled corpus, add character-level n-grams, and explore lightweight hybrid representations.




Article History
Submitted: 2025-12-05
Published: 2025-12-26
How to Cite
Hana, R., & Kurniawan, D. (2025). Implemetasi TF-IDF N-Gram dan Algoritma Nearest Centroid untuk Klasifikasi Topik Tugas Akhir. Building of Informatics, Technology and Science (BITS), 7(3), 1963-1973. https://doi.org/10.47065/bits.v7i3.8859