Implementation of TF-IDF N-Gram and the Nearest Centroid Algorithm for Undergraduate Thesis Topic Classification
Abstract
This study presents a lightweight, explainable workflow for curating undergraduate thesis titles in the Informatics Engineering Study Program by combining TF-IDF n-gram (1–2) features with a cosine-based Nearest Centroid classifier. Titles are grouped into three internal research-area classes (RPLD, SC, and SKKKD) to support topic grouping and supervisor assignment. The approach is implemented as a Streamlit web application that supports Excel upload with preview and persistent saving, column standardization, text normalization, rejection of duplicates based on normalized titles, rapid training on labeled data, topic prediction for new titles, and retrieval of the most similar titles to assist curation. A key operational contribution is the direct link from each predicted class to the program-maintained lecturer list for that area, which lets students identify suitable supervisors and helps coordinators run a consistent, auditable workflow. On a multi-semester corpus of 1,057 titles, stratified 5-fold cross-validation achieved an average accuracy of 92.43%, a Macro F1 of 0.875, a Micro F1 of 0.924, and a Weighted F1 of 0.925, indicating a balance between accuracy, efficiency, and interpretability for short text. Decisions can be inspected through class-specific top terms and nearest-neighbor title lists. Limitations stem mainly from the minority class; future work will therefore expand the labeled corpus, add character-level n-grams, and explore lightweight hybrid representations.
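The core pipeline described in the abstract (unigram and bigram TF-IDF features, one centroid per class, cosine scoring, nearest-title retrieval, and duplicate rejection on normalized titles) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the class name `NearestCentroidTfidf`, the helper functions, the smoothed IDF formula, the ASCII-only normalization, and the toy titles with their label assignments are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def normalize(title):
    """Lowercase and keep only ASCII letters/digits, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def ngrams(title):
    """Word unigrams plus bigrams of the normalized title."""
    tokens = normalize(title).split()
    return tokens + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def l2_normalize(vec):
    """Scale a sparse vector (dict) to unit length; zero vectors pass through."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def cosine(u, v):
    """Cosine similarity of two sparse vectors that are already unit length."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


class NearestCentroidTfidf:
    """Hypothetical helper: TF-IDF (1-2 grams) with cosine nearest centroid."""

    def fit(self, titles, labels):
        docs = [Counter(ngrams(t)) for t in titles]
        df = Counter(term for doc in docs for term in doc)  # document frequency
        n = len(docs)
        # Smoothed IDF (an assumed variant; the paper does not state its formula).
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.titles = list(titles)
        self.vectors = [self._weigh(doc) for doc in docs]
        sums = defaultdict(lambda: defaultdict(float))
        for vec, label in zip(self.vectors, labels):
            for term, w in vec.items():
                sums[label][term] += w
        # Sum and mean differ only by scale, which unit-normalization removes.
        self.centroids = {label: l2_normalize(s) for label, s in sums.items()}
        return self

    def _weigh(self, counts):
        # Terms unseen during training are dropped.
        return l2_normalize(
            {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        )

    def predict(self, title):
        """Return the label of the most similar centroid and all class scores."""
        q = self._weigh(Counter(ngrams(title)))
        scores = {label: cosine(q, c) for label, c in self.centroids.items()}
        return max(scores, key=scores.get), scores

    def most_similar(self, title, k=3):
        """Return the k training titles closest to the query by cosine."""
        q = self._weigh(Counter(ngrams(title)))
        scored = [(t, cosine(q, v)) for t, v in zip(self.titles, self.vectors)]
        return sorted(scored, key=lambda tv: tv[1], reverse=True)[:k]

    def is_duplicate(self, title):
        """True if the normalized title already exists in the training set."""
        return normalize(title) in {normalize(t) for t in self.titles}


# Toy data: invented titles; the label assignments are purely illustrative.
TITLES = [
    "Sistem Informasi Akademik Berbasis Web",
    "Aplikasi Mobile Pemesanan Tiket Berbasis Android",
    "Klasifikasi Citra Daun Menggunakan CNN",
    "Prediksi Harga Saham Menggunakan LSTM",
    "Analisis Keamanan Jaringan Wireless",
    "Deteksi Serangan Jaringan Menggunakan IDS",
]
LABELS = ["RPLD", "RPLD", "SC", "SC", "SKKKD", "SKKKD"]

model = NearestCentroidTfidf().fit(TITLES, LABELS)
query = "Klasifikasi Citra Buah Menggunakan CNN"
print(model.predict(query))
print(model.most_similar(query, k=2))
```

Because every vector is unit length, the cosine score reduces to a sparse dot product, and the per-class scores returned by `predict` can be shown directly to a coordinator alongside the nearest titles. The highest-weighted terms in each centroid would serve as one plausible source for the class-specific top terms mentioned above.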
Pages: 1963-1973
Copyright (c) 2025 Rohima Choirul Hana, Defri Kurniawan

This work is licensed under a Creative Commons Attribution 4.0 International License.