Optimizing the BioBERT Model for Entity Recognition in Medical Texts with Conditional Random Fields (CRF)
Abstract
This research evaluates the performance of several models on the Named Entity Recognition (NER) task for medical entities, with a focus on imbalanced datasets. Six BioBERT configurations were tested, incorporating optimization techniques such as Class Weighting, Conditional Random Fields (CRF), and Hyperparameter Tuning. Evaluation used Precision, Recall, and F1-Score, metrics that are particularly relevant for NER under class imbalance. The dataset is BC5CDR, which targets chemical and disease entities in unstructured medical texts from PubMed. The data was divided into three parts: a training set for model training, a validation set for model tuning, and a test set for performance evaluation. The dataset was split evenly to keep model testing unbiased, yielding more reliable results that can serve as a reference for developing more efficient medical NER systems. The evaluation indicates that BioBERT + CRF achieves an F1-Score reflecting the best balance between Precision (ranked 3rd: 0.6067 for B-Chemical, 0.5594 for B-Disease, 0.4600 for I-Disease, and 0.5083 for I-Chemical) and Recall (ranked 3rd: 0.5580 for B-Chemical, 0.4491 for B-Disease, 0.5718 for I-Disease, and 0.3840 for I-Chemical) among the models compared. This model detects medical entities more accurately without sacrificing prediction precision, and the smaller gap between its Precision and Recall also makes it more stable, making it the best choice for NER on medical texts. Applying early stopping effectively prevented overfitting, allowing the model to learn optimally without losing generalization. With its better balance in recognizing medical entities from unstructured text, this model offers the most effective approach for NER systems in the medical domain.
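To illustrate the architecture summarized above, the sketch below stacks a linear emission layer and a CRF decoder on top of a BioBERT encoder for the BC5CDR BIO tag set. This is a minimal sketch, not the authors' implementation: it assumes the Hugging Face transformers library, the pytorch-crf package, and the dmis-lab/biobert-base-cased-v1.1 checkpoint, and the tag set and dropout value are illustrative choices rather than the paper's reported configuration.

    import torch.nn as nn
    from transformers import AutoModel   # Hugging Face transformers (assumed dependency)
    from torchcrf import CRF             # pytorch-crf package (assumed dependency)

    # Illustrative BC5CDR BIO tag set, matching the entity labels in the abstract.
    TAGS = ["O", "B-Chemical", "I-Chemical", "B-Disease", "I-Disease"]

    class BioBertCrfTagger(nn.Module):
        """BioBERT encoder + linear emission layer + CRF decoder (illustrative sketch)."""

        def __init__(self, encoder_name="dmis-lab/biobert-base-cased-v1.1",
                     num_tags=len(TAGS), dropout=0.1):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(encoder_name)
            self.dropout = nn.Dropout(dropout)
            self.emission = nn.Linear(self.encoder.config.hidden_size, num_tags)
            self.crf = CRF(num_tags, batch_first=True)  # learns tag-transition scores

        def forward(self, input_ids, attention_mask, labels=None):
            hidden = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            emissions = self.emission(self.dropout(hidden))  # per-token tag scores
            mask = attention_mask.bool()
            if labels is not None:
                # Padded label positions must hold a valid tag index (e.g. 0 for "O");
                # the mask keeps them out of the CRF log-likelihood.
                nll = -self.crf(emissions, labels, mask=mask, reduction="mean")
                return nll                                   # training loss
            return self.crf.decode(emissions, mask=mask)     # Viterbi-decoded tag ids

The CRF layer scores transitions between adjacent tags, so invalid sequences (for example, an I-Chemical token immediately following B-Disease) are penalized during Viterbi decoding, which is consistent with the improved Precision-Recall balance the evaluation above attributes to BioBERT + CRF.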
Pages: 2525-2534
Copyright (c) 2025 Cynthia Dwi Nafanda, Abu Salam

This work is licensed under a Creative Commons Attribution 4.0 International License.