Supervised Learning Approaches for Nested People Entity Extraction in Indonesian Translated Quran


  • Dimitri Irfan Dzidny * Telkom University, Bandung, Indonesia
  • Moch Arif Bijaksana Telkom University, Bandung, Indonesia
  • Kemas Muslim Lhaksmana Telkom University, Bandung, Indonesia
  • (*) Corresponding Author
Keywords: Quran; People Entity Extraction; Supervised Learning; Comparative Analysis; Nested Entities

Abstract

Because the Quran is the primary holy book for Muslims, information extraction on Quranic text, particularly People Entity Extraction, is an important task for deepening understanding of the Quran and Tafseer. The challenge in extracting people entities from Quranic text is that many verses have a complex structure, such as nested entities, which makes it crucial to build a system that can extract entities automatically, accurately, and quickly. People Entity Extraction on the Quran aims to extract the people entities in a sentence or verse, such as the name of a person or the name of a group. For example, given a snippet of Surah Al-Baqarah verse 46, “Those who believe that they will meet their Lord and that they will return to him”, the system is expected to identify the people entity “Those who believe that they will meet their Lord”. People Entity Extraction for the Quran has not yet been widely studied; only a few algorithms have been tried, with scattered results. In this research, we use several supervised models: Conditional Random Field (CRF), BiLSTM-CRF, and a pre-trained deep learning model based on the IndoBERT transformer. We apply these models and compare their performance. We found that the deep learning model BiLSTM-CRF performs best at extracting people entities, whereas the probabilistic model CRF had difficulty extracting people entities, specifically nested ones.
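The Al-Baqarah example above can be sketched as a sequence labeling problem. The snippet below is a minimal illustration, not the authors' implementation: it assumes a single-layer BIO scheme with a hypothetical `PER` label (the abstract does not state the tagging scheme), assumes whitespace tokenization, and uses a hand-written tag sequence in place of a model's predictions. The helper name `extract_people_spans` is likewise made up for this sketch.

```python
from typing import List


def extract_people_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Join consecutive B-PER / I-PER tokens into people entity strings."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-PER":
            # A new entity starts; close any entity still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-PER" and current:
            # Continue the entity that is currently open.
            current.append(token)
        else:
            # "O" (or a stray I-PER with no open entity) ends the span.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Snippet quoted in the abstract (Surah Al-Baqarah verse 46).
tokens = (
    "Those who believe that they will meet their Lord "
    "and that they will return to him"
).split()

# Assumed gold tags: the first nine tokens form one people entity.
tags = ["B-PER"] + ["I-PER"] * 8 + ["O"] * 7

people = extract_people_spans(tokens, tags)
print(people)
```

A single tag per token can mark only one span at each position, so a nested entity inside this span would need some extra mechanism, such as an additional tag layer. That is one possible design and is not necessarily the one used in the paper.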



Article History
Submitted: 2022-06-24
Published: 2022-06-30
How to Cite
Dzidny, D., Bijaksana, M., & Lhaksmana, K. (2022). Supervised Learning Approaches for Nested People Entity Extraction in Indonesian Translated Quran. Building of Informatics, Technology and Science (BITS), 4(1), 241–246. https://doi.org/10.47065/bits.v4i1.1758