Supervised Learning Approaches for Nested People Entity Extraction in Indonesian Translated Quran
Abstract
Since the Quran is the primary holy book for Muslims, information extraction research on Quranic texts, especially in the form of People Entity Extraction, is an important task for further understanding of the Quran and Tafseer. The challenge in extracting people entities from the Quranic text is that many verses have a complex structure, such as nested entities, making it crucial to build a system that can extract entities automatically, accurately, and quickly. People Entity Extraction on the Quran is a task that aims to extract people entities, such as the name of a person or the name of a group, from a sentence or verse of the Quranic text. For example, given a snippet of Surah Al-Baqarah verse 46, which reads “Those who believe that they will meet their Lord and that they will return to him”, the people entity extraction system is expected to identify the people entity “Those who believe that they will meet their Lord”. Currently, People Entity Extraction research for the Quran has not been widely carried out; only a few algorithms have been tried, with scattered results. In this research, we use several supervised models: Conditional Random Field (CRF), BiLSTM-CRF, and a pre-trained deep learning model based on the IndoBERT transformer. We apply these models and perform a comparative analysis of their performance. We found that the deep learning based model, BiLSTM-CRF, performs best at extracting people entities, whilst the probabilistic model, CRF, had difficulty extracting people entities, specifically nested people entities.
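The input/output example in the abstract can be made concrete as a token-level labeling problem, which is the usual way sequence models such as CRF and BiLSTM-CRF are trained. The sketch below is only an illustration under stated assumptions: the BIO tagging scheme, the whitespace tokenisation, the `PEOPLE` label name, and the helper functions `bio_encode` and `bio_decode` are hypothetical choices and are not taken from the abstract.

```python
def bio_encode(tokens, spans, label="PEOPLE"):
    """Tag tokens with B-/I-/O labels for (start, end) token spans, end exclusive.

    The label name "PEOPLE" is an assumption made for this illustration.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_decode(tokens, tags):
    """Recover entity strings from a BIO tag sequence."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


# Verse snippet quoted in the abstract (Surah Al-Baqarah verse 46).
verse = ("Those who believe that they will meet their Lord "
         "and that they will return to him")
tokens = verse.split()  # naive whitespace tokenisation (assumption)

# The expected entity covers the first nine tokens.
tags = bio_encode(tokens, [(0, 9)])
entities = bio_decode(tokens, tags)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
print(entities)
```

A single tag sequence like this can only represent one layer of entities. Handling nested entities generally requires something extra, for example one tag sequence per nesting level; how the paper itself encodes nesting is not stated in the abstract.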
Pages: 241−246
Copyright (c) 2022 Dimitri Irfan Dzidny, Moch Arif Bijaksana, Kemas Muslim Lhaksmana

This work is licensed under a Creative Commons Attribution 4.0 International License.