
G.S. Ivanova¹, P.A. Martynyuk²
¹, ² Bauman Moscow State Technical University (Moscow, Russia)
¹ gsivanova@bmstu.ru, ² martynyuk.pa@bmstu.ru
In today's information society, the volume of textual data is constantly growing, creating a need for effective methods of processing it and extracting useful information. Automating this process is crucial for improving the speed and accuracy of text data processing. Current approaches include rule-based methods as well as more recent methods based on machine learning and deep learning algorithms. Despite their diversity and effectiveness, each of these approaches has its own limitations and areas of application. This article analyzes the main classes of information extraction methods and surveys real text processing systems that implement these methods in practice.
The purpose of this work is to study the main tasks and methods of information extraction from text used in document analysis systems, and to analyze existing systems that implement these methods and their combinations. The analysis aims to identify the advantages and limitations of each class of methods based on data from real software systems.
The work considers the main information extraction tasks (named entity recognition and relation extraction) and the classes of methods that address them: rule-based methods, machine learning methods, and deep learning methods. The advantages and disadvantages of each class are identified. The article also discusses examples of real systems, both those implementing methods of a single class and hybrid systems that combine methods from different classes. Based on the analysis, the article identifies the main problems in the field of unstructured text document analysis and suggests ways to overcome the weaknesses of particular classes of information extraction methods.
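As a minimal illustration of the contrast between the rule-based and learning-based classes discussed in the article (this sketch is not from the article itself), the following Python snippet extracts information from one sentence first with a hand-written pattern and then with a pretrained NER pipeline. The example sentence, the regular expression, and the use of spaCy with the en_core_web_sm model are assumptions made purely for the sake of the example.

```python
# Illustrative sketch only: contrasts a rule-based extractor with a pretrained
# model-based NER pipeline. Assumes spaCy and the en_core_web_sm model are installed.
import re

import spacy

text = "G.S. Ivanova works at Bauman Moscow State Technical University in Moscow."

# Rule-based extraction: a hand-written pattern that treats sequences of
# capitalized words as candidate entity mentions (a deliberately naive rule).
rule = re.compile(r"\b(?:[A-Z][a-z]+\s){1,4}[A-Z][a-z]+\b")
print("Rule-based candidates:", [m.group(0) for m in rule.finditer(text)])

# Model-based extraction: a pretrained pipeline labels spans with entity types
# (PERSON, ORG, GPE, ...) without any hand-written patterns.
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Model-based entities:", [(ent.text, ent.label_) for ent in doc.ents])
```

The rule captures only the surface pattern it was written for, while the pretrained model assigns entity types learned from annotated data; this difference in generality versus controllability is the trade-off the surveyed method classes navigate.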
The research is of practical value for developers of text processing systems and for analysts working with large volumes of information. The overview of the various approaches to information extraction gives specialists a clear understanding of each of them and allows them to evaluate the prospects of applying these methods, using real text processing systems as examples. Moreover, given the rapid development of technology and growing data volumes, the study provides relevant information that helps adapt existing systems and processes to modern requirements and challenges.
Ivanova G.S., Martynyuk P.A. Analysis of systems of information extraction from unstructured text documents. Neurocomputers. 2025. V. 27. № 1. P. 5–27. DOI: https://doi.org/10.18127/j19998554-202501-01 (in Russian)