G.S. Ivanova¹, P.A. Martynyuk²
¹,² Bauman Moscow State Technical University (Moscow, Russia)
Problem setting. As the volume of text information grows, so does the need for systems that process text data automatically or semi-automatically. There are several basic approaches to extracting information from texts: classical approaches based on extraction rules or on probability and statistics, and a fundamentally newer approach based on neural network models. This article analyzes these approaches to extracting information from natural language texts.
Objective. To analyze methods for extracting information from text data and to determine the specifics, advantages, and disadvantages of each approach.
Results. For each approach analyzed, the main ideas and concepts of information extraction are outlined and existing implementations are presented. The strengths and weaknesses of each approach are described. The article proposes combining approaches when building natural language processing systems, so that they compensate for one another's shortcomings and improve the quality of information extraction.
Practical significance. The results of the analysis can be useful to developers of text data processing systems. The article summarizes the basic information about each approach considered, which can help specialists choose a model (or models) for information extraction.
Ivanova G.S., Martynyuk P.A. Analysis of methods for extracting information from text data. Neurocomputers. 2022. V. 24. № 3. P. 18-28. DOI: https://doi.org/10.18127/j19998554-202203-02 (in Russian)