System of information extraction from text for metagraph knowledge base

Journal: Dynamics of Complex Systems - XXI century, No. 3, 2020

DOI: 10.18127/j19997493-202003-0 -

UDC: 004.912

Keywords: natural language processing, semantic search, hybrid intelligence system, metagraphs

Authors:

V.I. Terekhov – Ph.D. (Eng.), Associate Professor,

Bauman Moscow State Technical University (Moscow, Russia)

E-mail: terekchow@bmstu.ru

A.I. Kanev – Post-graduate Student,

Bauman Moscow State Technical University (Moscow, Russia)

E-mail: kanevai@student.bmstu.ru

Abstract:

Existing information retrieval systems cannot answer all user queries successfully, so machine learning and knowledge graphs are used to improve search results. Both approaches have limitations. Machine learning requires significant computing power and large data sets, and applying an already trained system to a new subject area often runs into a shortage of training data. Rule-based systems, in turn, suffer from data inconsistency and looping problems, and creating a complete set of rules takes considerable effort.

To overcome these problems, it is proposed to use hybrid intelligent systems, one example of which is metagraph systems. Metagraphs combine knowledge processing techniques and soft computing. For knowledge representation, metavertices correspond to concepts and metaedges correspond to the relations between them. Both metavertices and metaedges have weights. Several metavertices and metaedges can be combined into a fragment of a metagraph, and this fragment can be assigned to another metavertex or metaedge to represent hierarchical knowledge.
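The structure described above can be illustrated with a minimal Python sketch. The class names `Metavertex` and `Metaedge`, their fields, and the `depth` helper are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Metavertex:
    """A weighted concept that may contain a nested metagraph fragment."""
    name: str
    weight: float = 1.0
    vertices: list[Metavertex] = field(default_factory=list)
    edges: list[Metaedge] = field(default_factory=list)

    def depth(self) -> int:
        """Number of nesting levels, counting this metavertex itself."""
        return 1 + max((v.depth() for v in self.vertices), default=0)


@dataclass
class Metaedge:
    """A weighted relation between two metavertices."""
    name: str
    source: Metavertex
    target: Metavertex
    weight: float = 1.0


# Two concepts and the relation between them (illustrative data only).
cat = Metavertex("cat", weight=0.9)
mouse = Metavertex("mouse", weight=0.8)
chases = Metaedge("chases", cat, mouse, weight=0.7)

# The fragment is assigned to a higher-level metavertex,
# which gives a two-level hierarchy.
hunting = Metavertex("hunting", vertices=[cat, mouse], edges=[chases])
```

Here `hunting` holds the whole fragment, so hierarchical knowledge is expressed by nesting rather than by a separate relation type. Assigning a fragment to a metaedge could be sketched in the same way.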

Metagraphs can be used not only for information retrieval but also for text classification, machine translation, and the processing of other types of information, such as images. However, filling such a system with data extracted from text remains an important unsolved problem.

To build such a system, it is proposed to use a natural language processing pipeline consisting of tokenization, morphological analysis, syntactic analysis, and context analysis. A free library is used for morphological analysis. Parsing is implemented using a context-free constituent grammar. The context analyzer has a memory and a set of processing rules for filling it.
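The four stages can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' system. The English lexicon `LEXICON` stands in for the morphological library, `GRAMMAR` is a hypothetical three-rule grammar, CYK is just one possible parsing algorithm for a context-free grammar, and the single rule in `ContextAnalyzer` is invented for the example.

```python
import re

# Hypothetical toy lexicon standing in for a real morphological library.
LEXICON = {"the": "DET", "cat": "NOUN", "mouse": "NOUN", "chases": "VERB"}

# Toy context-free constituent grammar in Chomsky normal form:
# (left child label, right child label) -> parent label.
GRAMMAR = {("DET", "NOUN"): "NP", ("VERB", "NP"): "VP", ("NP", "VP"): "S"}


def tokenize(text):
    """Stage 1: split the text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def tag(tokens):
    """Stage 2: attach a part-of-speech label to every token."""
    return [(token, LEXICON.get(token, "UNK")) for token in tokens]


def parse(tagged):
    """Stage 3: CYK parsing; returns a tree labelled S or None."""
    n = len(tagged)
    chart = {}  # (start, end) -> {label: tree} for tokens[start:end]
    for i, (word, pos) in enumerate(tagged):
        chart[(i, i + 1)] = {pos: (pos, word)}
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            end = start + length
            cell = {}
            for mid in range(start + 1, end):
                for l_label, l_tree in chart[(start, mid)].items():
                    for r_label, r_tree in chart[(mid, end)].items():
                        parent = GRAMMAR.get((l_label, r_label))
                        if parent and parent not in cell:
                            cell[parent] = (parent, l_tree, r_tree)
            chart[(start, end)] = cell
    return chart.get((0, n), {}).get("S")


def head_noun(np_tree):
    """In this toy grammar NP -> DET NOUN, so the noun is the right child."""
    return np_tree[2][1]


class ContextAnalyzer:
    """Stage 4: keeps a memory and fills it using a processing rule."""

    def __init__(self):
        self.memory = []

    def process(self, tree):
        # Illustrative rule: S -> NP VP with VP -> VERB NP
        # yields a (subject, relation, object) triple.
        if tree is None or tree[0] != "S":
            return
        _, subject_np, vp = tree
        _, verb, object_np = vp
        self.memory.append((head_noun(subject_np), verb[1], head_noun(object_np)))


analyzer = ContextAnalyzer()
analyzer.process(parse(tag(tokenize("The cat chases the mouse"))))
```

Each triple stored in `analyzer.memory` could then be mapped to two metavertices and a metaedge in the knowledge base. How the memory is actually transferred to the metagraph is not specified in the abstract.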

The purpose of this work is to develop and experimentally evaluate a system that extracts information from text to fill the metagraph knowledge base.

The text analysis system was evaluated on texts from Open Corpora, and recommendations for its further development were formulated. To analyze the capabilities and limitations of the entire system, the morphological analysis library Russian Morphology was studied.

An analytical estimate of the execution time of natural language processing was derived from an analysis of its constituent stages, and it agreed with the experimental data. The authors describe the features of data processing and the difficulties encountered in implementing the system.
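A stage-by-stage comparison of this kind needs per-stage measurements. The sketch below shows one way to collect them in Python; the helper `time_stages` and the two placeholder stages are assumptions for illustration and do not reproduce the authors' measurements or cost model.

```python
import time


def time_stages(stages, data):
    """Run stages in order, feeding each output to the next stage.

    Returns the final output and a dict of per-stage wall-clock seconds.
    """
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Placeholder stages; a real pipeline would plug in its own functions.
stages = [
    ("tokenization", str.split),
    ("morphology", lambda tokens: [(t, len(t)) for t in tokens]),
]

result, timings = time_stages(stages, "a short example sentence")
total = sum(timings.values())  # total pipeline time as the sum of its stages
```

The measured `total` can be set against an analytical estimate built from the same stages to check whether the two agree.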

Pages: 57-66

Date of receipt: 03.08.2020