E.V. Gordeeva¹, R.A. Kochkarov², A.A. Rylov³
¹⁻³ Financial University under the Government of the Russian Federation (Moscow, Russia)
Problem. Recognizing the topic of a text is a pressing task in natural language processing, and it relies on basic data preprocessing methods. The task matters in many areas of human activity that involve processing textual information.
Goal. To choose a machine learning algorithm and optimize the model through hyperparameter selection for text topic recognition tasks.
Results. Data preprocessing for the analysis of textual information was carried out, and the machine learning model best suited to text topic recognition was selected. Several machine learning algorithms are considered: naive Bayesian classification, the k-nearest neighbors method, augmented naive Bayesian classification, and a decision tree classifier. The use of GridSearchCV to select the optimal hyperparameters of the model is proposed.
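The GridSearchCV step can be illustrated with a minimal sketch. This abstract does not state the authors' dataset, feature representation, or parameter grid, so everything below is an assumption made for illustration: a toy two-topic corpus, TF-IDF features, a multinomial naive Bayes classifier, and a small grid over the n-gram range and the smoothing parameter `alpha`, all using scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical, for illustration only; not the authors' data).
texts = [
    "the team won the football match",
    "the player scored a goal in the game",
    "the coach praised the team after the match",
    "fans cheered as the striker scored",
    "the bank raised interest rates",
    "stock markets fell as investors sold shares",
    "the central bank cut rates to support markets",
    "investors bought shares after the earnings report",
]
labels = ["sport"] * 4 + ["finance"] * 4

# Preprocessing (lowercasing, stop-word removal, TF-IDF weighting)
# and the classifier are chained so the search tunes both together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", MultinomialNB()),
])

# Assumed grid: step name, double underscore, parameter name.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 0.5, 1.0],
}

# Every grid combination is scored with 2-fold cross-validation;
# the best one is then refitted on the full corpus.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

best_params = search.best_params_
prediction = search.predict(["the striker scored a late goal"])[0]
print(best_params, prediction)
```

Putting the vectorizer inside the pipeline means each cross-validation fold fits TF-IDF only on its own training part, so the validation score is not inflated by vocabulary leaked from held-out texts.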
Practical significance. The proposed methods of data preprocessing and hyperparameter selection will make text topic recognition models more effective and yield more accurate results.
Gordeeva E.V., Kochkarov R.A., Rylov A.A. Analysis of the text theme recognition problem using machine learning. Neurocomputers. 2023. V. 25. № 4. P. 7–15. DOI: https://doi.org/10.18127/j19998554-202304-02 (In Russian)