Journal: Highly Available Systems, № 1, 2026
Article in issue:
Seq2Seq approach and large language models for term extraction from Russian scientific texts
Type of article: scientific article
DOI: https://doi.org/10.18127/j20729472-202601-14
UDC: 004.855
Authors:

K.K. Biderina1, D.I. Grebenkov2

1-2 Institute of Control Sciences RAS (Moscow, Russia)

1 kkbiderina@edu.hse.ru, 2 grebenkov-d-i@mail.ru

Abstract:

Problem statement. Automatic term extraction from Russian-language scientific texts is a pressing problem in computational linguistics and information retrieval. The effectiveness of large language models without additional training compared to adapted architectures remains understudied, especially for the Russian language and specialized scientific corpora.

Objective. The aim of this study is to investigate and compare two approaches to automatic term extraction from Russian-language scientific texts: a specialized neural network solution based on the T5 architecture, fine-tuned for a sequence-to-sequence task, and general-purpose large language models used without additional training.
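The sequence-to-sequence formulation mentioned above can be illustrated with a minimal sketch of how a training pair for a T5-style model might be built. The task prefix, the separator between terms, and the example sentence are assumptions for illustration, not the authors' exact scheme.

```python
# Sketch of a seq2seq training pair for term extraction.
# Prefix and separator are illustrative assumptions.

def make_seq2seq_pair(text: str, terms: list[str], sep: str = "; ") -> tuple[str, str]:
    """Build a (source, target) pair for fine-tuning a T5-style model:
    the source is the raw text with a task prefix, the target is the
    list of gold terms joined by a separator."""
    source = "extract terms: " + text
    target = sep.join(terms)
    return source, target

src, tgt = make_seq2seq_pair(
    "Neural machine translation uses attention mechanisms.",
    ["neural machine translation", "attention mechanism"],
)
print(src)  # extract terms: Neural machine translation uses attention mechanisms.
print(tgt)  # neural machine translation; attention mechanism
```

At inference time the model's generated string would be split on the same separator to recover the predicted term list.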

Results. A software pipeline and a set of models for extracting terms from abstracts and full texts of scientific publications were implemented on the basis of the CL-RuTerm3 dataset. An additional experiment evaluated large language models in a few-shot setting.
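The few-shot evaluation of large language models amounts to prompting with a handful of labeled demonstrations. A minimal sketch of such prompt construction is shown below; the instruction wording and the demonstration examples are hypothetical, not taken from the study.

```python
# Sketch of a few-shot prompt for an LLM term-extraction baseline.
# Instruction text and demonstrations are illustrative assumptions.

def build_few_shot_prompt(demos: list[tuple[str, list[str]]], query: str) -> str:
    """Assemble an instruction, k labeled demonstrations, and the query
    text into a single prompt; the model is expected to continue after
    the final 'Terms:' with its predicted term list."""
    blocks = ["Extract the domain terms from each text."]
    for text, terms in demos:
        blocks.append(f"Text: {text}\nTerms: {'; '.join(terms)}")
    blocks.append(f"Text: {query}\nTerms:")
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(
    [("Gradient descent minimizes the loss function.",
      ["gradient descent", "loss function"])],
    "Transformers rely on self-attention.",
)
print(prompt)
```

The number of demonstrations (the "shots") is the main knob in such a setup; with zero demonstrations the same function yields a zero-shot prompt.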

Practical significance. The developed specialized solution can be used for automatic and semi-automated term tagging in Russian-language scientific texts, as well as for creating and expanding terminological corpora. The results of the comparative analysis demonstrate the feasibility of using large language models as an auxiliary tool or baseline.

Pages: 71-75
For citation

Biderina K.K., Grebenkov D.I. Seq2Seq approach and large language models for term extraction from Russian scientific texts. Highly Available Systems. 2026. V. 22. № 1. P. 71−75. DOI: https://doi.org/10.18127/j20729472-202601-14 (in Russian)

References
  1. Mamontova A., Ishchenko R., Vorontsov K. RuTermEval-2024: Cross-domain Automatic Term Extraction and Classification in Russian Scientific Texts. Proceedings of the International Conference «Dialog 2025». 2025.
  2. Kageura K., Umino B. Methods of automatic term recognition: A review. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication. 1996. V. 3. № 2. P. 259–289.
  3. Pazienza M.T., Pennacchiotti M., Zanzotto F.M. Terminology extraction: an analysis of linguistic and statistical approaches. Knowledge mining: Proceedings of the NEMIS 2004 final conference. Berlin, Heidelberg: Springer. 2005. P. 255–279. https://doi.org/10.1007/3-540-32394-5_20
  4. Terryn A.R. et al. Analysing the impact of supervised machine learning on automatic term extraction: HAMLET vs TermoStat. Proceedings of the international conference on recent advances in natural language processing (RANLP 2019). 2019. P. 1012–1021.
  5. Terryn A.R. et al. Termeval 2020: Shared task on automatic term extraction using the annotated corpora for term extraction research (acter) dataset. Proceedings of the 6th International Workshop on Computational Terminology. 2020. P. 85–94.
  6. Hazem A. et al. Termeval 2020: Taln-ls2n system for automatic term extraction. International Workshop on Computational Terminology (COMPUTERM). 2020.
  7. Lang C. et al. Transforming term extraction: Transformer-based approaches to multilingual term extraction across domains. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021. P. 3607–3620.
  8. Tran H.T.H. et al. Ensembling transformers for cross-domain automatic term extraction. International Conference on Asian Digital Libraries. Springer. 2022. P. 90–100.
  9. Banerjee S., Chakravarthi B.R., McCrae J.P. Large language models for few-shot automatic term extraction. International Conference on Applications of Natural Language to Information Systems. Lecture Notes in Computer Science. V. 14762. Cham: Springer. 2024. P. 137–150. https://doi.org/10.1007/978-3-031-70239-6_10
  10. Rozhkov I., Lukachevitch N. Methods for Recognizing Nested Terms. arXiv:2504.16007. 2025. https://doi.org/10.48550/arXiv.2504.16007
  11. Semak V.V., Bolshakova E.I. Comparing Transformer-Based Approaches for Term Recognition in Russian Texts. Proceedings of the International Conference «Dialog 2025». 2025.
Date of receipt: 24.02.2026
Approved after review: 26.02.2026
Accepted for publication: 10.03.2026