N.S. Kurganov1, P.G. Klyucharev2
1, 2 Bauman Moscow State Technical University (Moscow, Russia)
1 nikk178@mail.ru, 2 pk.iu8@yandex.ru
Problem statement. The increasing quality of text generation by large language models (LLMs) significantly complicates the task of automatically distinguishing generated texts from human-written ones, especially for texts in Russian. However, the effect of text length on the classification result remains poorly understood. In practice, classifiers work with texts of various sizes, yet most approaches do not treat length as a significant factor, which can distort quality metrics and reduce the stability of models. The task of investigating how the efficiency of classifying LLM-generated Russian-language texts depends on their length is therefore relevant.
Goal. To improve the quality of methods for detecting LLM-generated Russian-language texts of different lengths.
Results. A text tokenization method was proposed that accounts for Russian-language texts of different lengths. The method was used to vectorize the text data on which detectors of LLM-generated texts are subsequently trained. When training a text classifier based on a transformer encoder of the BERT architecture, the method showed fairly high efficiency in detecting LLM-generated Russian-language texts of different lengths. Compared to the standard tokenization method at the maximum text length, the gain in detection accuracy for the transformer-encoder classifier on Russian-language texts ranged from 30% to 50%. Classification accuracy ranged from 73% to 97% across the validation datasets considered in this paper, which is a fairly good result relative to other methods for detecting LLM-generated texts.
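The abstract does not spell out the proposed tokenization method, so the following is only a minimal sketch of generic length-aware tokenization for a BERT-based detector, assuming the Hugging Face transformers library and a public Russian BERT checkpoint (DeepPavlov/rubert-base-cased). The model name, the 512-token cap, and the dynamic-padding strategy are illustrative assumptions, not the authors' method.

```python
# Minimal sketch: length-aware tokenization for a BERT-based detector of
# LLM-generated Russian text. The checkpoint, max_length, and padding
# strategy are illustrative assumptions, not the paper's actual method.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "DeepPavlov/rubert-base-cased"  # a public Russian BERT encoder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2  # 0 = human-written, 1 = LLM-generated
)

texts = [
    "Короткий текст.",                    # a short text
    "Значительно более длинный текст, "   # a noticeably longer text
    "который занимает больше токенов.",
]

# Pad each batch only to its longest sequence ("longest") rather than to a
# fixed maximum, so short texts are not dominated by padding tokens; cap
# very long texts at the encoder's limit via truncation.
batch = tokenizer(
    texts,
    padding="longest",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**batch).logits
probs = torch.softmax(logits, dim=-1)
print(probs)  # per-text probabilities for the two classes
```

Padding each batch only to its longest sequence keeps short texts from being swamped by padding, which is one simple way to make a detector less sensitive to varying input lengths.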
Practical significance. The results of the study can be used to prepare text data and to train neural network methods for detecting LLM-generated texts of different lengths in almost any language.
Kurganov N.S., Klyucharev P.G. The effect of text length on the classification result of LLM-generated Russian-language texts. Dynamics of complex systems. 2026. V. 20. № 3. P. 18−27. DOI: 10.18127/j19997493-202603-02 (in Russian).

