B.T. Gizatullin1, O.A. Nevzorova2
1-2 Kazan (Volga Region) Federal University (Kazan, Russia)
1 gizat.blt@gmail.com, 2 onevzoro@gmail.com
Problem statement. The Universal Decimal Classification (UDC) is a hierarchical system for indexing scientific publications; a paper may be assigned one or several codes. Manual labeling is labor-intensive and may be inconsistent, motivating automatic assignment. For Russian mathematical texts, the task is complicated by domain-specific terminology and formulas. Prior work includes classical ML and similarity-based methods [1], BERT-based multi-label models [2], hybrid recommender-style schemes [3], methods with explicit modeling of the UDC hierarchy [4], and large language models [5].
Purpose. We comparatively analyze approaches to automatic UDC classification for Russian mathematical publications and identify effective combinations of text representations and models in two settings: single-label (primary code) and multi-label (all codes).
Methods. We evaluate three text representations (TF-IDF, Word2Vec, and Sci-Rus-tiny [6]) with two classifiers (logistic regression and CatBoost) on 4,194 Math-Net.Ru papers, using a temporal split (training on papers up to 2020, testing on papers from 2021 onward) and truncated UDC codes.
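The data preparation described in Methods can be illustrated with a minimal sketch. It is not the authors' code: the `Paper` record, its field names, and the truncation rule (keeping the first `depth` dot-separated groups of a code) are assumptions made for illustration, since the abstract does not state how codes are truncated. The split treats 2020 as the last training year, which is one reading of "up to 2020".

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical record for one paper; the field names are assumptions."""
    text: str
    year: int
    udc_codes: list


def truncate_udc(code: str, depth: int = 1) -> str:
    """Keep the first `depth` dot-separated groups of a UDC code (assumed rule)."""
    return ".".join(code.strip().split(".")[:depth])


def temporal_split(papers, last_train_year=2020):
    """Split by publication year: train <= last_train_year < test."""
    train = [p for p in papers if p.year <= last_train_year]
    test = [p for p in papers if p.year > last_train_year]
    return train, test


# Toy records, invented for the example only.
papers = [
    Paper("text a", 2018, ["517.956.2"]),
    Paper("text b", 2020, ["519.17", "512.54"]),
    Paper("text c", 2022, ["514.7"]),
]
train, test = temporal_split(papers)
labels = [[truncate_udc(c, depth=1) for c in p.udc_codes] for p in train]
```

A temporal split of this kind keeps later papers out of training, so the test score reflects how the model would behave on newly published work. Compound UDC notations (for example codes joined with ":" or "+") are not handled here.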
Results. In the single-label setting, TF-IDF + logistic regression performs best, achieving an Accuracy@3 of 0.97, which suggests that most errors remain within nearby branches. TF-IDF + CatBoost is close but more sensitive to class imbalance; Word2Vec and Sci-Rus-tiny fall behind. In the multi-label setting, per-class thresholds are tuned on validation data; TF-IDF + logistic regression again leads, with higher top-level scores. The hard cases are within-branch confusions; the most informative TF-IDF terms align with topical codes.
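The two evaluation ideas named in Results, Accuracy@k and per-class threshold tuning, can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the threshold grid, and the choice of F1 as the tuning criterion are assumptions, since the abstract does not name the criterion.

```python
def accuracy_at_k(scores, gold, k=3):
    """Share of papers whose primary code is among the k highest-scored codes.

    `scores` is a list of {code: score} dicts, `gold` the list of true codes.
    """
    hits = 0
    for row, label in zip(scores, gold):
        top_k = sorted(row, key=row.get, reverse=True)[:k]
        hits += label in top_k
    return hits / len(gold)


def f1_score(pred, true):
    """Binary F1 for two equal-length lists of booleans."""
    tp = sum(p and t for p, t in zip(pred, true))
    fp = sum(p and not t for p, t in zip(pred, true))
    fn = sum(t and not p for p, t in zip(pred, true))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(probs, true, grid=None):
    """Pick the threshold for ONE class that maximizes F1 on validation data.

    Assumed grid: 0.05, 0.10, ..., 0.95. Ties keep the lowest threshold.
    """
    grid = grid or [i / 20 for i in range(1, 20)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        score = f1_score([p >= t for p in probs], true)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t


# Toy scores, invented for the example only.
scores = [
    {"517": 0.6, "519": 0.3, "512": 0.1, "514": 0.0},
    {"517": 0.1, "519": 0.2, "512": 0.3, "514": 0.4},
]
acc3 = accuracy_at_k(scores, ["512", "517"], k=3)
threshold = tune_threshold([0.9, 0.8, 0.3, 0.2], [True, True, False, False])
```

In the multi-label setting, `tune_threshold` would be called once per code on held-out validation predictions, which lets rare codes use a lower cut-off than frequent ones.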
Practical significance. The pipeline enables automated categorization in scientific archives and libraries, UDC recommendations, and detection of potentially misassigned codes.
Gizatullin B.T., Nevzorova O.A. Comparative analysis of approaches to automatic classification of mathematical scientific articles based on UDC codes. Highly Available Systems. 2026. V. 22. № 1. P. 21–24. DOI: https://doi.org/10.18127/j20729472-202601-04 (in Russian)
1. Romanov A., Lomotin K., Kozlova E. Automatization of Scientific Articles Classification According to Universal Decimal Classifier. Supplementary Proceedings of the Sixth International Conference on Analysis of Images, Social Networks and Texts (AIST 2017). CEUR Workshop Proceedings. 2017. V. 1975. P. 122–133.
2. Roy A., Ghosh S. Automated Subject Identification using the Universal Decimal Classification: The ANN Approach. SRELS Journal of Information and Knowledge. 2023. V. 60. № 2. P. 69–76. DOI: 10.17821/srels/2023/v60i2/170963
3. Borovič M., Ojsteršek M., Strnad M. A Hybrid Approach to Recommending Universal Decimal Classification Codes for Cataloguing in Slovenian Digital Libraries. IEEE Access. 2022. V. 10. P. 85595–85605. DOI: 10.1109/ACCESS.2022.3198706
4. Mamedov V., Kovalevsky D., Morozov D., Stolyarov S., Ospichev S. Hierarchical classification of scientific articles using deep learning (using the UDC hierarchy as an example). Modeling and Analysis of Information Systems. 2025. V. 32. № 1. P. 80–94. DOI: 10.18255/1818-1015-2025-1-80-94
5. Borovič M., Tomovski E., Li Dobnik T., Majninger S. Evaluating Proprietary and Open-Weight Large Language Models as Universal Decimal Classification Recommender Systems. Applied Sciences. 2025. V. 15. № 14. Art. 7666. DOI: 10.3390/app15147666
6. Gerasimenko N., Vatolin A., Ianina A., Vorontsov K. SciRus: Tiny and Powerful Multilingual Encoder for Scientific Texts. Doklady Mathematics. 2024. V. 110. Suppl. 1. P. S193–S202. DOI: 10.1134/S1064562424602178

