Comparative analysis of approaches to automatic classification of mathematical scientific articles based on UDC codes

Journal: Highly Available Systems, № 1, 2026

Type of article: scientific article

DOI: https://doi.org/10.18127/j20729472-202601-04

UDC: 004.912

Keywords: automatic classification; Universal Decimal Classification (UDC); scientific text processing; machine learning; text vectorization; comparative analysis

Authors:

B.T. Gizatullin1, O.A. Nevzorova2

1-2 Kazan (Volga Region) Federal University (Kazan, Russia)

1 gizat.blt@gmail.com, 2 onevzoro@gmail.com

Abstract:

Problem statement. The Universal Decimal Classification (UDC) is a hierarchical system for indexing scientific publications; a paper may be assigned one or several codes. Manual labeling is labor-intensive and may be inconsistent, motivating automatic assignment. For Russian mathematical texts, the task is complicated by domain-specific terminology and formulas. Prior work includes classical ML and similarity-based methods [1], BERT-based multi-label models [2], hybrid recommender-style schemes [3], methods with explicit modeling of the UDC hierarchy [4], and large language models [5].

Purpose. We comparatively analyze approaches to automatic UDC classification for Russian mathematical publications and identify effective combinations of text representations and models in two settings: single-label (primary code) and multi-label (all codes).

Methods. We evaluate TF-IDF, Word2Vec, and Sci-Rus-tiny [6] text representations with logistic regression and CatBoost classifiers on 4,194 Math-Net.Ru papers, using a temporal split (training on papers up to 2020, testing on papers from 2021 onward) and truncated UDC codes.
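
The data preparation described above can be illustrated with a minimal sketch. It is not the authors' code: the record fields (`year`, `udc`), the helper names, the inclusive 2020 boundary, and the rule of truncating a code to its first dot-separated groups are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: {"year": int, "udc": str}
Paper = Dict[str, object]


def truncate_udc(code: str, depth: int = 2) -> str:
    """Keep only the first `depth` dot-separated groups of a UDC code.

    Assumption: truncation works on dot-separated groups; the paper
    does not state the exact rule.
    """
    groups = code.strip().split(".")
    return ".".join(groups[:depth])


def temporal_split(
    papers: List[Paper], last_train_year: int = 2020
) -> Tuple[List[Paper], List[Paper]]:
    """Split by publication year so every test paper is later than training."""
    train = [p for p in papers if p["year"] <= last_train_year]
    test = [p for p in papers if p["year"] > last_train_year]
    return train, test


# Toy records, invented for the example.
papers = [
    {"year": 2019, "udc": "517.956.2"},
    {"year": 2020, "udc": "519.21"},
    {"year": 2021, "udc": "512.54"},
]
train, test = temporal_split(papers)
train_labels = [truncate_udc(p["udc"], depth=1) for p in train]
```

Splitting by year rather than at random keeps later papers out of training, which is closer to how a classifier would be used on newly submitted articles.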

Results. In the single-label setting, TF-IDF + logistic regression performs best and achieves an Accuracy@3 of 0.97, suggesting that most errors remain within nearby branches. TF-IDF + CatBoost is close but more sensitive to class imbalance; Word2Vec and Sci-Rus-tiny lag behind. In the multi-label setting, per-class thresholds are tuned on a validation set; TF-IDF + logistic regression again leads, with higher top-level scores. The hard cases are within-branch confusions; the most informative TF-IDF terms align with topical codes.
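
The two evaluation ideas mentioned here, Accuracy@k and per-class threshold tuning on validation data, can be sketched as follows. This is an illustrative sketch only: the function names, the threshold grid, and the use of F1 as the tuning criterion are assumptions, since the abstract does not specify them.

```python
from typing import Dict, List, Sequence


def accuracy_at_k(y_true: Sequence[str], scores: Sequence[Dict[str, float]], k: int = 3) -> float:
    """Share of papers whose true code is among the k highest-scored codes."""
    hits = 0
    for true_label, class_scores in zip(y_true, scores):
        top_k = sorted(class_scores, key=class_scores.get, reverse=True)[:k]
        hits += true_label in top_k
    return hits / len(y_true)


def f1_binary(y_true: Sequence[int], y_pred: Sequence[bool]) -> float:
    """F1 score for one class treated as a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def tune_threshold(y_true: Sequence[int], probs: Sequence[float]) -> float:
    """Pick, for one class, the grid threshold with the best validation F1.

    Assumption: a fixed grid 0.05, 0.10, ..., 0.95; ties go to the lowest value.
    """
    grid: List[float] = [i / 20 for i in range(1, 20)]
    return max(grid, key=lambda t: f1_binary(y_true, [p >= t for p in probs]))


# Toy single-label scores, invented for the example.
y_true = ["517", "519", "512"]
scores = [
    {"517": 0.6, "519": 0.3, "512": 0.1, "514": 0.0},
    {"517": 0.5, "512": 0.3, "514": 0.15, "519": 0.05},
    {"517": 0.4, "519": 0.3, "512": 0.2, "514": 0.1},
]
acc3 = accuracy_at_k(y_true, scores, k=3)

# Toy validation data for one class in the multi-label setting.
val_true = [1, 0, 1, 0]
val_probs = [0.9, 0.2, 0.6, 0.4]
best_t = tune_threshold(val_true, val_probs)
```

In a multi-label pipeline, `tune_threshold` would be called once per UDC code, so that rare codes can receive a lower cut-off than frequent ones.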

Practical significance. The pipeline enables automated categorization in scientific archives and libraries, UDC recommendations, and detection of potentially misassigned codes.

Pages: 21-24

For citation

Gizatullin B.T., Nevzorova O.A. Comparative analysis of approaches to automatic classification of mathematical scientific articles based on UDC codes. Highly Available Systems. 2026. V. 22. № 1. P. 21–24. DOI: https://doi.org/10.18127/j20729472-202601-04 (in Russian)

References

1. Romanov A., Lomotin K., Kozlova E. Automatization of Scientific Articles Classification According to Universal Decimal Classifier. Supplementary Proceedings of the Sixth International Conference on Analysis of Images, Social Networks and Texts (AIST 2017). CEUR Workshop Proceedings. 2017. V. 1975. P. 122–133.
2. Roy A., Ghosh S. Automated Subject Identification using the Universal Decimal Classification: The ANN Approach. SRELS Journal of Information and Knowledge. 2023. V. 60. № 2. P. 69–76. DOI: 10.17821/srels/2023/v60i2/170963
3. Borovič M., Ojsteršek M., Strnad M. A Hybrid Approach to Recommending Universal Decimal Classification Codes for Cataloguing in Slovenian Digital Libraries. IEEE Access. 2022. V. 10. P. 85595–85605. DOI: 10.1109/ACCESS.2022.3198706
4. Mamedov V., Kovalevsky D., Morozov D., Stolyarov S., Ospichev S. Hierarchical classification of scientific articles using deep learning (using the UDC hierarchy as an example). Modeling and Analysis of Information Systems. 2025. V. 32. № 1. P. 80–94. DOI: 10.18255/1818-1015-2025-1-80-94
5. Borovič M., Tomovski E., Li Dobnik T., Majninger S. Evaluating Proprietary and Open-Weight Large Language Models as Universal Decimal Classification Recommender Systems. Applied Sciences. 2025. V. 15. № 14. Art. 7666. DOI: 10.3390/app15147666
6. Gerasimenko N., Vatolin A., Ianina A., Vorontsov K. SciRus: Tiny and Powerful Multilingual Encoder for Scientific Texts. Doklady Mathematics. 2024. V. 110. Suppl. 1. P. S193–S202. DOI: 10.1134/S1064562424602178

Date of receipt: 24.02.2026

Approved after review: 26.02.2026

Accepted for publication: 10.03.2026