A data-centric approach to short text classification under class imbalance

Journal: Highly Available Systems, No. 1, 2026

Type of article: scientific article

DOI: https://doi.org/10.18127/j20729472-202601-01

UDC: 004

Keywords: NLP; class imbalance; short texts; data cleaning; density-based clustering

Authors:

B.B. Baishev1, A.P. Khalov2

1 Nazarbayev University (Astana, Kazakhstan)

2 FRC CSC RAS (Moscow, Russia)

1 baishevbasar@gmail.com, 2 khalov.a@phystech.edu

Abstract:

Problem Statement. Short text processing in technical support systems suffers from class imbalance and noise. Traditional balancing (oversampling) fails on noisy data.

Objective. To improve classification accuracy through preliminary density-based cleaning of the training dataset.

Results. A multi-stage pipeline was developed that reduced noise by 16.53%. Model accuracy (R@3 metric) reached 97.4%. Experiments showed that the data cleaning strategy outperforms synthetic augmentation.

Practical Significance. Automating ticket routing with 97% reliability (within the top-3 recommendations) reduces ticket resolution time and operator workload.
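
The abstract does not define the R@3 metric. The sketch below shows one common reading, assumed here: the share of tickets whose true class appears among the three highest-scoring predicted classes. The function name `recall_at_k`, the class names, and the scores are hypothetical illustrations, not taken from the article.

```python
from typing import Mapping, Sequence


def recall_at_k(
    true_labels: Sequence[str],
    scores: Sequence[Mapping[str, float]],
    k: int = 3,
) -> float:
    """Share of samples whose true label is among the k top-scoring classes.

    Assumption: each sample has exactly one true class, so this value
    coincides with top-k accuracy.
    """
    if len(true_labels) != len(scores):
        raise ValueError("true_labels and scores must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")

    hits = 0
    for label, class_scores in zip(true_labels, scores):
        # Rank class names by descending score and keep the first k.
        top_k = sorted(class_scores, key=class_scores.get, reverse=True)[:k]
        hits += label in top_k
    return hits / len(true_labels)


# Hypothetical routing queues and classifier scores for four tickets.
y_true = ["billing", "network", "hardware", "hardware"]
y_scores = [
    {"billing": 0.50, "network": 0.30, "hardware": 0.15, "access": 0.05},
    {"billing": 0.40, "network": 0.10, "hardware": 0.30, "access": 0.20},
    {"billing": 0.10, "network": 0.20, "hardware": 0.30, "access": 0.40},
    {"billing": 0.35, "network": 0.30, "hardware": 0.25, "access": 0.10},
]

print(recall_at_k(y_true, y_scores, k=3))  # 0.75: three of four tickets are covered
```

Under this reading, a value of 97.4% would mean that for roughly 97 of every 100 tickets the correct queue is among the three suggestions shown to the operator.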

Pages: 8-11

For citation

Baishev B.B., Khalov A.P. A data-centric approach to short text classification under class imbalance. Highly Available Systems. 2026. V. 22. No 1. P. 8-11. DOI: https://doi.org/10.18127/j20729472-202601-01 (in Russian)

Date of receipt: 24.02.2026

Approved after review: 26.02.2026

Accepted for publication: 10.03.2026