Journal: Highly Available Systems, No. 1, 2026
Article in issue:
A data-centric approach to short text classification under class imbalance
Type of article: scientific article
DOI: https://doi.org/10.18127/j20729472-202601-01
UDC: 004
Authors:

B.B. Baishev1, A.P. Khalov2

1 Nazarbayev University (Astana, Kazakhstan)

2 FRC CSC RAS (Moscow, Russia)

1 baishevbasar@gmail.com, 2 khalov.a@phystech.edu

Abstract:

Problem Statement. Short text processing in technical support systems suffers from class imbalance and noise. Traditional balancing (oversampling) fails on noisy data.

Objective. To improve classification accuracy through preliminary density-based cleaning of the training dataset.
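
The idea of density-based cleaning can be illustrated with a minimal sketch. The abstract does not specify the algorithm or its settings, so everything here is an assumption: the function names, the cosine distance, and the `eps` and `min_neighbors` thresholds are hypothetical, and the neighbor-count rule is a simplified stand-in in the spirit of a DBSCAN core-point test, not the authors' pipeline.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance between two vectors (1.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def density_keep_mask(
    vectors: Sequence[Sequence[float]],
    eps: float = 0.3,          # hypothetical neighborhood radius
    min_neighbors: int = 2,    # hypothetical density threshold
) -> List[bool]:
    """Mark a sample as kept if at least `min_neighbors` other samples lie within `eps`."""
    mask = []
    for i, v in enumerate(vectors):
        neighbors = sum(
            1
            for j, w in enumerate(vectors)
            if j != i and cosine_distance(v, w) <= eps
        )
        mask.append(neighbors >= min_neighbors)
    return mask


# Toy embeddings: three near-duplicates and one isolated outlier.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]
keep = density_keep_mask(toy_vectors)
cleaned = [v for v, k in zip(toy_vectors, keep) if k]
```

In this toy case the isolated vector is dropped and the three mutually close vectors are kept. Applying such a filter per class, before training, is one plausible reading of "preliminary cleaning".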

Results. A multi-stage pipeline was developed that reduced noise by 16.53%. Model accuracy (R@3 metric) reached 97.4%. Experiments showed that the data cleaning strategy outperforms synthetic augmentation.
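
For readers unfamiliar with the metric, the sketch below assumes R@3 means the share of tickets whose true class appears among the three highest-scored classes, which matches the "top-3 recommendations" wording in the abstract. The function name and the toy scores are hypothetical and are not taken from the article.

```python
from typing import Sequence


def top_k_hit_rate(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 3,
) -> float:
    """Fraction of samples whose true class index is among the k best-scored classes."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and of equal length")
    hits = 0
    for row, true_label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += true_label in top_k
    return hits / len(labels)


# Toy example: 4 tickets, 5 candidate classes (scores are made up).
toy_scores = [
    [0.10, 0.50, 0.20, 0.15, 0.05],  # true class 1 is ranked 1st: hit
    [0.40, 0.30, 0.20, 0.06, 0.04],  # true class 2 is ranked 3rd: hit
    [0.35, 0.30, 0.20, 0.10, 0.05],  # true class 4 is ranked 5th: miss
    [0.05, 0.10, 0.15, 0.30, 0.40],  # true class 3 is ranked 2nd: hit
]
toy_labels = [1, 2, 4, 3]

r_at_3 = top_k_hit_rate(toy_scores, toy_labels, k=3)
print(r_at_3)  # 0.75
```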

Practical Significance. Automating ticket routing with 97% reliability (within the top-3 recommendations) reduces ticket resolution time and operator workload.

Pages: 8-11
For citation

Baishev B.B., Khalov A.P. A data-centric approach to short text classification under class imbalance. Highly Available Systems. 2026. V. 22. No 1. P. 8-11. DOI: https://doi.org/10.18127/j20729472-202601-01 (in Russian)

References
  1. Chawla N.V., Bowyer K.W., Hall L.O., Kegelmeyer W.P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002. V. 16. P. 321-357. https://doi.org/10.1613/jair.953
  2. Zha D. et al. Data-centric Artificial Intelligence: A Survey. ACM Comput. Surv. 2025. V. 57. No Art. 129. https://doi.org/10.48550/arXiv.2303.10158
  3. Salton G., Buckley C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 1988. V. 24. No 5. P. 513-523. https://doi.org/10.1016/0306-4573(88)90021-0
  4. Batiuk T., Dosyn D. Intellectual analysis of textual data in social networks using BERT and XGBOOST. Visn. Nac. Univ. L'viv. Politeh.: Inf. Sist. Merezi. 2025. V. 17. P. 44-60. https://doi.org/10.23939/sisn2025.17.044
  5. Zemp M. Text classification of service desk tickets. Master's thesis. Zurich Univ. Appl. Sci. 2021.
  6. Parmar M., Tiwari A. Enhancing text classification performance using stacking ensemble. Proc. 5th Int. Conf. Mobile Comput. Sustain. Inform. (ICMCSI). 2024. P. 166-174. https://doi.org/10.1109/ICMCSI61536.2024.00031
  7. Akhbardeh F. et al. Handling extreme class imbalance in technical logbook datasets. Proc. ACL-IJCNLP. 2021. P. 4034-4045. https://doi.org/10.18653/v1/2021.acl-long.312
  8. Padurariu C., Breaban M.E. Dealing with data imbalance in text classification. Procedia Comput. Sci. 2019. V. 159. P. 736-745. https://doi.org/10.1016/j.procs.2019.09.229
  9. Asyaky M.S., Mandala R. Improving the performance of HDBSCAN on short text clustering. Proc. 8th Int. Conf. Adv. Informat. (ICAICTA). 2021. P. 1-6. https://doi.org/10.1109/ICAICTA53211.2021.9640285
  10. McInnes L., Healy J., Astels S. hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2017. V. 2. No 11. P. 205. https://doi.org/10.21105/joss.00205
  11. Khalov A.P., Ataeva O.M. Automatic and semi-automatic methods for constructing a domain knowledge graph and ontology expansion. Russian Digital Libraries Journal. 2025. V. 28. No 6. P. 1481-1519.
  12. Wolpert D.H. Stacked generalization. Neural Netw. 1992. V. 5. No 2. P. 241-259. https://doi.org/10.1016/S0893-6080(05)80023-1
  13. Micci-Barreca D. A preprocessing scheme for high-cardinality categorical attributes. SIGKDD Explor. Newsl. 2001. V. 3. No 1. P. 27-32. https://doi.org/10.1145/507533.507538
  14. Chen T., Guestrin C. XGBoost: A scalable tree boosting system. Proc. 22nd ACM SIGKDD (KDD). 2016. P. 785-794. https://doi.org/10.1145/2939672.2939785
  15. Akiba T. et al. Optuna: A next-generation hyperparameter optimization framework. Proc. 25th ACM SIGKDD (KDD). 2019. P. 2623-2631. https://doi.org/10.1145/3292500.3330701
Date of receipt: 24.02.2026
Approved after review: 26.02.2026
Accepted for publication: 10.03.2026