B.B. Baishev1, A.P. Khalov2
1 Nazarbayev University (Astana, Kazakhstan)
2 FRC CSC RAS (Moscow, Russia)
1 baishevbasar@gmail.com, 2 khalov.a@phystech.edu
Problem Statement. Short text processing in technical support systems suffers from class imbalance and noise. Traditional balancing (oversampling) fails on noisy data.
Objective. To improve classification accuracy through preliminary density-based cleaning of the training dataset.
Results. A multi-stage pipeline was developed that reduced noise by 16.53%. Model accuracy (R@3 metric) reached 97.4%. Experiments showed that the data cleaning strategy outperforms synthetic augmentation.
Practical Significance. The approach reduces ticket resolution time and operator workload by automating ticket routing with 97% reliability (within the top-3 recommendations).
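As an illustration of the R@3 figure quoted above, the following is a minimal sketch of how a top-k hit rate can be computed for single-label ticket routing. It assumes that R@3 means "the true class appears among the three highest-ranked predictions", which matches the "within the top-3 recommendations" wording but is not confirmed by the abstract. The function name `recall_at_k` and the queue labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def recall_at_k(
    y_true: Sequence[str],
    ranked_preds: Sequence[Sequence[str]],
    k: int = 3,
) -> float:
    """Share of samples whose true label is among the first k ranked predictions.

    Assumption: each sample has exactly one true label, and each entry of
    ranked_preds is ordered from most to least likely class.
    """
    if len(y_true) != len(ranked_preds):
        raise ValueError("y_true and ranked_preds must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    hits = sum(
        1 for truth, ranked in zip(y_true, ranked_preds) if truth in ranked[:k]
    )
    return hits / len(y_true)


# Hypothetical support queues, for illustration only.
y_true = ["network", "billing", "hardware", "access"]
ranked = [
    ["network", "hardware", "access"],   # true label at rank 1
    ["access", "network", "billing"],    # true label at rank 3
    ["billing", "access", "network"],    # true label missing from the top 3
    ["hardware", "access", "billing"],   # true label at rank 2
]

score = recall_at_k(y_true, ranked, k=3)
print(score)  # 0.75
```

With k=1 the same function reduces to ordinary top-1 accuracy, which makes it easy to compare how much the top-3 relaxation contributes to a reported score.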
Baishev B.B., Khalov A.P. A data-centric approach to short text classification under class imbalance. Highly Available Systems. 2026. V. 22. No 1. P. 8-11. DOI: https://doi.org/10.18127/j20729472-202601-01 (in Russian)
- Chawla N.V., Bowyer K.W., Hall L.O., Kegelmeyer W.P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002. V. 16. P. 321-357. https://doi.org/10.1613/jair.953
- Zha D. et al. Data-centric Artificial Intelligence: A Survey. ACM Comput. Surv. 2025. V. 57. Art. 129. https://doi.org/10.48550/arXiv.2303.10158
- Salton G., Buckley C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 1988. V. 24. No 5. P. 513-523. https://doi.org/10.1016/0306-4573(88)90021-0
- Batiuk T., Dosyn D. Intellectual analysis of textual data in social networks using BERT and XGBOOST. Visn. Nac. Univ. L'viv. Politeh.: Inf. Sist. Merezi. 2025. V. 17. P. 44-60. https://doi.org/10.23939/sisn2025.17.044
- Zemp M. Text classification of service desk tickets. Master's thesis. Zurich Univ. Appl. Sci. 2021.
- Parmar M., Tiwari A. Enhancing text classification performance using stacking ensemble. Proc. 5th Int. Conf. Mobile Comput. Sustain. Inform. (ICMCSI). 2024. P. 166-174. https://doi.org/10.1109/ICMCSI61536.2024.00031
- Akhbardeh F. et al. Handling extreme class imbalance in technical logbook datasets. Proc. ACL-IJCNLP. 2021. P. 4034-4045. https://doi.org/10.18653/v1/2021.acl-long.312
- Padurariu C., Breaban M.E. Dealing with data imbalance in text classification. Procedia Comput. Sci. 2019. V. 159. P. 736-745. https://doi.org/10.1016/j.procs.2019.09.229
- Asyaky M.S., Mandala R. Improving the performance of HDBSCAN on short text clustering. Proc. 8th Int. Conf. Adv. Informat. (ICAICTA). 2021. P. 1-6. https://doi.org/10.1109/ICAICTA53211.2021.9640285
- McInnes L., Healy J., Astels S. hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2017. V. 2. No 11. P. 205. https://doi.org/10.21105/joss.00205
- Khalov A.P., Ataeva O.M. Automatic and semi-automatic methods for constructing a domain knowledge graph and ontology expansion. Russian Digital Libraries Journal. 2025. V. 28. No 6. P. 1481-1519.
- Wolpert D.H. Stacked generalization. Neural Netw. 1992. V. 5. No 2. P. 241-259. https://doi.org/10.1016/S0893-6080(05)80023-1
- Micci-Barreca D. A preprocessing scheme for high-cardinality categorical attributes. SIGKDD Explor. Newsl. 2001. V. 3. No 1. P. 27-32. https://doi.org/10.1145/507533.507538
- Chen T., Guestrin C. XGBoost: A scalable tree boosting system. Proc. 22nd ACM SIGKDD (KDD). 2016. P. 785-794. https://doi.org/10.1145/2939672.2939785
- Akiba T. et al. Optuna: A next-generation hyperparameter optimization framework. Proc. 25th ACM SIGKDD (KDD). 2019. P. 2623-2631. https://doi.org/10.1145/3292500.3330701

