Comparison of data classification methods in machine learning based on scientific publications on the nuclear fuel cycle

Journal: Highly Available Systems, №1, 2025

Type of article: scientific article

DOI: https://doi.org/10.18127/j20729472-202501-03

UDC: 004.652:621.039.54-016

Keywords: data classification model; machine learning algorithm; scientific publication; data classification; logistic regression; support vector machine; decision tree; random forest; gradient boosting; nuclear technologies

Authors:

R.R. Tukumbetova1, M.S. Ulizko2, T.V. Korenkova3, A.A. Artamonov4

1−4 National Research Nuclear University MEPhI (Moscow, Russia)
1 rrtukumbetova@mephi.ru, 2 msulizko@mephi.ru, 3 korenkova.tanya@mail.ru, 4 aaartamonov@mephi.ru

Abstract:

The paper compares the effectiveness of machine learning methods for classifying bibliographic records of scientific publications from abstract databases by six nuclear fuel cycle technologies. To select the publications related to these technologies and assign each one to a technology, the authors applied five widely used discriminative machine learning models: logistic regression, support vector machine (SVM), decision tree, random forest, and gradient boosting.

Bibliographic records from the International Nuclear Information System (INIS) were used for the first iteration of model training. A confusion matrix was built from the results of each model. To evaluate and compare the models, the authors used the following classification quality metrics: accuracy (proportion of correct answers), precision, recall, F1-score, training time, and prediction time. After the first iteration of training, the models were also tested on INIS data. All five models performed satisfactorily, with minor differences in the metrics.

Next, the authors tested the models on records from the Scopus abstract database that had been manually labeled by a nuclear expert. Validation on Scopus data revealed a drop in model performance, which may be due to the difference in abstract length between the bibliographic records used for training and those used for testing. Because the Scopus abstracts are more diverse and detailed, the authors retrained the models on both INIS and Scopus data, expecting that this would let the models adapt to the new information and generalise better. After retraining, the classification quality metrics improved significantly compared to the previous result. Training on the two datasets allowed the models to capture patterns in the data better, which had a positive impact on prediction quality.
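The comparison described above can be sketched with scikit-learn, which the reference list suggests the authors used. This is only an illustration under stated assumptions, not the authors' pipeline: the toy corpus, the TF-IDF features, the use of `LinearSVC` for the SVM, the macro averaging, and the helper names `make_toy_corpus` and `compare_models` are all hypothetical, and two made-up classes stand in for the six technologies.

```python
import time

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def make_toy_corpus(docs_per_class=20):
    """Build a tiny synthetic corpus (hypothetical data, not INIS or Scopus)."""
    topics = {
        "enrichment": ["centrifuge", "cascade", "isotope", "separation", "hexafluoride"],
        "reprocessing": ["spent", "fuel", "extraction", "plutonium", "solvent"],
    }
    texts, labels = [], []
    for label, words in topics.items():
        for i in range(docs_per_class):
            # Rotate through the vocabulary so that documents differ.
            doc = " ".join(words[(i + k) % len(words)] for k in range(3))
            texts.append(f"study of {doc}")
            labels.append(label)
    return texts, labels


def compare_models(texts, labels, seed=0):
    """Train five classifiers and collect quality and timing metrics."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed)

    # Assumption: TF-IDF features; the abstract does not name the features used.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(x_train)
    X_test = vectorizer.transform(x_test)

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "SVM (linear)": LinearSVC(),
        "decision tree": DecisionTreeClassifier(random_state=seed),
        "random forest": RandomForestClassifier(random_state=seed),
        "gradient boosting": GradientBoostingClassifier(random_state=seed),
    }

    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        fit_time = time.perf_counter() - start

        start = time.perf_counter()
        y_pred = model.predict(X_test)
        predict_time = time.perf_counter() - start

        # Macro averaging treats every class equally (an assumption here).
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, y_pred, average="macro", zero_division=0)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "fit_time_s": fit_time,
            "predict_time_s": predict_time,
            "confusion_matrix": confusion_matrix(y_test, y_pred),
        }
    return results


texts, labels = make_toy_corpus()
results = compare_models(texts, labels)
for name, metrics in results.items():
    print(f"{name}: F1={metrics['f1']:.2f}, fit={metrics['fit_time_s']:.4f}s")
```

Retraining on a combined corpus, as the authors did with INIS and Scopus records, would correspond to calling `compare_models` on the concatenated texts and labels. Timings from a toy corpus of this size say nothing about the relative speed of the models on real data.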

Across all the metrics considered in the paper, the support vector machine and gradient boosting showed the best results on most of them. In this study, both methods showed high classification accuracy. Gradient boosting performed well on the F1-score, but its training time is much longer than that of the support vector machine. The support vector machine, in turn, is behind in accuracy but trains much faster than gradient boosting.
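For reference, the four quality metrics named in the abstract have standard definitions in terms of confusion-matrix counts. The formulas below are the usual per-class (one-vs-rest) forms, where TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives. How the authors averaged these values over the six classes is not stated on this page, so no averaging scheme is implied here.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall}    &= \frac{TP}{TP + FN}, \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{align*}
```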

Pages: 25-38

For citation

Tukumbetova R.R., Ulizko M.S., Korenkova T.V., Artamonov A.A. Comparison of data classification methods in machine learning based on scientific publications on the nuclear fuel cycle. Highly Available Systems. 2025. V. 21. № 1. P. 25−38. DOI: https://doi.org/10.18127/j20729472-202501-03 (in Russian)

Date of receipt: 16.01.2025

Approved after review: 27.01.2025

Accepted for publication: 27.02.2025