Journal Neurocomputers, No. 5, 2022
Article in issue:
Increasing the efficiency of spam filtering based on machine learning methods in messages different nature
Type of article: scientific article
DOI: https://doi.org/10.18127/j19998554-202205-01
UDC: 519.6
Authors:

I.A. Chizhova¹, E.I. Kublik², M.S. Chipchagov³, A.I. Labintsev⁴

¹⁻⁴ Financial University under the Government of the Russian Federation (Moscow, Russia)

Abstract:

This article examines machine learning algorithms for spam filtering. The objects of study are spam filters, which differ both in the methods they use and in the specialized data on which they are trained and tested.

Spam filtering is treated as a binary classification task in which each message is labeled as spam or non-spam. The proposed models were built on public data sets used for commercial and scientific purposes to identify SMS spam, email spam, and web spam pages, and a comparative analysis of the developed models was conducted.

The structural destruction time (stabilization time) and the system dimension were selected as the basic parameters of the model. A relationship was revealed between these parameters and several factors: the number of epicenters of structural failure, the edge density of the graph of the network system structure, and the distance between the current and maximum load ranges of the system.

As a result of the research, several spam filtering models were built, and the following conclusions were drawn:

the k-Nearest Neighbors method is either poorly suited to text classification tasks or requires more careful hyperparameter tuning;

the Naive Bayes classifier and the SVM are easy to implement and fast;

logistic regression is the most effective method for classifying short texts;

the Multilayer Perceptron Classifier and the CatBoost Classifier are universal, but both methods are time-consuming.
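
The remark that the Naive Bayes classifier is easy to implement can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the whitespace tokenizer, the Laplace smoothing, the function names, and the six toy messages are all assumptions made for illustration, and the abstract does not say which variant or data preprocessing was used.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_naive_bayes(messages, labels):
    """Collect per-class word counts, class frequencies, and the vocabulary."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, class_counts, vocabulary


def predict(text, model):
    """Return the class with the highest log-posterior score."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: share of training messages that belong to this class.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from the data sets used in the article.
train_messages = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free prize",
    "are we meeting for lunch today",
    "please send the report today",
    "see you at the meeting",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = train_naive_bayes(train_messages, train_labels)
print(predict("free prize offer", model))
print(predict("lunch meeting today", model))
```

Training amounts to a single counting pass over the messages, and prediction is a sum of logarithms per class, which is consistent with the observation that the method is fast. A realistic filter would need a proper tokenizer and a far larger labeled corpus.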

Pages: 5-18
For citation

Chizhova I.A., Kublik E.I., Chipchagov M.S., Labintsev A.I. Increasing the efficiency of spam filtering based on machine learning methods in messages different nature. Neurocomputers. 2022. V. 24. No. 5. P. 5-18. DOI: https://doi.org/10.18127/j19998554-202205-01 (in Russian)

Date of receipt: 18.08.2022
Approved after review: 01.09.2022
Accepted for publication: 22.09.2022