Impact of packet sampling on classification of network traffic by machine learning methods

Journal Neurocomputers, No. 4, 2016

Keywords: network traffic classification; machine learning; classification algorithms; C4.5; SVM; AdaBoost; NaiveBayes; BayesNet; sampling

Authors:

O.I. Sheluhin - Dr.Sc. (Eng.), Professor, Head of Department "Information Security and Automation", Moscow Technical University of Communications and Informatics. E-mail: sheluhin@mail.ru
Yu.A. Kalugin - Post-graduate Student, Department of "Information Security and Automation", Moscow Technical University of Communications and Informatics. E-mail: derron210@gmail.com

Abstract:

Network traffic classification is used in several fields, such as security control, application prioritization, and intrusion detection. The volume of information passing through operational networks is large, so classification requires expensive equipment and storage for large amounts of data. This problem can be addressed by sampling, that is, by selecting random packets from the traffic, which reduces the processing and storage requirements for the data to be analyzed. Limited resources and the high capacity of current networks hinder the deployment of classification solutions. In this work, machine learning is used to solve the classification task. Two problems stand in the way of deploying a classification solution. First, the majority of machine learning algorithms operate only on packet-level datasets, which requires additional (and often expensive) equipment. Second, the impact of packet sampling on traffic classification is still unknown, although network operators use sampling widely. The purpose of this work is to address these problems using machine learning methods. To determine the impact of packet sampling on algorithm performance, the most common classification algorithms were used: C4.5, SVM, AdaBoost, Naïve Bayes, and BayesNet. The algorithms were implemented with the Weka API. Traffic was captured to train the algorithms and to perform classification at different sampling rates. The following applications were selected for classification: web, p2p, ftp, and mail. The experiment shows that, in general, sampling degrades the performance of all the algorithms. The impact of sampling is lowest for applications whose flows have long duration and large size. Performance is evaluated with metrics from the field of information retrieval, namely precision and recall. The work shows how recall and precision depend on the sampling rate, and that sampling increases the number of type I and type II errors.
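The two ideas the abstract relies on, random packet sampling and evaluation by precision and recall, can be illustrated with a minimal Python sketch. This is not the authors' Weka-based implementation. The function names `sample_packets` and `precision_recall` are hypothetical, independent per-packet selection is an assumed sampling scheme, and the labels in the usage example are made up for illustration rather than taken from the experiment.

```python
import random


def sample_packets(packets, rate, seed=None):
    """Keep each packet independently with probability `rate` (assumed scheme)."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    rng = random.Random(seed)
    return [p for p in packets if rng.random() < rate]


def precision_recall(true_labels, predicted_labels, positive):
    """Return (precision, recall) for one application class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # type I errors
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # type II errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative labels only, not data from the paper.
true_labels = ["web", "web", "p2p", "ftp", "mail", "web"]
predicted_labels = ["web", "p2p", "p2p", "web", "mail", "web"]
web_precision, web_recall = precision_recall(true_labels, predicted_labels, "web")
```

For a given class, false positives (type I errors) lower precision and false negatives (type II errors) lower recall, so an increase in both error types under sampling shows up as a drop in both metrics.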

Pages: 14-24
