Automatic system for categorization of websites for blocking web pages with inappropriate content

Journal: Highly Available Systems, No. 3, 2013

Keywords: protection against information; classification of web pages; website analysis

Authors:

D.V. Komashinskiy - Post-graduate Student, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS). E-mail: komashinskiy@comsec.spb.ru
I.V. Kotenko - Head of Laboratory, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS). E-mail: ivkote@comsec.spb.ru
A.A. Chechulin - Research Scientist, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS). E-mail: chechulin@comsec.spb.ru
A.V. Shorov - Research Scientist, St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS). E-mail: ashorov@comsec.spb.ru

Abstract:

Today, the Internet is one of the main sources of information for a large number of people. However, access to some of this information may be undesirable for certain groups of people. In particular, it is important to restrict children's access to certain kinds of information. This problem can be solved by categorizing information sources and deciding whether to block access to them. To categorize web pages, the authors have developed an automated software system for categorizing websites. The system uses machine learning and data mining methods to analyze websites and assign them to categories. One of the most important stages in implementing the categorization system is collecting data for training the classifiers. This stage involves downloading the content of a web page directly by its URL, extracting text from tags, and collecting meta-information (the number of tags, images, etc.). The next step is to train the combined classifier: machine learning techniques are used to produce basic classifiers, which are then merged into a combined classifier used to categorize web pages. Training is based on the following structural components of web pages: (1) the text of the web page that is visible to the user, (2) the text contained in the tags most commonly used to highlight the main theme of the site, and (3) the URL of the website. The Internet consists of websites in different languages, which makes training classifiers on textual information inefficient, because it requires creating and updating databases of textual information in each language. To categorize websites in foreign languages, the authors propose translating the textual content of a website from its original language into the language used to train the classifiers.
Experiments on the test set showed that the combined model categorizes websites with higher accuracy in the following categories: adult, casino, cigarettes, cigars, dating, marijuana, drugs, wine, and weapons.
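
The data collection stage described in the abstract (download by URL, text extraction from tags, meta-information counts) could look roughly like the following. This is a minimal sketch using only the Python standard library, not the authors' implementation. The names `THEME_TAGS`, `PageFeatures`, `download`, `url_tokens`, and `extract_features` are hypothetical, and the set of tags treated as "highlighting the main theme" is an assumption, since the abstract does not list them.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumption: the abstract does not say which tags highlight the main theme.
THEME_TAGS = {"title", "h1", "h2", "h3", "b", "strong"}
# Text inside these tags is not shown to the user.
HIDDEN_TAGS = {"script", "style"}


class PageFeatures(HTMLParser):
    """Collects visible text, theme-tag text, and simple tag counts."""

    def __init__(self):
        super().__init__()
        self.visible_text = []
        self.theme_text = []
        self.tag_count = 0
        self.image_count = 0
        self._hidden = 0  # nesting depth inside script/style
        self._theme = 0   # nesting depth inside theme tags

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "img":
            self.image_count += 1
        if tag in HIDDEN_TAGS:
            self._hidden += 1
        if tag in THEME_TAGS:
            self._theme += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden:
            self._hidden -= 1
        if tag in THEME_TAGS and self._theme:
            self._theme -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._hidden:
            return
        self.visible_text.append(text)
        if self._theme:
            self.theme_text.append(text)


def download(url, timeout=10):
    """Fetch a page by URL and decode it with the declared charset."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def url_tokens(url):
    """Split the host and path of a URL into lowercase word tokens."""
    parsed = urlparse(url)
    raw = (parsed.netloc + parsed.path).lower()
    return [t for t in re.split(r"[^a-z0-9]+", raw) if t]


def extract_features(html, url):
    """Build the three text components plus meta-information counts."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    return {
        "visible_text": " ".join(parser.visible_text),
        "theme_text": " ".join(parser.theme_text),
        "url_tokens": url_tokens(url),
        "tag_count": parser.tag_count,
        "image_count": parser.image_count,
    }
```

Calling `extract_features(download(url), url)` would yield one record per page. Each of the three text fields could then feed its own basic classifier, with the counts available as extra features. How the basic classifiers are trained and merged is not specified in the abstract and is left out of this sketch.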

Pages: 119-127
