Technology for distributed crawling and analysis of big data from social media

Journal: Dynamics of Complex Systems - XXI century, No. 3, 2013

Keywords: social networks; big data; crawling; events monitoring; graph sampling; LiveJournal; VKontakte

Authors:

A.V. Yakushev - Junior Research Scientist, National Research University of Information Technologies, Mechanics and Optics. E-mail: andrew.yakushev@yandex.ru

Abstract:

This paper presents a technology for crawling and subsequently analyzing data from social media. The system follows the Big Data concept and is based on the MapReduce model for distributed computation. Social media have two distinguishing features: social media sites limit the number of requests per unit of time, and each site exposes its own interface for data crawling. These features shape the system requirements: efficient network use becomes secondary, while scalability and effective management of the crawled big data come to the fore. The crawler consists of three main units: a data fetcher, a data parser and a module that builds the URL queue. The URL queue builder is implemented on the Apache Hadoop framework, which implements the MapReduce model. This makes it possible to aggregate statistics over all the data, and the crawling policies use these statistics. The data fetchers and parsers are implemented as separate applications that can run on separate machines outside the Hadoop cluster. Synchronization between machines is performed with the Apache ZooKeeper library. In addition to the Hadoop methods for data analysis, the Apache Hive data warehouse can be used. Crawling policies determine which URLs are crawled and in what order. They are implemented as sequences of MapReduce jobs. The paper studies social media monitoring, that is, the task of finding new events. To optimize monitoring, M accesses to the social media must be split among N social media nodes, and the access times must be chosen so as to minimize the delay function. A homogeneous Poisson model is used to determine the access quotas, which allows new events to be found 15% faster. A periodic non-homogeneous Poisson model is used to determine the access times, which also allows new events to be found 15% faster. Together, the two models give a 25% improvement on data crawled from LiveJournal. Unbiased sampling from social media is complicated because it relies on sampling from the graph induced by connections between social media entities. As a result, the sampling is biased toward high-degree nodes. To obtain a sample with an unbiased degree distribution, the Metropolis-Hastings algorithm can be used. Data from the VKontakte social network are used to study graph sampling methods. Analysis of links between VKontakte users reveals that 85% of them can be explained by the users being acquainted in the real world. Several indicators of real-world acquaintance were used: the users studied at the same school, the users studied at the same university, or the users have "a lot" of common friends. The topological feature is based on the Jaccard coefficient.
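
The two graph techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the adjacency-list representation, the function names metropolis_hastings_walk and jaccard, and the toy star graph are all assumptions made for illustration. The walk proposes a uniformly chosen neighbour and accepts it with probability min(1, deg(current) / deg(candidate)), the standard Metropolis-Hastings correction that makes the stationary distribution uniform over nodes rather than proportional to degree.

```python
import random


def metropolis_hastings_walk(graph, start, steps, rng=None):
    """Sample nodes with a Metropolis-Hastings random walk.

    graph: dict mapping each node to a list of its neighbours
           (assumed undirected and connected; hypothetical format).
    Returns the list of nodes visited after each step.
    """
    rng = rng or random.Random()
    current = start
    samples = []
    for _ in range(steps):
        candidate = rng.choice(graph[current])
        # Accept with probability min(1, deg(current) / deg(candidate)).
        # Moves toward high-degree nodes are often rejected, which
        # cancels the degree bias of a plain random walk.
        if rng.random() < len(graph[current]) / len(graph[candidate]):
            current = candidate
        samples.append(current)
    return samples


def jaccard(graph, u, v):
    """Jaccard coefficient of the neighbour sets of u and v."""
    a, b = set(graph[u]), set(graph[v])
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy star graph: hub 0 connected to leaves 1..4 (illustrative data only).
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}

walk = metropolis_hastings_walk(star, start=0, steps=50_000,
                                rng=random.Random(42))
hub_share = walk.count(0) / len(walk)
print(f"share of samples at the hub: {hub_share:.3f}")
print(f"Jaccard(1, 2) = {jaccard(star, 1, 2)}")
```

On this star graph a plain random walk would spend about half of its steps at the hub, whereas the corrected walk should visit the hub in roughly one fifth of the steps, matching a uniform distribution over the five nodes. The Jaccard coefficient of two leaves is 1.0 because their neighbour sets coincide; what threshold counts as "a lot" of common friends is not stated in the abstract and is left open here.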

Pages: 51-55

References

Kaplan A.M., Haenlein M. Users of the world, unite! The challenges and opportunities of social media // Business Horizons. 2010. P. 61.
Lynch C. Big data: How do your data grow? // Nature. 2008. Vol. 455. No. 7209. P. 28-29.
Cho J., Garcia-Molina H. Effective page refresh policies for Web crawlers // ACM Transactions on Database Systems (TODS). 2003. Vol. 28. No. 4. P. 390-426.
Semenov A.V., Buxanovskij A.V. Metrologicheskij analiz v soczial'ny'x setyax // Izvestiya vuzov. Priborostroenie. 2011. Vol. 54. No. 3. P. 85-86.
Lämmel R. Google's MapReduce programming model - Revisited // Science of Computer Programming. 2008. Vol. 70. No. 1. P. 1-30.
Sia K.C., Cho J., Cho H.K. Efficient monitoring algorithm for fast news alerts // IEEE Transactions on Knowledge and Data Engineering. 2007. Vol. 19. No. 7. P. 950-961.
Leskovec J., Faloutsos C. Sampling from large graphs // Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2006. P. 631-636.
Gjoka M. et al. Walking in Facebook: A case study of unbiased sampling of OSNs // INFOCOM, 2010 Proceedings IEEE. 2010. P. 1-9.
Real R., Vargas J.M. The probabilistic basis of Jaccard's index of similarity // Systematic Biology. 1996. Vol. 45. No. 3. P. 380-385.
Mityagin S.A. et al. Informaczionnaya sistema modelirovaniya i analiza rasprostraneniya narkomanii v obshhestve na mikrourovne // Vestnik ITARK. Problemy' informatizaczii. 2012. No. 1(3). P. 34-40.