Journal Information-measuring and Control Systems №3, 2017
Article in issue:
Automated topic monitoring based on event detection in text stream
Authors:
A.M. Andreev - Ph.D. (Eng.), Associate Professor, Department of Computer Systems and Networks, Bauman Moscow State Technical University
E-mail: arkandreev@gmail.com
D.V. Berezkin - Ph.D. (Eng.), Associate Professor, Department of Computer Systems and Networks, Bauman Moscow State Technical University
E-mail: berezkind@bmstu.ru
I.A. Kozlov - Master (Computer Sciences), Junior Research Scientist, Department of Computer Systems and Networks, Bauman Moscow State Technical University
E-mail: kozlovilya89@gmail.com
Abstract:
The article deals with the task of topic monitoring. To address this problem we propose an approach based on detecting events that are relevant to specified topics in a stream of text documents. We consider event detection as a clustering task that consists in partitioning all documents into groups in such a way that documents within each group describe a certain event.
As each document should be processed immediately after it is retrieved from the source, we use an incremental clustering method. It is based on analyzing the similarity between a newly downloaded text message and the events that have been constructed previously. The event that is closest to the document is identified, and then the document is either assigned to this event (if the similarity between them exceeds a threshold) or designated as the first document of a new event cluster.
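The assignment step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: documents are reduced to keyword sets, the Jaccard index stands in for the multi-criteria similarity, and the names `Event`, `assign`, and the threshold value 0.3 are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Event:
    """A cluster of documents (hypothetical structure for this sketch)."""
    docs: List[Set[str]] = field(default_factory=list)

    def keywords(self) -> Set[str]:
        # Union of the keywords of all documents assigned so far.
        return set().union(*self.docs)


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign(doc: Set[str], events: List[Event], threshold: float = 0.3) -> Event:
    """Attach doc to the closest event, or open a new event cluster."""
    best, best_sim = None, -1.0
    for event in events:
        sim = jaccard(doc, event.keywords())
        if sim > best_sim:
            best, best_sim = event, sim
    if best is not None and best_sim > threshold:
        best.docs.append(doc)      # similarity exceeds the threshold
        return best
    new_event = Event([doc])       # doc becomes the first document of a new event
    events.append(new_event)
    return new_event


events: List[Event] = []
stream = [
    {"flood", "river", "city"},
    {"flood", "river", "rescue"},
    {"election", "vote", "party"},
]
for doc in stream:
    assign(doc, events)
```

Each document is handled once, in arrival order, so the procedure needs no second pass over the stream.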
Similarity between a document and an event is calculated via multi-criteria comparison of the document's and the event's models. The models contain components reflecting major aspects of events: a vector of keywords, sets of names of relevant persons, organizations and geographic locations, a set of relevant topics, the time interval of event occurrence, and other aspects. Taking all these aspects into account allows the proposed event detection method to be adjusted to various domains.
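One possible shape for such a model is sketched below. The class name `AspectModel`, the field names, and the `absorb` update rule (summing keyword weights, taking set unions, widening the time interval) are assumptions for illustration; the abstract lists the aspects but does not specify how an event model is updated.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional, Set, Tuple


@dataclass
class AspectModel:
    """Hypothetical model shared by documents and events."""
    keywords: Dict[str, float] = field(default_factory=dict)  # term -> weight
    persons: Set[str] = field(default_factory=set)
    organizations: Set[str] = field(default_factory=set)
    locations: Set[str] = field(default_factory=set)
    topics: Set[str] = field(default_factory=set)
    interval: Optional[Tuple[datetime, datetime]] = None      # (start, end)

    def absorb(self, doc: "AspectModel") -> None:
        """Fold a document model into this event model (assumed update rule)."""
        for term, weight in doc.keywords.items():
            self.keywords[term] = self.keywords.get(term, 0.0) + weight
        self.persons |= doc.persons
        self.organizations |= doc.organizations
        self.locations |= doc.locations
        self.topics |= doc.topics
        if doc.interval is not None:
            if self.interval is None:
                self.interval = doc.interval
            else:
                # Widen the event interval to cover the new document.
                self.interval = (min(self.interval[0], doc.interval[0]),
                                 max(self.interval[1], doc.interval[1]))


event = AspectModel()
event.absorb(AspectModel(keywords={"flood": 1.0}, persons={"Ivanov"},
                         interval=(datetime(2017, 1, 1), datetime(2017, 1, 1))))
event.absorb(AspectModel(keywords={"flood": 0.5, "river": 1.0},
                         locations={"Moscow"},
                         interval=(datetime(2017, 1, 3), datetime(2017, 1, 3))))
```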
Comparison of the document and the event is performed by calculating distances between the respective components of their models. Components are compared via various similarity measures, including cosine similarity and the Jaccard index. The result of the comparison is a multidimensional vector. Each component of the vector represents the distance between the document and the event in terms of a certain aspect. The vector is then used to calculate the overall similarity value, which is done by applying a Support Vector Machine (SVM). Using this machine learning method provides flexible adjustment of the proposed event detection method, which allows detecting events on various localization levels.
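The comparison can be illustrated with a short sketch. It computes cosine similarity over keyword vectors and the Jaccard index over name sets, then combines the resulting vector with a linear decision function of the form w·x + b, which is what a trained linear SVM evaluates. The choice of three aspects, the dictionary layout, and the values of `WEIGHTS` and `BIAS` are hypothetical; in practice they would be learned from labeled document-event pairs, and the abstract does not state which kernel the authors use.

```python
import math
from typing import Dict, List, Set


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare(doc: dict, event: dict) -> List[float]:
    """One value per aspect: keywords, persons, locations."""
    return [
        cosine(doc["keywords"], event["keywords"]),
        jaccard(doc["persons"], event["persons"]),
        jaccard(doc["locations"], event["locations"]),
    ]


# Hypothetical parameters standing in for a trained linear SVM.
WEIGHTS = [2.0, 1.0, 1.0]
BIAS = -1.5


def decision(x: List[float]) -> float:
    """Signed score; a positive value means the document matches the event."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS


doc = {"keywords": {"flood": 1.0, "river": 1.0},
       "persons": {"Ivanov"},
       "locations": {"Moscow"}}
event = {"keywords": {"flood": 1.0, "river": 1.0},
         "persons": {"Ivanov", "Petrov"},
         "locations": {"Moscow"}}

vector = compare(doc, event)
score = decision(vector)
```

Because the weights are learned rather than fixed, retraining on data labeled at a different granularity changes how much each aspect contributes to the final decision.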
The paper presents the results of experiments conducted to estimate the quality of event detection depending on the set of criteria taken into account. The experiments showed that using the full set of proposed aspects achieves higher quality than using its subsets.
Pages: 49-60