Radiotekhnika
Publishing house Radiotekhnika

"Publishing house Radiotekhnika":
scientific and technical literature.
Books and journals of publishing houses: IPRZHR, RS-PRESS, SCIENCE-PRESS


Тел.: +7 (495) 625-9241

 

Improvements of the algorithm of clustering of the news stream of text messages

Keywords:

Yu.E. Gapanyuk – Ph. D. (Eng.), Associate Professor, Department «Information Processing and Control Systems», Bauman Moscow State Technical University
E-mail: gapyu@bmstu.ru
S.V. Chernobrovkin – Post-graduate Student, Department «Information Processing and Control Systems»,  Bauman Moscow State Technical University
E-mail: sergey.chernobrovkin@inbox.ru
M.P. Myalkin – Post-graduate Student, Department «Information Processing and Control Systems»,  Bauman Moscow State Technical University
E-mail: maxmyalkin@gmail.com
A.V. Opryshko – Post-graduate Student, Department «Information Processing and Control Systems»,  Bauman Moscow State Technical University
E-mail: alexopryshko@yandex.ru
I.I. Latkin – Post-graduate Student, Department «Information Processing and Control Systems»,  Bauman Moscow State Technical University
E-mail: igor.latkin@outlook.com
A.V. Leontiev – Post-graduate Student, Department «Information Processing and Control Systems»,  Bauman Moscow State Technical University
E-mail: alekseyl@list.ru
G.A. Ozhegov – Post-graduate Student, Department «Information Processing and Control Systems»,  Bauman Moscow State Technical University
E-mail: grigory@ozhegov.name


In the algorithm of clustering of the news stream of text messages, introduced in paper [1], TF-IDF distance metric shows the best results. However, the proposed model is a statistical one. It is based only on the frequency of occurrences of words in texts, so despite the fact that it shows acceptable for the task results, it can not «understand» the semantics of the sentence, and therefore will not work properly for semantically similar texts, in which almost no common words. Another problem is that the clustering time is proportional to the number of already clustered news, which is unacceptable for online clustering.
Tomas Mikolov and a group of Google researchers proposed a model of vector representation of words – Word2Vec. The main idea of the model is that each word corresponds to a certain vector so that the vectors of semantically close words are located not far away in the vector space, and vectors of semantically distant words are located far away in the vector space.
We tried to use the metric Word Mover's Distance (WMD), which uses Word2Vec. The quality of this approach appears about three percent worse than the model TF-IDF quality. As the resulting metric will use the arithmetic mean of WMD and TF-IDF distance.
To speed up the online clustering the LSH method is applied. As hash function MinHash can be used. The generalized improved algorithm of clustering of the news stream of text messages Alg’ develops an algorithm Alg based on metagraph approach presented in paper [1].
The algorithm Alg’ updates step 1 of an algorithm Alg. The news are selecting for clustering based on LSH method with MinHash. This reduces the time of clustering. As the metric Dist is used the arithmetic mean of the WMD from the news headlines and cosine distance vectors TF-IDF from the news content, that allows to take into account the semantics of clustering news.
Improving of the quality of clustering is about 2.7%. The time of clustering is reduced approximately four times.

References:
  1. Gapanyuk Yu.E., Chernobrovkin S.V., Latkin I.I., Leont'ev A.V., Ozhegov G.A., Opry'shko A.V., Myalkin M.P. Algoritm klasterizaczii novostnogo potoka tekstovy'x soobshhenij // Informaczionno-izmeritel'ny'e i upravlyayushhie sistemy'. 2017. № 10. S. 64−72.
  2. Mikolov T., Sutskever I., Chen K., Corrado G.S., Dean J. Distributed representations of words and phrases and their compositionality // Advances in neural information processing systems. 2013. P. 3111−3119.
  3. Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger. From Word Embeddings To Document Distances // Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML'15). 2015. V. 37. P. 957−966.
  4. Charikar M. Similarity estimation techniques from rounding algorithm // ACM Symposium on Theory of Computing. 2002. P. 380−388.
  5. Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma. Detecting Near-Duplicates for Web Crawling // Proceedings of 16th international conference on World Wide Web (WWW-07). 2007. P. 141−150.
  6. Shrivastava A., P. Li. In Defense of MinHash Over SimHash // Proceedings of 17-th International Conference on Artificial Intelligence and Statistics (AISTATS). 2014. P. 886−894.
  7. Hongya Wang, Jiao Cao, LihChyun Shu, Davood Rafiei. Locality sensitive hashing revisited: filling the gap between theory and algorithm analysis // Proceedings of 22nd ACM international conference on Information & Knowledge Management (CIKM '13). 2013. P. 1969−1978.
  8. Chernen'kij V.M., Terexov V.I., Gapanyuk Yu.E. Struktura gibridnoj intellektual'noj informaczionnoj sistemy' na osnove metagrafov // Nejrokomp'yutery': razrabotka, primenenie. 2016. № 9. S. 3−14.

© Издательство «РАДИОТЕХНИКА», 2004-2017            Тел.: (495) 625-9241                   Designed by [SWAP]Studio