Journal Achievements of Modern Radioelectronics, № 4, 2021
Article in issue:
Method for automated data processing based on topic modeling
Type of article: scientific article
DOI: https://doi.org/10.18127/j20700784-202104-08
UDC: 004.8
Authors:

A.A. Ivanov¹, E.K. Vedernikov², A.A. Smirnov³

¹,² LTD «STC» (St. Petersburg, Russia)

³ Military Telecommunications Academy (St. Petersburg, Russia)

Abstract:

The paper formulates the problem of finding groups of document exchange subscribers with common interests from the content of a list of messages. Each message is represented by a triplet d_j = (s_j, r_j, c_j), d_j ∈ D, where s_j is the sender of the message, r_j is the recipient of the message (s_j, r_j ∈ A), c_j is the content of the message (text), and A is the set of subscribers. It is shown that the problem belongs to the class of natural language processing problems and can be solved effectively using topic modeling methods. The paper presents a technique that applies the theory of additive regularization of topic models to the statistical analysis of message content in order to determine which topics t, from a finite set T of given cardinality, each message belongs to. The matrix Θ of the probability distribution of message topics, produced by topic modeling, serves as the input of the group identification algorithm. This algorithm forms a matrix Ζ of probabilities that subscribers belong to topics. Groups are formed from the elements of the rows of matrix Ζ whose probabilities are greater than 1/|T|. Efficiency is estimated by an internal criterion, by comparing the obtained matrix Ζ with the matrix Ζ* of the real distribution for a control sample of messages. An example of applying the technique for |D| = 18, |A| = 5, |T| = 6 is given. The efficiency of soft clustering of subscribers into groups was 0.91, which suggests that the technique is advisable for this class of problems.
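The thresholding step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how Ζ is derived from Θ, so the aggregation rule used here (averaging the topic distributions of all messages a subscriber sent or received) is an assumption made only for illustration, as are the function names, the per-message row layout of Θ, and the toy data. Only the threshold 1/|T| comes from the abstract.

```python
from collections import defaultdict


def subscriber_topic_matrix(messages, theta):
    """Build a sketch of the matrix Z (subscriber -> topic probabilities).

    messages: list of (sender, recipient) pairs, one per message d_j.
    theta:    list of per-message topic distributions; theta[j][t] is the
              probability of topic t in message j (each row sums to 1).
    ASSUMPTION: a subscriber's row of Z is the mean of the topic
    distributions of the messages the subscriber sent or received.
    """
    sums = defaultdict(lambda: [0.0] * len(theta[0]))
    counts = defaultdict(int)
    for (sender, recipient), dist in zip(messages, theta):
        # A set avoids double counting when sender == recipient.
        for subscriber in {sender, recipient}:
            for t, p in enumerate(dist):
                sums[subscriber][t] += p
            counts[subscriber] += 1
    return {a: [v / counts[a] for v in row] for a, row in sums.items()}


def groups_by_topic(z, num_topics):
    """Assign a subscriber to every topic whose probability exceeds 1/|T|."""
    threshold = 1.0 / num_topics
    groups = {t: set() for t in range(num_topics)}
    for subscriber, row in z.items():
        for t, p in enumerate(row):
            if p > threshold:
                groups[t].add(subscriber)
    return groups


# Hypothetical toy data: 3 messages, 3 subscribers, 3 topics.
messages = [("a1", "a2"), ("a2", "a3"), ("a1", "a3")]
theta = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]

z = subscriber_topic_matrix(messages, theta)
groups = groups_by_topic(z, num_topics=3)
for topic, members in groups.items():
    print(topic, sorted(members))
```

Because the comparison is against 1/|T| rather than against the row maximum, a subscriber can fall into several groups at once, which matches the soft clustering mentioned in the abstract.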

Pages: 57-62
For citation

Ivanov A.A., Vedernikov E.K., Smirnov A.A. Method for automated data processing based on topic modeling. Achievements of modern radioelectronics. 2021. V. 75. № 4. P. 57–62. DOI: https://doi.org/10.18127/j20700784-202104-08 [in Russian]

References
  1. Ivanov A.A., Kudryavtsev A.M., Smirnov A.A. Kontseptual'nye problemy informatsionno-analiticheskoy raboty v sovremennom voennom protivostoyanii. Voennaya mysl'. 2020. № 9. S. 79–85. [in Russian]
  2. Wang W., Kennedy R., Lazer D., Ramakrishnan N. Growing pains for global monitoring of societal events. Science. 2016. V. 353 (6307). P. 1502–1503.
  3. Wiil U. Counterterrorism and Open Source Intelligence. Lecture Notes in Social Networks. Springer. 2011.
  4. Dhar V. Data Science and Prediction. Communications of the ACM. 2013. V. 56. № 1. P. 64–73.
  5. Vorontsov K.V. Veroyatnostnoe tematicheskoe modelirovanie: teoriya, modeli, algoritmy i proekt BigARTM. URL: http://www.machinelearning.ru/wiki/images/d/d5/Voron17survey-artm.pdf. [in Russian]
  6. Lukashevich N.V. Tezaurusy v zadachakh informatsionnogo poiska. M.: Izd-vo Moskovskogo universiteta. 2011. [in Russian]
  7. Eisenstein J., Ahmed A., Xing E.P. Sparse additive generative models of text. ICML’11. 2011. P. 1041–1048.
  8. Shang J., Liu J., Jiang M., Ren X., Voss C.R., Han J. Automated phrase mining from massive text corpora. CoRR. 2017. V. abs/1702.04457.
  9. Vorontsov K.V. Additivnaya regulyarizatsiya tematicheskikh modeley kollektsiy tekstovykh dokumentov. Doklady RAN. 2014. T. 456. № 3. S. 268–271. [in Russian]
  10. Vorontsov K.V., Potapenko A.A. Modifikatsiya EM-algoritma dlya veroyatnostnogo tematicheskogo modelirovaniya. Mashinnoe obuchenie i analiz dannykh. 2013. T. 1. № 6. S. 657–686. [in Russian]
  11. Grekhem R., Knut D., Patashnik O. Konkretnaya matematika. M.: Mir. 1998. [in Russian]
  12. Sayt proekta BigARTM. URL: https://bigartm.org. [in Russian]
  13. Programmnaya realizatsiya i iskhodnye dannye primera. URL: https://github.com/EgoVed/topic_model. [in Russian]
Date of receipt: 10.03.2021
Approved after review: 24.03.2021
Accepted for publication: 01.04.2021