Creating text corpora for special purposes on the basis of extended TXM platform

Journal Highly available systems, №3, 2018

Type of article: scientific article

DOI: 10.18127/j20729472-201803-13

UDC: 81’33: 519.76

Keywords: corpus linguistics, automatic morphological analysis, the TXM platform, correspondence analysis, specificity, identification of extremist texts

Authors:

A.M. Lavrentiev – Ph.D.(Philol.), CNRS & ENS de Lyon (France)

E-mail: alexei.lavrentev@ens-lyon.fr

I.V. Smirnov – Ph.D.(Phys.-Math.), Head of Department, FRC «Computer Science and Control» of RAS (Moscow)

E-mail: ivs@isa.ru

M.I. Suvorova – Research Scientist, FRC «Computer Science and Control» of RAS (Moscow)

E-mail: ananyeva@isa.ru

F.N. Solov'ev – Research Scientist, Institute of Physical and Technical Informatics (Protvino)

E-mail: the0@yandex.ru

A.I. Fokina – Student, HSE (Moscow)

E-mail: aifokina@edu.hse.ru

A.M. Chepovskiy – Dr.Sc.(Eng.), Professor, HSE (Moscow)

E-mail: achepovskiy@hse.ru

Abstract:

The TXM platform offers a wide range of corpus analysis capabilities, including correspondence analysis, clustering, lexical table construction, and parametrized subcorpus selection. The default structural unit of analysis in TXM is the token. However, each token can be supplied with a number of features, which enables more sophisticated yet flexible corpus analysis. The only extension available by default is TreeTagger, which adds automated morphological analysis of tokens to the platform. In this work we present a number of tools for more extensive and complex corpus analysis, relying both on our previously developed tools and on publicly available ones.

Pages: 76-81

Date of receipt: 3 August 2018