Journal Neurocomputers, No. 4, 2013
Article in issue:
d-Gram language model investigation for Russian language modeling
Authors:
M.Yu. Zulkarneev, S.A. Repalov, N.G. Shamraev, D. Edel
Abstract:
The language model is one component of a speech recognition system, and it plays a large role in recognition. The n-gram models most widely used at present, however, do not account for the properties of natural languages. This paper attempts to remedy that situation: it presents the d-gram language model and several of its extensions for modeling the Russian language. These models make it possible to include information about syntactic dependencies between words in the language model. To obtain these dependencies, a tree of syntactic dependencies must be built for each sentence; it is generated with the dependency parser MaltParser. The POS (part-of-speech) tagger TreeTagger is used to obtain information about the grammatical characteristics of the words. The d-gram models are implemented with the factored-model formalism, which generalizes n-gram language models. Such a model attaches additional factors to words as characteristic word features, and its distinctive property is that the factors may represent completely different kinds of data. The model also supports generalized parallel back-off, meaning that factors can be dropped in an arbitrary order during back-off, in contrast to the standard n-gram model, where the most distant history word is always dropped. A database of Russian texts was used in the experiments to verify the proposed model. Two types of experiments were performed: 1) testing a baseline 3-gram language model; 2) testing d-gram language models that use "head" words as factors. The experimental results show that the model with the H1 factor is much better than the model with the H0 factor; it also gives a much better result than the baseline 3-gram model (perplexity decreases by 54%). Adding the H2 factor improves the model by a further 4.5%.
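As a minimal sketch of the factored-model formalism named in the abstract (notation below is illustrative and assumed here, not taken from the paper): each word is treated as a bundle of factors, e.g. the surface form plus its syntactic head (the H1 factor) and the head's head (H2),

\[
  w_t \equiv \{f_t^{1}, \dots, f_t^{K}\}, \qquad
  p(w_t \mid w_{t-1}, w_{t-2}) \approx p\bigl(f_t^{1:K} \mid f_{t-1}^{1:K}, f_{t-2}^{1:K}\bigr),
\]

and generalized parallel back-off takes the form

\[
  p_{\mathrm{GPB}}(f \mid f_1, f_2, f_3) =
  \begin{cases}
    d_c \, p_{\mathrm{ML}}(f \mid f_1, f_2, f_3), & c(f, f_1, f_2, f_3) > \tau, \\
    \alpha(f_1, f_2, f_3) \, g(f, f_1, f_2, f_3), & \text{otherwise,}
  \end{cases}
\]

where $c$ is the training count, $\tau$ a count threshold, $d_c$ a discount, $\alpha$ the back-off weight, and $g$ combines the lower-order models obtained by dropping any one conditioning factor (for instance a head-word factor rather than the most distant surface word), instead of the single fixed back-off path of a standard n-gram model.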
Pages: 14-17
References
  1. Damerau F. Markov models and linguistic theory. Mouton. 1971.
  2. Pak A., Paroubek P. Text representation using dependency tree subgraphs for sentiment analysis // Lecture Notes in Computer Science. 2011. V. 6637/2011. P. 323-332.
  3. Kasevich V.B. Struktura predlozheniya. Elementy obshchej lingvistiki [Sentence structure. Elements of general linguistics]. Moscow: Nauka. 1977. P. 91-92.
  4. Henderson J. Novel Speech Recognition Models for Arabic // Johns Hopkins Summer Workshop, 2002.
  5. Sharoff S., Nivre J. The proper place of men and machines in language technology: Processing Russian without any linguistic knowledge // Proc. Dialogue 2011. Russian Conference on Computational Linguistics.
  6. Schmid H. Probabilistic part-of-speech tagging using decision trees // Proceedings of International Conference on New Methods in Language Processing. Manchester, UK. 1994.