Journal Neurocomputers, № 10, 2013
Article in issue:
Russian speech recognition system using deep neural networks and finite-state transducers
Keywords:
finite-state machines
deep neural networks
speech recognition
transducers
Russian language
acoustic modeling
Authors:
M.Yu. Zulkarneev - Ph.D. (Phys.-Math.), Senior Research Scientist, SRI «Spetsvuzavtomatyka». E-mail: zulkarneev@mail.ru
S.A. Repalov - Ph.D. (Phys.-Math.), SRI «Spetsvuzavtomatyka». E-mail: s.repalov@niisva.org
N.G. Shamraev - Research Scientist, SRI «Spetsvuzavtomatyka». E-mail: ncam1977@yahoo.ru
Abstract:
In speech recognition systems, hidden Markov models (HMM) with Gaussian mixture models (GMM) are widely used to model the acoustic signal and the distribution of feature vectors over states. However, these models have the following disadvantages: first, they model context dependence inaccurately, because the relationship between states is described by a single number, the transition probability; second, GMMs are inefficient at modeling data located on or near a nonlinear manifold.
In this work, we propose to use deep neural networks (DNN) to model the acoustic signal. Using DNNs overcomes the drawbacks of the HMM+GMM approach. The paper describes the DNN training procedure in detail; it is based on representing the DNN as a stack of restricted Boltzmann machines.
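The abstract leaves the pretraining step at the level of a description; the standard procedure it refers to trains each layer as a restricted Boltzmann machine (RBM) with contrastive divergence before supervised fine-tuning. Below is a minimal numpy sketch of one CD-1 update for a single RBM; the layer sizes, learning rate, and batch are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.01):
    """One contrastive-divergence (CD-1) update on a batch v0 of shape (batch, n_visible)."""
    # Positive phase: hidden activations driven by the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(v0.dtype)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # CD-1 approximation to the log-likelihood gradient.
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)

# Toy usage: 39-dimensional feature frames, 128 hidden units.
n_vis, n_hid = 39, 128
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
cd1_step(rng.random((32, n_vis)), W, b_vis, b_hid)
```

Each trained layer's hidden activations then serve as visible data for the next RBM, which is what the stack-of-RBMs representation amounts to in practice.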
The DNNs are used to calculate the probabilities of the states given the current observation. These probabilities are then used for full-text Russian speech recognition by means of finite-state transducers (FST).
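The abstract does not spell out how these probabilities enter the decoder; in the usual hybrid DNN/HMM setup (an assumption here, not a statement from the paper), the softmax posteriors P(s | o_t) are divided by the state priors P(s), giving scaled likelihoods that can stand in for GMM scores. A short numpy sketch:

```python
import numpy as np

def state_log_likelihoods(logits, log_priors):
    """logits: (frames, states) raw DNN outputs; log_priors: (states,) log P(s)."""
    # Log-softmax over states gives log posteriors log P(s | o_t).
    log_post = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    # Dividing by the prior (subtracting in log space) yields log P(o_t | s)
    # up to a constant, which is the score the decoder actually needs.
    return log_post - log_priors

logits = np.random.randn(100, 2000)            # 100 frames, 2000 tied states
log_priors = np.log(np.full(2000, 1 / 2000))   # illustrative uniform priors
scores = state_log_likelihoods(logits, log_priors)
```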
The proposed method rests on representing the feature-vector sequence as a sequence of symbols, so that it can serve as input to a finite-state transducer. The acoustic information, given by the acoustic probabilities, can be encoded in the finite-state transducer as transition weights. The acoustic information is then represented in the same form as the linguistic information, i.e. as a finite-state transducer. The problem of finding the optimal hypothesis thus reduces to constructing the finite-state transducer that represents the recognition network and finding the shortest path in it.
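One way to read this encoding (a sketch, not the paper's exact construction) is a linear acceptor: each frame contributes a bank of parallel arcs labeled with state ids and weighted by -log P(state | frame), so that in the tropical semiring the weight of a path accumulates the acoustic cost of a state sequence. The sketch below assumes the pywrapfst bindings of OpenFst, the library cited in the references.

```python
import numpy as np
import pywrapfst as fst

def acoustic_fst(log_probs):
    """Build a linear weighted acceptor from (frames, states) log posteriors."""
    a = fst.Fst()  # default "standard" arcs: tropical semiring, weights add along paths
    prev = a.add_state()
    a.set_start(prev)
    for frame in log_probs:
        nxt = a.add_state()
        for state_id, lp in enumerate(frame, start=1):  # label 0 is reserved for epsilon
            w = fst.Weight(a.weight_type(), float(-lp))
            a.add_arc(prev, fst.Arc(state_id, state_id, w, nxt))
        prev = nxt
    a.set_final(prev, fst.Weight.one(a.weight_type()))
    return a

# Toy input: 10 frames over 5 states, each row a posterior distribution.
A = acoustic_fst(np.log(np.random.dirichlet(np.ones(5), size=10)))
A.write("acoustic.fst")
```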
Constructing the FS transducer equivalent to the recognition network amounts to imposing a series of constraints on the FST that contains the acoustic information. This is done by composing it with the following transducers (a sketch using standard FST operations follows the list):
the «state-phoneme» transducer specifies the scheme for constructing phonemes from states;
the «phoneme-word» transducer specifies the phonetic transcriptions of words;
the «words-words» transducer specifies the language model.
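Given those three transducers, building the recognition network and extracting the best hypothesis reduces to standard FST operations: compose the acoustic acceptor with each constraint in turn, then take the shortest path. A sketch with hypothetical file names, again assuming pywrapfst:

```python
import pywrapfst as fst

A = fst.Fst.read("acoustic.fst")     # acoustic acceptor from the previous sketch
H = fst.Fst.read("state2phone.fst")  # «state-phoneme» transducer
L = fst.Fst.read("phone2word.fst")   # «phoneme-word» transducer (lexicon)
G = fst.Fst.read("grammar.fst")      # «words-words» transducer (language model)

# Composition requires arcs sorted on the matching side.
for t in (H, L, G):
    t.arcsort(sort_type="ilabel")

net = fst.compose(fst.compose(fst.compose(A, H), L), G)
best = fst.shortestpath(net)  # lowest-cost path = optimal word hypothesis
```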
The experimental verification of the proposed method is described in the article, along with a practical implementation of the system and the speech corpus on which the experiments were conducted. The experiments show that the proposed method improves recognition accuracy through the use of deep neural networks. In addition, the use of finite-state machines (transducers) makes it easy to build recognition networks of the required configuration, incorporating other sources of information, with standard FST tools (libraries).
Pages: 40-46
References
- Rabiner L.R. A tutorial on hidden Markov models and selected applications in speech recognition // Proceedings of the IEEE. 1989. V. 77. № 2. P. 257-285.
- Graves A., Fernández S., Schmidhuber J. Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition // Artificial Neural Networks: Formal Models and Their Applications - ICANN 2005. Lecture Notes in Computer Science. 2005. V. 3697. P. 799-804.
- Hochreiter S., Schmidhuber J. Long Short-Term Memory // Neural Computation. 1997. V. 9. № 8. P. 1735-1780.
- Hinton G., Deng L., Yu D., Dahl G., Mohamed A., Jaitly N., Senior A., Vanhoucke V., Nguyen P., Sainath T., Kingsbury B. Deep Neural Networks for Acoustic Modeling in Speech Recognition // IEEE Signal Processing Magazine. 2012. V. 29. № 6. P. 82-97.
- Hinton G.E., Osindero S., Teh Y. A fast learning algorithm for deep belief nets // Neural Computation. 2006. V. 18. № 7. P. 1527-1554.
- Young S.J. The HTK Book. Version 3.4. March 2006.
- Hinton G.E. Training products of experts by minimizing contrastive divergence // Neural Computation. 2002. V. 14. № 8. P. 1771-1800.
- Bergstra J., Breuleux O., Bastien F., Lamblin P., Pascanu R., Desjardins G., Turian J., Warde-Farley D., Bengio Y. Theano: A CPU and GPU Math Expression Compiler // Proceedings of the Python for Scientific Computing Conference (SciPy). Austin. June 30-July 3, 2010.
- Allauzen C., Riley M., Schalkwyk J., Skut W., Mohri M. OpenFst: a general and efficient weighted finite-state transducer library // Proceedings of the 12th International Conference on Implementation and Application of Automata. 2007. P. 11-23.
- Zulkarneev M.Yu., Repalov S.A., Shamraev N.G., Edel D.A. A study of a neural-like d-gram model for modeling the Russian language // Nejrokomp'yutery': razrabotka, primenenie [Neurocomputers: Development, Application]. 2013. № 4. P. 14.
- Savelyev A.V. A general theory of self-organizing neurocontrol // Nejrokomp'yutery': razrabotka, primenenie [Neurocomputers: Development, Application]. 2013. № 5. P. 3-13.