Real-time voice conversion using artificial neural network with rectified linear units

Journal Neurocomputers, No. 5, 2014

Keywords: voice conversion, artificial neural networks

Authors:

I. S. Azarov - Ph.D. (Eng.), Belarusian State University of Informatics and Radioelectronics. E-mail: azarov@bsuir.by
M. I. Vashkevich - Assistant, Belarusian State University of Informatics and Radioelectronics. E-mail: vashkevich@bsuir.by
A. A. Petrovsky - Dr.Sc. (Eng.), Professor, Belarusian State University of Informatics and Radioelectronics. E-mail: palex@bsuir.by

Abstract:

The paper presents a voice conversion technique suitable for real-time applications. The technique transforms short-time spectral envelopes of speech using an artificial neural network with rectified linear units. A special network configuration is used that takes into account temporary speaker states. Speech is represented by the instantaneous parameters of the harmonic plus noise model. The proposed technique is compared with the main alternative techniques using objective and subjective measures.

Pages: 18-28
