Quality Evaluation of Reverberant Speech Based on Deep Learning

Document Type: Original Article

Authors

Department of Electronics and Electrical Communications Engineering, Faculty of Electronic Engineering, Menoufia University, Menouf 32951, Egypt.

Abstract

This paper presents an efficient approach for classifying speech signals as reverberant or non-reverberant. Reverberation is a severe effect encountered in closed rooms, and it may degrade subsequent processes and deteriorate the performance of speech processing systems. Spectrograms are utilized as images generated from the speech signals and classified with deep Convolutional Neural Networks (CNNs). In addition, spectrogram and Mel-Frequency Cepstral Coefficient (MFCC) features are classified with a Long Short-Term Memory Recurrent Neural Network (LSTM RNN). Two models are presented and compared, and simulation results with classification accuracies of up to 100% are obtained. This approach can serve as an initial quality-level classification step in any speech processing system.
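The abstract does not include implementation details, so the following is only a minimal sketch, assuming standard librosa feature extraction and Keras layers, of the two model families it describes: a deep CNN fed with spectrogram images and an LSTM RNN fed with MFCC sequences. All layer sizes, input dimensions, and hyper-parameters are illustrative assumptions, not the authors' configuration.

# Minimal sketch (not the authors' exact architecture): two binary classifiers
# for "reverberant vs. clean" speech. Feature shapes and layer sizes are assumed.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

def spectrogram_image(path, sr=16000, n_mels=128, frames=128):
    """Log-mel spectrogram cropped/padded to a fixed-size 'image'."""
    y, _ = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)
    S_db = librosa.util.fix_length(S_db, size=frames, axis=1)
    return S_db[..., np.newaxis]          # shape: (n_mels, frames, 1)

def mfcc_sequence(path, sr=16000, n_mfcc=13, frames=128):
    """MFCC feature sequence for the recurrent model."""
    y, _ = librosa.load(path, sr=sr)
    M = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    M = librosa.util.fix_length(M, size=frames, axis=1)
    return M.T                            # shape: (frames, n_mfcc)

def build_cnn(input_shape=(128, 128, 1)):
    """Model 1: deep CNN treating the spectrogram as an image."""
    return models.Sequential([
        layers.Conv2D(16, 3, activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),   # reverberant vs. clean
    ])

def build_lstm(timesteps=128, n_features=13):
    """Model 2: LSTM RNN over MFCC (or spectrogram) frame sequences."""
    return models.Sequential([
        layers.LSTM(64, input_shape=(timesteps, n_features)),
        layers.Dense(1, activation="sigmoid"),
    ])

# Usage (X_*, y_* assumed to be built from labelled clean/reverberant recordings):
# model = build_cnn()
# model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20)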
