A Review of Audio-Visual Speech Recognition


  • Thum Wei Seong, Applied Electronic and Computer Engineering Cluster, Faculty of Electrical & Electronic Engineering, University Malaysia Pahang, 26600 Pekan, Pahang, Malaysia.
  • M. Z. Ibrahim, Applied Electronic and Computer Engineering Cluster, Faculty of Electrical & Electronic Engineering, University Malaysia Pahang, 26600 Pekan, Pahang, Malaysia.


Audio-Visual Speech Recognition, Audio-Visual Data Corpus, Feature Extraction, Model Validation Techniques, Performance Evaluation


Speech is the most important tool of interaction among human beings. This has inspired researchers to study speech recognition further and to develop computer systems that are able to integrate and understand human speech. However, acoustically noisy environments can heavily contaminate the audio signal and degrade overall recognition performance. Audio-Visual Speech Recognition (AVSR) is designed to overcome this problem by also utilising visual images, which are unaffected by acoustic noise. The aim of this paper is to discuss the structure of AVSR systems, including the front-end processes, the audio-visual data corpora used, recent works, and accuracy estimation methods.






How to Cite

Seong, T. W., & Ibrahim, M. Z. (2018). A Review of Audio-Visual Speech Recognition. Journal of Telecommunication, Electronic and Computer Engineering (JTEC), 10(1-4), 35–40. Retrieved from https://jtec.utem.edu.my/jtec/article/view/3573
