Audio-Visual Speech Recognition and Synthesis


Abhishek Jha


Understanding speech in the absence of audio, from the visual perception of lip motion, can aid a variety of computer vision applications. Systems that comprehend ‘silent’ speech hold promise for low-bandwidth video calling, for speech transmission in acoustically noisy environments, and as an aid for the hearing impaired. Despite these opportunities, modelling lips in silent speech videos purely by observing the speaker’s lip motion is highly difficult. Although developments in automatic speech recognition (ASR) have yielded better audio speech recognition systems over the last two decades, their performance deteriorates drastically in the presence of noise. This calls for a computer vision solution to the speech understanding problem. In this thesis, we present two solutions for modelling lips in silent speech videos.
In the first part of the thesis, we propose a word-spotting solution for searching spoken keywords in silent lip videos. In this work on visual speech recognition, our contributions are twofold: 1) we develop a pipeline for recognition-free retrieval and compare its performance against recognition-based retrieval on a large-scale dataset and on a separate set of out-of-vocabulary words; 2) we introduce a query expansion technique using pseudo-relevance feedback and propose a novel re-ranking method based on maximizing the correlation between spatio-temporal landmarks of the query and the top retrieval candidates. The proposed pipeline improves baseline performance by over 35% on the word-spotting task on one of the largest lipreading corpora. We demonstrate the robustness of our method through a series of experiments investigating domain invariance and out-of-vocabulary prediction, along with a careful analysis of results on the dataset. We also present qualitative results showing success and failure cases. Finally, we show an application of our method by spotting words in an archaic speech video.
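The query expansion step mentioned above can be illustrated with a minimal sketch. It assumes each lip video has already been reduced to a fixed-length feature vector and that retrieval ranks by cosine similarity; neither detail is stated in the abstract, and the function names `rank`, `expand_query` and `retrieve` are hypothetical. The landmark-based re-ranking is not covered here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, database):
    """Database indices sorted by decreasing similarity to the query."""
    return sorted(range(len(database)),
                  key=lambda i: cosine(query, database[i]),
                  reverse=True)

def expand_query(query, database, top_k=2):
    """Pseudo-relevance feedback: treat the top-k retrievals as relevant
    and average them with the original query (an assumed, simple variant)."""
    top = rank(query, database)[:top_k]
    vectors = [query] + [database[i] for i in top]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def retrieve(query, database, top_k=2):
    """Rank the database again using the expanded query."""
    return rank(expand_query(query, database, top_k), database)

# Toy 2-D features standing in for learned lip-video descriptors.
database = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
ranking = retrieve([1.0, 0.0], database)
```

Averaging is only one way to fold the pseudo-relevant items back into the query; a weighted combination would fit the same structure.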
In the second part of our work, we propose a lip-synchronization solution for ‘visually’ redubbing speech videos in a target language. Current methods adapt a native speech video to a foreign language either by placing subtitles in the video, which distracts the viewer, or by redubbing the audio in the target language. The latter leaves the speaker’s lip motion unsynchronized with the redubbed audio, making the video appear unnatural. In this work, we propose two lip-synchronization methods: 1) cross-accent lip-synchronization for audio dubbed in a different accent of the same language, and 2) cross-language lip-synchronization for speech videos dubbed in a different language. Since the visemes remain the same in cross-accent dubbing, we propose a dynamic programming algorithm to align the visual speech from the original video with the accented speech in the target audio. In cross-language dubbing the overall linguistic content changes, hence we propose a lip-synthesis model conditioned on the redubbed audio. Finally, a user study is conducted, which validates our claim of a better viewing experience in comparison to baseline methods. We present applications of both methods by ‘visually’ redubbing Andrew Ng’s machine learning tutorial video clips in Indian-accented English and in Hindi, respectively.
In the final part of this thesis, we propose an improved method for 2D lip-landmark localization. We investigate current landmark localization techniques in the facial domain and in human pose estimation to identify the shortcomings in adapting these methods to the task of lip-landmark localization. Present state-of-the-art methods in the domain consider lip landmarks as a subset of facial landmarks and hence do not explicitly optimize for them. In this work, we propose a new lip-centric loss formulation on the existing stacked-hourglass architecture, which improves the baseline performance. Finally, we use the 300W and 300VW face datasets to show the performance of our methods and compare them with the baselines.
Overall, in this thesis we examine current methods of lip modelling, investigate their shortcomings and propose solutions to overcome those challenges. We perform detailed studies and ablation studies of our proposed methods and report both success and failure cases. We compare our solutions with the current baselines on challenging datasets, reporting quantitative results and demonstrating qualitative performance. Our proposed solutions improve the baseline performance in their individual domains.
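The cross-accent alignment in the second part is described only as a dynamic programming algorithm. As an illustration, the sketch below uses classic dynamic time warping (DTW), a standard dynamic-programming alignment that is swapped in here as an assumption and may differ from the thesis method. The scalar per-frame features and the function name `dtw_align` are hypothetical.

```python
def dtw_align(video_feats, audio_feats, dist=lambda a, b: abs(a - b)):
    """Align two non-empty sequences with classic DTW.

    Returns (total_cost, path), where path is a list of
    (video_index, audio_index) pairs from start to end.
    """
    n, m = len(video_feats), len(audio_feats)
    inf = float("inf")
    # cost[i][j]: cheapest alignment of the first i video and j audio frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(video_feats[i - 1], audio_feats[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance video only
                                 cost[i][j - 1])      # advance audio only
    # Backtrack from the end along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return cost[n][m], path

# Toy example: the audio holds the middle unit for one extra frame.
total, path = dtw_align([0, 1, 2], [0, 1, 1, 2])
```

In this toy case the returned path maps video frame 1 to two audio frames, which is the kind of stretching needed when the accented audio lingers on a sound longer than the original speaker did.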


Year of completion:  April 2019
 Advisor : C V Jawahar, Vinay P. Namboodiri

Related Publications

  • Abhishek Jha, Vinay P. Namboodiri and C.V. Jawahar - Spotting words in silent speech videos: a retrieval-based approach, Machine Vision and Applications 2019 [PDF]

  • Abhishek Jha, Vinay Namboodiri and C.V. Jawahar - Word Spotting in Silent Lip Videos, IEEE Winter Conference on Applications of Computer Vision (WACV 2018), Lake Tahoe, CA, USA, 2018 [PDF]