Lip-to-Speech Synthesis


Rudrabha Mukhopadhyay

Abstract

This thesis explores the development and advancement of lip-to-speech synthesis techniques, addressing the challenge of generating speech directly from visual lip movements. Unlike text-to-speech systems, which rely on explicit linguistic information in the form of text tokens, lip-to-speech synthesis must interpret ambiguous visual cues, since visually similar lip shapes can produce different sounds. Inspired by the chronological advancements in text-to-speech synthesis, the research goals are divided into single-speaker lip-to-speech, where a dedicated model is trained for each speaker using a large amount of speaker-specific data, followed by multi-speaker approaches, which aim to train a single model that works for any speaker in the wild.

The first work presented in this thesis addresses the lip-to-speech generation problem for a large vocabulary in unconstrained settings, albeit with a model trained for a particular speaker. This work introduced a novel sequence-to-sequence model that leveraged spatio-temporal convolutional architectures to capture the fine-grained temporal dynamics of lip movements, together with a monotonic attention mechanism that more accurately aligned the visual features with the corresponding speech parameters. Testing on the LRS2 dataset showed a 24% improvement in intelligibility metrics over baseline methods. This work also released a new dataset providing sufficient speaker-specific data with a diverse vocabulary of around 5,000 words to support the development of accurate, speaker-specific models. While this approach showed promise, it was limited to single-speaker scenarios and did not scale effectively to sentence-level multi-speaker tasks, necessitating further research.
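To make the described pipeline concrete, the following is a minimal PyTorch sketch, not the thesis implementation, of the general recipe: a spatio-temporal (3D) convolutional encoder over lip-crop video frames feeding a sequence-to-sequence decoder with attention that predicts mel-spectrogram frames. For brevity the sketch uses standard additive attention rather than the monotonic attention used in the work, and all module names, layer sizes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of a lip-to-speech model: 3D-conv video encoder + attentive
# GRU decoder predicting mel-spectrogram frames. Illustrative only; sizes and
# architecture choices are assumptions, not the thesis configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoEncoder(nn.Module):
    """Spatio-temporal (3D) convolutions over a lip-crop video clip."""
    def __init__(self, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool space
        )
        self.proj = nn.Linear(64, hidden)

    def forward(self, video):                    # video: (B, 3, T, H, W)
        feats = self.conv(video)                 # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        return self.proj(feats)                  # (B, T, hidden)


class AttentionDecoder(nn.Module):
    """GRU decoder with additive attention over encoder time steps,
    emitting one mel-spectrogram frame per step."""
    def __init__(self, hidden=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRUCell(n_mels + hidden, hidden)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(hidden, n_mels)
        self.n_mels = n_mels

    def forward(self, enc, n_frames):            # enc: (B, T, hidden)
        B, T, H = enc.shape
        h = enc.new_zeros(B, H)
        prev = enc.new_zeros(B, self.n_mels)
        mels = []
        for _ in range(n_frames):
            # attention weights over encoder time steps for the current state
            scores = self.attn(torch.cat([enc, h.unsqueeze(1).expand(-1, T, -1)], dim=-1))
            weights = F.softmax(scores.squeeze(-1), dim=-1)            # (B, T)
            context = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)  # (B, H)
            h = self.rnn(torch.cat([prev, context], dim=-1), h)
            prev = self.out(h)                   # next mel frame
            mels.append(prev)
        return torch.stack(mels, dim=1)          # (B, n_frames, n_mels)


if __name__ == "__main__":
    video = torch.randn(2, 3, 25, 96, 96)        # one second of lip crops at 25 fps
    enc = VideoEncoder()(video)
    mel = AttentionDecoder()(enc, n_frames=100)  # roughly one second of mel frames
    print(mel.shape)                             # torch.Size([2, 100, 80])
```

The predicted mel-spectrogram would then be converted to a waveform with a separate vocoder; that stage is omitted here.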


Year of completion: June 2025
Advisor: Dr. C.V. Jawahar



