Visual recognition of human communications
Abstract:
The objective of this work is visual recognition of human communications. Solving this problem opens up a host of applications, such as transcribing archival silent films or resolving multi-talker simultaneous speech, but most importantly it helps to advance the state of the art in speech recognition by enabling machines to take advantage of the multi-modal nature of human communication.
Training a deep learning algorithm requires large amounts of training data. We propose a method to automatically collect, process and generate a large-scale audio-visual corpus from television videos temporally aligned with the transcripts. To build such a dataset, it is essential to know 'who' is speaking 'when'. We develop a ConvNet model that learns a joint embedding of the sound and the mouth images from unlabelled data, and apply this network to the tasks of audio-to-video synchronization and active speaker detection. We also show that the methods developed here can be extended to the problems of generating talking faces from audio and still images, and re-dubbing videos with audio samples from different speakers.
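The joint audio-visual embedding described above can be illustrated with a minimal sketch: two streams map audio features and mouth-image features into a shared space, and a contrastive pair loss pulls synchronized pairs together while pushing mismatched pairs at least a margin apart. The projection matrices, feature dimensions, and loss form below are illustrative assumptions, not the architecture from the talk; the embedding networks are stubbed with random linear projections.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W):
    """Project an input feature vector into the joint embedding space
    and unit-normalize it (stub for a learned ConvNet stream)."""
    v = W @ x
    return v / np.linalg.norm(v)

def contrastive_loss(e_audio, e_video, is_match, margin=1.0):
    """Distance-based pair loss: matched (synchronized) pairs are
    penalized by their squared distance; mismatched pairs are penalized
    only if they fall within the margin."""
    d = np.linalg.norm(e_audio - e_video)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Hypothetical inputs: an audio feature window and a mouth-image feature.
W_a = rng.standard_normal((64, 128))   # audio-stream projection (assumed dims)
W_v = rng.standard_normal((64, 256))   # video-stream projection (assumed dims)
audio = rng.standard_normal(128)
mouth = rng.standard_normal(256)

loss_pos = contrastive_loss(embed(audio, W_a), embed(mouth, W_v), is_match=True)
loss_neg = contrastive_loss(embed(audio, W_a), embed(mouth, W_v), is_match=False)
```

At test time, the same embedding distance can be swept over temporal offsets between the audio and video streams; the offset with the smallest distance gives the synchronization estimate, and consistently large distances indicate the visible face is not the speaker.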
We then propose a number of deep learning models that are able to recognize visual speech at the sentence level. The lip reading performance beats that of a professional lip reader on videos from BBC television. We demonstrate that if audio is available, then visual information helps to improve speech recognition performance. We also propose methods to enhance noisy audio and to resolve multi-talker simultaneous speech using visual cues.
Finally, we explore the problem of speaker recognition. Whereas previous work on speaker identification has been limited to constrained conditions, here we build a new large-scale speaker recognition dataset collected from 'in the wild' videos using an automated pipeline. We propose a number of ConvNet architectures that outperform traditional baselines on this dataset.
Bio:
Joon Son is a recent graduate from the Visual Geometry Group at the University of Oxford, and a research scientist at Naver Corp. His research interests are in computer vision and machine learning.
