Lip-syncing Videos In The Wild


Prajwal K R

Abstract

The widespread access to the Internet has led to a meteoric rise in audio-visual content consumption. Our consumption habits have shifted from listening to podcasts and radio broadcasts to watching videos on YouTube, and we increasingly prefer the engaging nature of video calls over plain voice calls. Given this considerable shift in demand for audio-visual content, there has also been a surge in video content creation to cater to these consumption needs. Within this landscape of video content creation, especially videos of people talking, lies the problem of making these videos accessible across language barriers. If we want to translate a deep learning lecture video from English to Hindi, it is not only the speech that must be translated but also the visual stream, specifically the lip movements. Learning to lip-sync arbitrary videos to any desired target speech is a problem with several applications, ranging from video translation to readily creating new content that would otherwise require enormous effort. However, speaker-independent lip synthesis for any voice and language is a very challenging task. In this thesis, we tackle the problem of lip-syncing videos in the wild to any given target speech. We propose two new models in this space: one that significantly improves generation quality and another that significantly improves lip-sync accuracy.

In the first model, LipGAN, we identify key issues that plague current approaches for speaker-independent lip synthesis and prevent them from reaching the generation quality of speaker-specific models. Specifically, ours is the first model to generate face images that can be pasted back into the original video frame. This is crucial for real-world applications where the face is only a small part of the content being displayed. We show that our improvements in quality enable multiple real-world applications that have not been demonstrated in any of the previous lip-sync works.

In the second model, Wav2Lip, we investigate why current models are inaccurate while lip-syncing arbitrary talking face videos. We hypothesize that the reason is the weak penalization of off-sync lip shapes during training. Addressing this allows us to create a lip-sync model that can generate lip-synced videos for any identity and voice with remarkable accuracy and quality. We re-think the current evaluation framework for this task and propose multiple new benchmarks, two new metrics, and a Real-world lip-Sync Evaluation Dataset (ReSyncED). Using our model, we also show applications in lip-syncing dubbed movies and animating CGI movie clips to new speech, and we demonstrate a futuristic video call application that is useful under poor network connections. Finally, we present the two major applications that our model can impact the most: social media content creation and personalization, and video translation. We hope that our advances in lip synthesis open up new avenues for research in the space of talking face generation from speech.
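To make the "weak penalization" argument concrete, the sketch below illustrates one way a stronger lip-sync penalty can be imposed: a frozen, pre-trained SyncNet-style "expert" embeds an audio window and the generated lip crops into a shared space, and the generator is penalized when the two embeddings disagree. This is only a minimal illustration; the toy encoders, tensor shapes, and module names are assumptions and not the thesis implementation.

```python
# Minimal sketch (illustrative, not the thesis code): penalizing off-sync
# generations with a frozen SyncNet-style "expert" scorer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncExpert(nn.Module):
    """Toy audio/video encoders standing in for a pre-trained lip-sync expert.
    Input shapes below are assumed for illustration only."""
    def __init__(self, dim=512):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(80 * 16, dim))
        self.video_enc = nn.Sequential(nn.Flatten(), nn.Linear(5 * 3 * 48 * 96, dim))

    def forward(self, mel, lip_frames):
        a = F.normalize(self.audio_enc(mel), dim=-1)          # (B, dim)
        v = F.normalize(self.video_enc(lip_frames), dim=-1)   # (B, dim)
        return a, v

def sync_loss(expert, mel, generated_lips):
    """Expert sync penalty: push the cosine similarity of matching
    audio/lip pairs towards 1 via binary cross-entropy."""
    a, v = expert(mel, generated_lips)
    sim = (a * v).sum(dim=-1)        # cosine similarity in [-1, 1]
    prob = (sim + 1) / 2             # map to [0, 1]
    target = torch.ones_like(prob)   # generated lips should be in sync
    return F.binary_cross_entropy(prob, target)

# Usage sketch: the expert stays frozen while the generator is trained,
# so the generator cannot "fool" it by degrading the sync scorer itself.
expert = SyncExpert().eval()
for p in expert.parameters():
    p.requires_grad_(False)
mel = torch.randn(2, 80, 16)               # dummy mel-spectrogram window
fake_lips = torch.rand(2, 5, 3, 48, 96)    # dummy 5-frame generated lip crops
loss = sync_loss(expert, mel, fake_lips)   # add to the generator's objective
```

Keeping the sync scorer frozen is the key design choice this sketch is meant to convey: a discriminator trained jointly with the generator can be weakened over the course of training, whereas a fixed expert provides a consistently strong penalty for off-sync lip shapes.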

Year of completion: August 2020
Advisors: C V Jawahar, Vinay P. Namboodiri

Related Publications


Downloads

thesis