Interactive Video Editing using Machine Learning Techniques

Anchit Gupta

Abstract

Video is arguably today's most popular medium for consuming content. With the rise of streaming platforms such as YouTube and Netflix, video content is accessible to more people than ever, and video creation has grown to meet the rising demand. To reach a wider audience, creators dub their content. An important aspect of dubbing is not only changing the speech but also lip-synchronizing the speaker in the video. Talking-face video generation works have achieved state-of-the-art results in synthesizing videos with accurate lip synchronization. However, most previous works deal with low-resolution talking-face videos (up to 256 × 256 pixels), so generating extremely high-resolution videos remains a challenge. Meanwhile, with advances in internet and camera technology, more and more people are able to create video content, often in ultra-high resolutions such as 4K (3840 × 2160). In this thesis, we propose a novel method to synthesize talking-face videos at resolutions as high as 4K. Our task presents several key challenges: (i) scaling existing methods to such high resolutions is resource-constrained, both in terms of compute and the availability of very high-resolution datasets; (ii) the synthesized videos need to be spatially and temporally coherent. The sheer number of pixels that the model needs to generate while maintaining temporal consistency at the video level makes this task non-trivial, and it has not previously been attempted in the literature. To address these issues, we propose, for the first time, to train the lip-sync generator in a compact Vector Quantized (VQ) space. Our core idea of encoding the faces in a compact 16 × 16 representation allows us to model high-resolution videos. In our framework, we learn the lip movements in the quantized space on the newly collected 4K Talking Faces (4KTF) dataset.
Our approach is speaker-agnostic and can handle various languages and voices. We benchmark our technique against several competitive works and show that we can generate 64 times more pixels than the current state of the art. But how can one edit videos using the above algorithm, or any other deep learning algorithm? Currently, a user has to download the source code of the required method and run it manually. It would be far more convenient if people could apply deep learning techniques inside a video editor with a single click. In this thesis, we also propose a video editor based on OpenShot with several state-of-the-art facial video editing algorithms as added functionalities. Our editor provides an easy-to-use interface to apply modern lip-syncing algorithms interactively. Apart from lip-syncing, the editor also uses audio and facial re-enactment to generate expressive talking faces. The manual control improves the overall experience of video editing without missing out on the benefits of modern synthetic video generation algorithms. This control enables us to lip-sync complex dubbed movie scenes, interviews, television shows, and other visual content. Furthermore, our editor provides features that automatically translate lectures, covering the spoken content, the lip-sync of the professor, and background content such as slides. While doing so, we also tackle the critical aspect of synchronizing background content with the translated speech. We qualitatively evaluate the usefulness of the proposed editor by conducting human evaluations. Our evaluations show a clear improvement in the efficiency of human editors and improved video generation quality.
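
The vector-quantization idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it only shows the generic nearest-codebook lookup that turns a 16 × 16 grid of continuous latent vectors into a 16 × 16 grid of discrete codebook indices. The latent dimension, codebook size, random data, and the function names `quantize` and `dequantize` are all hypothetical choices made for illustration.

```python
import random

GRID = 16          # 16 x 16 spatial grid, as stated in the abstract
DIM = 4            # hypothetical latent dimension (not from the thesis)
CODEBOOK_SIZE = 8  # hypothetical number of codebook entries


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Replace each latent vector by the index of its nearest codebook entry."""
    return [
        [min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
         for vec in row]
        for row in latents
    ]


def dequantize(indices, codebook):
    """Look up the codebook vector for every index in the grid."""
    return [[codebook[k] for k in row] for row in indices]


# Random stand-ins for a learned codebook and an encoder's output.
rng = random.Random(0)
codebook = [[rng.uniform(-1, 1) for _ in range(DIM)]
            for _ in range(CODEBOOK_SIZE)]
latents = [[[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(GRID)]
           for _ in range(GRID)]

indices = quantize(latents, codebook)   # 16 x 16 grid of integers
recon = dequantize(indices, codebook)   # 16 x 16 grid of codebook vectors
```

In a sketch like this, a face is summarized by 256 integers regardless of the pixel resolution the decoder eventually produces, which is the kind of compactness the abstract refers to.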

Year of completion:	November 2022
Advisor:	C V Jawahar, Vinay P Namboodiri