Face Reenactment: Crafting Realistic Talking Heads for Enhanced Video Communication and Beyond

Madhav Agarwal

Abstract

Face Reenactment and Synthetic Talking Head works have been widely popular for creating realistic face animations by using a single image of a person. In light of the recent developments in processing facial features in images and videos, as well as the ability to create realistic talking heads, We are focusing on two promising applications. These applications include utilizing face reenactment for movie dubbing and compressing video calls where the primary object is a talking face. We propose a novel method to generate realistic talking head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense motion field generated using learnable keypoints. We use audio as an additional input for high-quality lip sync, by helping the network to attend to the mouth region. We use additional priors using face segmentation and face mesh to preserve the structure of the reconstructed faces. Finally, we incorporate a carefully designed identity- aware generator module to get realistic quality of talking heads. The identity-aware generator takes the source image and the warped motion features as input to generate a high-quality output with fine-grained details. Our method produces state-of-the-art results and generalizes well to unseen faces, languages, and voices. We comprehensively evaluate our approach using multiple metrics and outperforming the current techniques both qualitative and quantitatively. Our work opens up several applications, including enabling low-bandwidth video calls and movie dubbing. We leverage the advancements in talking head generation to propose an end-to-end system for video call compression. Our algorithm transmits pivot frames intermittently while the rest of the talking head video is generated by animating them. We use a state-of-the-art face reenactment network to detect keypoints in the non-pivot frames and transmit them to the receiver. A dense flow is then calculated to warp a pivot frame to reconstruct the non-pivot ones. Transmitting keypoints instead of full frames leads to significant compression. We propose a novel algorithm to adaptively select the best-suited pivot frames at regular intervals to provide a smooth experience. We also propose a frame-interpolater at the receiver’s end to improve the compression levels further. Finally, a face enhancement network improves reconstruction quality, significantly improving several aspects, like the sharpness of the generations. We evaluate our method both qualitatively and quantitatively on benchmark datasets and compare it with multiple compression techniques

Year of completion:	June 2023
Advisor :	C V Jawahar, Vinay P Namboodiri