The overall pipeline of our proposed Audio-Visual Face Reenactment network (AVFR-GAN) is shown in the figure. We take the source and driving images, along with their face meshes and segmentation masks, to extract keypoints. An audio encoder extracts features from the driving audio and uses them to provide attention on the lip region. The audio and visual feature maps are warped together and passed, along with the extracted features of the source image, to the carefully designed Identity-Aware Generator, which generates the final output.
Abstract
This work proposes a novel method to generate realistic talking head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense motion field generated from learnable keypoints. We improve the quality of lip sync using audio as an additional input, helping the network attend to the mouth region. We use additional priors, in the form of face segmentation and a face mesh, to improve the structure of the reconstructed faces. Finally, we improve the visual quality of the generations by incorporating a carefully designed identity-aware generator module. The identity-aware generator takes the source image and the warped motion features as input to generate a high-quality output with fine-grained details. Our method produces state-of-the-art results and generalizes well to unseen faces, languages, and voices. We comprehensively evaluate our approach using multiple metrics and outperform current techniques both qualitatively and quantitatively. Our work opens up several applications, including enabling low-bandwidth video calls.
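As a rough illustration of the pipeline described above, the PyTorch-style sketch below wires the stages together. The sub-module interfaces, feature shapes, and the way audio attention is applied are our own assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AVFRPipelineSketch(nn.Module):
    """Illustrative wiring of the stages described above.
    Every sub-module here is a placeholder (assumption)."""

    def __init__(self, kp_detector, audio_encoder, dense_motion, generator):
        super().__init__()
        self.kp_detector = kp_detector      # keypoints from image + mesh + mask
        self.audio_encoder = audio_encoder  # features from the driving audio
        self.dense_motion = dense_motion    # keypoint pairs -> dense motion field
        self.generator = generator          # identity-aware generator

    def forward(self, source_img, driving_img, source_priors, driving_priors, audio):
        # 1. Keypoints from source and driving frames, guided by mesh/segmentation priors
        kp_source = self.kp_detector(source_img, source_priors)
        kp_driving = self.kp_detector(driving_img, driving_priors)

        # 2. Dense motion field from the keypoint pairs; assumed to be a
        #    normalized sampling grid of shape (B, H, W, 2)
        flow = self.dense_motion(kp_source, kp_driving)
        warped = F.grid_sample(source_img, flow, align_corners=True)

        # 3. Audio features provide soft attention over the lip region
        #    (assumed broadcastable to the warped feature map)
        lip_attention = torch.sigmoid(self.audio_encoder(audio))
        attended = warped + warped * lip_attention

        # 4. The identity-aware generator fuses the warped, attended features
        #    with features of the source image to produce the final frame
        return self.generator(attended, source_img)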
Paper
Audio-Visual Face Reenactment
Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar. Audio-Visual Face Reenactment. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023. [PDF] | [BibTeX]
@InProceedings{Agarwal_2023_WACV,
author = {Agarwal, Madhav and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C. V.},
title = {Audio-Visual Face Reenactment},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month = {January},
year = {2023},
pages = {5178-5187}
}
Demo
Contact
Madhav Agarwal
Rudrabha Mukhopadhyay
My View is the Best View: Procedure Learning from Egocentric Videos
Given multiple videos of a task, the goal is to identify the key-steps and the order in which they must be performed.
For example, provided multiple videos of making a pizza, the goal is to identify the steps required to prepare the pizza and their order.
EgoProceL Dataset
EgoProceL is a large-scale dataset for procedure learning. It consists of 62 hours of egocentric videos recorded by 130 subjects performing 16 tasks for procedure learning. EgoProceL contains videos and key-step annotations for multiple tasks from CMU-MMAC, EGTEA Gaze+, and individual tasks like toy-bike assembly, tent assembly, PC assembly, and PC disassembly.
Why an egocentric dataset for Procedure Learning?
Using third-person videos for procedure learning makes the manipulated object small in appearance and often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action.
Existing datasets for procedure learning largely consist of third-person videos. Third-person videos suffer from issues like occlusion and atypical camera locations that make them ill-suited for procedure learning. Additionally, these datasets rely on noisy videos from YouTube. In contrast, we propose to use egocentric videos, which overcome the issues posed by third-person videos. The third-person frames in the figure are from ProceL and CrossTask, and the first-person frames are from EgoProceL.
Overview of EgoProceL
EgoProceL consists of:
62 hours of videos captured by
130 subjects
performing 16 tasks
a maximum of 17 key-steps per task
an average foreground ratio of 0.38
an average missing-steps ratio of 0.12
an average repeated-steps ratio of 0.49
Downloads
We recommend referring to the README before downloading the videos. Mirror link.
We present a novel self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. Our experiments show that CnC outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets by 5.2% and 6.3%, respectively.
CnC takes in multiple videos from the same task and passes them through the embedder network trained using the proposed TC3I loss. The goal of the embedder network is to learn similar embeddings for corresponding key-steps from multiple videos and for temporally close frames. The ProCut Module (PCM) localizes the key-steps required for performing the task. PCM converts the clustering problem to a multi-label graph cut problem. The output provides the assignment of frames to the respective key-steps and their ordering.
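The simplified sketch below captures the two stated goals of the embedder objective, namely that temporally close frames within a video and corresponding frames across videos should receive similar embeddings. It is only an illustrative stand-in for the actual TC3I loss defined in the paper.

import torch
import torch.nn.functional as F

def correspondence_and_coherence_loss(emb_a, emb_b, window=2, temperature=0.1):
    """Illustrative stand-in for the TC3I objective.

    emb_a, emb_b: (T, D) frame embeddings of two videos of the same task.
    Goal (1): temporally close frames in a video get similar embeddings.
    Goal (2): corresponding frames across videos get similar embeddings.
    """
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)

    # (1) Temporal coherence within video A: nearby frames should agree
    coherence = (emb_a[:-window] - emb_a[window:]).pow(2).sum(-1).mean()

    # (2) Cross-video correspondence: pull each A-frame towards its
    #     soft nearest neighbour among the B-frames
    sim = emb_a @ emb_b.t() / temperature        # (T_a, T_b)
    soft_nn = F.softmax(sim, dim=1) @ emb_b      # (T_a, D)
    correspondence = (emb_a - soft_nn).pow(2).sum(-1).mean()

    return coherence + correspondence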
This work was supported in part by the Department of Science and Technology, Government of India, under DST/ICPS/Data-Science project ID T-138. A portion of the data used in this paper was obtained from kitchen.cs.cmu.edu and the data collection was funded in part by the National Science Foundation under Grant No. EEEC-0540865. We acknowledge Pravin Nagar and Sagar Verma for recording and sharing the PC Assembly and Disassembly videos at IIIT Delhi. We also acknowledge Jehlum Vitasta Pandit and Astha Bansal for their help with annotating a portion of EgoProceL.
Please consider citing if you make use of the EgoProceL dataset and/or the corresponding code:
@InProceedings{EgoProceLECCV2022,
author = "Bansal, Siddhant and Arora, Chetan and Jawahar, C.V.",
title = "My View is the Best View: Procedure Learning from Egocentric Videos",
booktitle = "European Conference on Computer Vision (ECCV)",
year = "2022"
}
@InProceedings{CMU_Kitchens,
author = "De La Torre, F. and Hodgins, J. and Bargteil, A. and Martin, X. and Macey, J. and Collado, A. and Beltran, P.",
title = "Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database.",
booktitle = "Robotics Institute",
year = "2008"
}
@InProceedings{egtea_gaze_p,
author = "Li, Yin and Liu, Miao and Rehg, James M.",
title = "In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video",
booktitle = "European Conference on Computer Vision (ECCV)",
year = "2018"
}
@InProceedings{meccano,
author = "Ragusa, Francesco and Furnari, Antonino and Livatino, Salvatore and Farinella, Giovanni Maria",
title = "The MECCANO Dataset: Understanding Human-Object Interactions From Egocentric Videos in an Industrial-Like Domain",
booktitle = "Winter Conference on Applications of Computer Vision (WACV)",
year = "2021"
}
@InProceedings{tent,
author = "Jang, Youngkyoon and Sullivan, Brian and Ludwig, Casimir and Gilchrist, Iain and Damen, Dima and Mayol-Cuevas, Walterio",
title = "EPIC-Tent: An Egocentric Video Dataset for Camping Tent Assembly",
booktitle = "International Conference on Computer Vision (ICCV) Workshops",
year = "2019"
}
Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors
We solve the problem of upsampling extremely low-resolution (LR) talking-face videos to generate high-resolution (HR) outputs. Our approach exploits the LR frames (8x8 pixels), the corresponding audio signal, and a single HR target identity image to synthesize realistic, high-quality talking-face videos (256x256 pixels).
Abstract
In this paper, we explore an interesting question of what can be obtained from an 8x8 pixel video sequence. Surprisingly, it turns out to be quite a lot. We show that when we process this 8x8 video with the right set of audio and image priors, we can obtain a full-length, 256x256 video. We achieve this 32x scaling of an extremely low-resolution input using our novel audio-visual upsampling network. The audio prior helps to recover the elemental facial details and precise lip shapes, and a single high-resolution target identity image prior provides us with rich appearance details. Our approach is an end-to-end multi-stage framework. The first stage produces a coarse intermediate output video that can then be used to animate the single target identity image and generate realistic, accurate and high-quality outputs. Our approach is simple and performs exceedingly well (an 8x improvement in FID score) compared to previous super-resolution methods. We also extend our model to talking-face video compression, and show that we obtain a 3.5x improvement in terms of bits/pixel over the previous state-of-the-art. The results from our network are thoroughly analyzed through extensive ablation and comparative analyses and demonstration videos (in the paper and supplementary material).
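A minimal sketch of the two-stage flow described above is given below; stage1 and stage2 are placeholder networks and the intermediate resolution is an assumption, not the paper's exact architecture.

def upsample_talking_face(lr_frames, audio_feats, hr_identity, stage1, stage2):
    """Illustrative two-stage flow; stage1/stage2 are placeholder networks.

    lr_frames:   (B, T, 3, 8, 8)    extremely low-resolution input video
    audio_feats: (B, T, C)          features of the corresponding audio
    hr_identity: (B, 3, 256, 256)   single high-resolution identity image
    """
    # Stage 1: audio prior + LR frames -> coarse intermediate video
    coarse = stage1(lr_frames, audio_feats)      # e.g. (B, T, 3, 96, 96), assumed

    # Stage 2: the coarse video animates the HR identity image -> 256x256 output
    return stage2(coarse, hr_identity)           # (B, T, 3, 256, 256)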
Paper
Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors
Sindhu B Hegde*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar. Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors. ACM-MM, 2022. [PDF] | [BibTeX]
Updated Soon
Demo
--- COMING SOON ---
Contact
Sindhu Hegde
Rudrabha Mukhopadhyay
Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
We address the problem of generating speech from silent lip videos for any speaker in the wild. Previous works train either on large amounts of data of isolated speakers or in laboratory settings with a limited vocabulary. In contrast, we can generate speech for the lip movements of arbitrary identities in any voice without any additional speaker-specific fine-tuning. Our new VAE-GAN approach allows us to learn strong audio-visual associations despite the ambiguous nature of the task.
Abstract
In this work, we address the problem of generating speech from silent lip videos for any speaker in the wild. In stark contrast to previous works in lip-to-speech synthesis, our work (i) is not restricted to a fixed number of speakers, (ii) does not explicitly impose constraints on the domain or the vocabulary, and (iii) deals with videos that are recorded in the wild as opposed to within laboratory settings. The task presents a host of challenges, with the key one being that many features of the desired target speech, like voice, pitch and linguistic content, cannot be entirely inferred from the silent face video. In order to handle these stochastic variations, we propose a new VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations. With the help of multiple powerful discriminators that guide the training process, our generator learns to synthesize speech sequences in any voice for the lip movements of any person. Extensive experiments on multiple datasets show that we outperform all baseline methods by a large margin. Further, our network can be fine-tuned on videos of specific identities to achieve a performance comparable to single-speaker models that are trained on 4x more data. We also conduct numerous ablation studies to analyze the effect of the different modules of our architecture. A demo video in the supplementary material demonstrates several qualitative results and comparisons.
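The sketch below illustrates the VAE part of such a generator, with the video encoder, speech decoder, and latent size left as placeholders; the reparameterised latent models the stochastic factors (voice, pitch) that the silent video does not determine. It is an illustrative simplification, not the paper's exact architecture.

import torch
import torch.nn as nn

class LipToSpeechVAESketch(nn.Module):
    """Simplified VAE-style generator for lip-to-speech; the encoder, decoder
    and latent size are placeholders, not the paper's exact architecture."""

    def __init__(self, video_encoder, speech_decoder, latent_dim=256):
        super().__init__()
        self.video_encoder = video_encoder        # silent lip video -> features
        self.to_mu = nn.LazyLinear(latent_dim)
        self.to_logvar = nn.LazyLinear(latent_dim)
        self.speech_decoder = speech_decoder      # latent + video feats -> melspectrogram

    def forward(self, lip_frames):
        feats = self.video_encoder(lip_frames)    # (B, D) assumed
        mu, logvar = self.to_mu(feats), self.to_logvar(feats)

        # Reparameterisation: the sample models stochastic factors (voice,
        # pitch) that cannot be inferred from the silent video alone
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

        mel = self.speech_decoder(z, feats)       # predicted speech spectrogram
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        # At training time, adversarial terms from the discriminators are
        # added on top of the reconstruction and KL losses.
        return mel, kl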
Paper
Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
Sindhu B Hegde*, K R Prajwal*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar. Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild. ACM-MM, 2022. [PDF] | [BibTeX]
Updated Soon
Demo
--- COMING SOON ---
Contact
Sindhu Hegde
K R Prajwal
Rudrabha Mukhopadhyay
Pipeline for the efficient transfer of parameters from a model trained on a primary task, such as face recognition, to models for secondary tasks, including gender, emotion, head pose, and age, in one pass. The ETL technique identifies and preserves only the task-related filters, which in turn results in a highly sparse network for the efficient training of face-related tasks.
Abstract
Transfer learning is a popular method for obtaining deep trained models for data-scarce face tasks such as head pose and emotion. However, current transfer learning methods are inefficient and time-consuming as they do not fully account for the relationships between related tasks. Moreover, the transferred model is large and computationally expensive. As an alternative, we propose ETL: a technique that efficiently transfers a pre-trained model to a new task by retaining only cross-task aware filters, resulting in a sparse transferred model. We demonstrate the effectiveness of ETL by transferring VGGFace, a popular face recognition model, to four diverse face tasks. Our experiments show that we attain a size reduction of up to 97% and an inference time reduction of up to 94% while retaining 99.5% of the baseline transfer learning accuracy.
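A minimal sketch of filter-level pruning in the spirit described above is shown below; the criterion used to score cross-task aware filters is an assumption here, and the paper's actual selection procedure may differ.

import torch

def prune_filters_for_secondary_task(conv_weight, importance, keep_ratio=0.1):
    """Filter-level pruning in the spirit of ETL; the scoring of
    cross-task aware filters here is an assumption.

    conv_weight: (out_channels, in_channels, k, k) pre-trained conv weights
    importance:  (out_channels,) per-filter relevance to the secondary task,
                 e.g. mean activation or gradient magnitude on its data
    """
    n_keep = max(1, int(keep_ratio * importance.numel()))
    keep = torch.topk(importance, n_keep).indices

    mask = torch.zeros_like(importance, dtype=torch.bool)
    mask[keep] = True

    # Zero out filters irrelevant to the secondary task, yielding a sparse
    # model that is cheaper to fine-tune and run
    pruned = conv_weight.clone()
    pruned[~mask] = 0.0
    return pruned, mask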
Demo
Related Publications
ETL: Efficient Transfer Learning for Face tasks
Thrupthi Ann John, Isha Dua, Vineeth N Balasubramanian and C. V. Jawahar. ETL: Efficient Transfer Learning for Face Tasks. 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2022. [PDF] | [BibTeX]
Contact
For any queries about the work, please contact the authors below.
Thrupthi Ann John - thrupthi [dot] ann [at] research [dot] iiit [dot] ac [dot] in