
Audio-Visual Face Reenactment


Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

WACV, 2023

[ Code ]   | [ Interactive Demo ] | [ Demo Video ]

[Figure: Overall architecture of the proposed AVFR-GAN]

The overall pipeline of our proposed Audio-Visual Face Reenactment network (AVFR-GAN) is shown in the figure above. We take source and driving images, along with their face meshes and segmentation masks, and extract keypoints from them. An audio encoder extracts features from the driving audio and uses them to provide attention over the lip region. The audio and visual feature maps are warped together and passed to a carefully designed identity-aware generator, along with features extracted from the source image, to generate the final output.
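The following minimal PyTorch-style sketch illustrates this data flow. All module names, layer shapes, and the simplified attention and warping steps are illustrative assumptions rather than the released implementation; please refer to the code linked above for the actual AVFR-GAN.

# Minimal sketch of the AVFR-GAN data flow described above (not the authors' code).
import torch
import torch.nn as nn

class AVFRSketch(nn.Module):
    def __init__(self, n_keypoints=10, audio_dim=80, ch=64):
        super().__init__()
        # Keypoint detector consumes an image plus its face mesh and segmentation mask (5 channels).
        self.keypoint_detector = nn.Sequential(
            nn.Conv2d(3 + 2, ch, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, n_keypoints * 2))
        # Audio encoder turns audio features into a channel-wise attention vector for the lip region.
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, ch), nn.ReLU(), nn.Linear(ch, ch))
        self.visual_encoder = nn.Conv2d(3, ch, 3, padding=1)
        # Identity-aware generator refines the warped features conditioned on source-image features.
        self.generator = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, source, driving, src_priors, drv_priors, audio):
        # src_priors / drv_priors: face mesh and segmentation mask stacked as two channels.
        kp_src = self.keypoint_detector(torch.cat([source, src_priors], dim=1))
        kp_drv = self.keypoint_detector(torch.cat([driving, drv_priors], dim=1))
        motion = kp_drv - kp_src                      # stand-in for the dense motion field

        src_feat = self.visual_encoder(source)
        attn = torch.sigmoid(self.audio_encoder(audio))[:, :, None, None]
        warped = src_feat * attn                      # audio-guided attention on the feature map
        return self.generator(torch.cat([warped, src_feat], dim=1)), motion

model = AVFRSketch()
out, motion = model(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                    torch.rand(1, 2, 64, 64), torch.rand(1, 2, 64, 64), torch.rand(1, 80))
print(out.shape)  # torch.Size([1, 3, 64, 64])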

Abstract

This work proposes a novel method to generate realistic talking head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense motion field generated using learnable keypoints. We improve the quality of lip sync using audio as an additional input, helping the network to attend to the mouth region. We incorporate additional priors, face segmentation and face mesh, to improve the structure of the reconstructed faces. Finally, we improve the visual quality of the generations by incorporating a carefully designed identity-aware generator module. The identity-aware generator takes the source image and the warped motion features as input to generate a high-quality output with fine-grained details. Our method produces state-of-the-art results and generalizes well to unseen faces, languages, and voices. We comprehensively evaluate our approach using multiple metrics and outperform current techniques both qualitatively and quantitatively. Our work opens up several applications, including enabling low bandwidth video calls.

Paper

  • Paper
    Audio-Visual Face Reenactment

    Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar
    IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023.
    [PDF ] | [BibTeX]

    @InProceedings{Agarwal_2023_WACV,
    author = {Agarwal, Madhav and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C. V.},
    title = {Audio-Visual Face Reenactment},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month = {January},
    year = {2023},
    pages = {5178-5187}
    }

Demo



Contact

  1. Madhav Agarwal
  2. Rudrabha Mukhopadhyay

My View is the Best View: Procedure Learning from Egocentric Videos


Siddhant Bansal, Chetan Arora and C.V. Jawahar

ECCV 2022

PAPER       DATASET      CODE

What is Procedure Learning?

Given multiple videos of a task, the goal is to identify the key-steps and their order to perform the task.

[Figure: Procedure learning from multiple videos of a task]

Given multiple videos of making a pizza, for example, the goal is to identify the steps required to prepare the pizza and their order.
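As a toy illustration of what the output of procedure learning looks like (the key-step names and frame labels below are made up, not taken from any dataset):

# Toy illustration of the procedure-learning output for the pizza example above.
# Each frame of each video is assigned a key-step label (or background).
video_a = ["background"] * 3 + ["knead dough"] * 4 + ["spread sauce"] * 3 + ["bake"] * 5
video_b = ["knead dough"] * 5 + ["background"] * 2 + ["spread sauce"] * 4 + ["bake"] * 3

def key_step_order(frame_labels):
    """Collapse a per-frame labelling into the ordered list of key-steps."""
    order = []
    for label in frame_labels:
        if label != "background" and (not order or order[-1] != label):
            order.append(label)
    return order

print(key_step_order(video_a))  # ['knead dough', 'spread sauce', 'bake']
print(key_step_order(video_b))  # ['knead dough', 'spread sauce', 'bake']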

 

EgoProceL Dataset


EgoProceL is a large-scale dataset for procedure learning. It consists of 62 hours of egocentric videos recorded by 130 subjects performing 16 tasks. EgoProceL contains videos and key-step annotations for multiple tasks from CMU-MMAC and EGTEA Gaze+, as well as individual tasks like toy-bike assembly, tent assembly, PC assembly, and PC disassembly.

Why an egocentric dataset for Procedure Learning?

In third-person videos, the manipulated object appears small and is often occluded by the actor, leading to significant errors in procedure learning. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action.
[Figure: First-person vs. third-person views for procedure learning]
Existing procedure-learning datasets consist mostly of third-person videos. Third-person videos suffer from issues like occlusion and atypical camera locations that make them ill-suited for procedure learning. Additionally, these datasets rely on videos from YouTube, which are noisy. In contrast, we propose to use egocentric videos, which overcome the issues posed by third-person videos. The third-person frames in the figure are from ProceL and CrossTask, and the first-person frames are from EgoProceL.

Overview of EgoProceL

EgoProceL consists of

  • 62 hours of egocentric videos
  • captured by 130 subjects
  • performing 16 tasks
  • a maximum of 17 key-steps per task
  • an average foreground ratio of 0.38
  • an average missing-steps ratio of 0.12
  • an average repeated-steps ratio of 0.49 (a sketch of how such ratios can be computed follows below)
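
A small sketch of how such per-video statistics can be computed from per-frame key-step annotations is given below; the annotation format used here (a list of per-frame labels per video) is an assumption for illustration, not the exact EgoProceL file format (see the README for that).

# Illustrative computation of the ratios reported above, assuming per-frame labels.
def foreground_ratio(frames):
    """Fraction of frames assigned to any key-step (vs. background)."""
    return sum(f != "background" for f in frames) / len(frames)

def missing_steps_ratio(observed_steps, task_steps):
    """Fraction of the task's key-steps that never appear in this video."""
    return len(set(task_steps) - set(observed_steps)) / len(task_steps)

task_steps = ["unscrew case", "remove fan", "detach drive"]
frames = ["background", "unscrew case", "unscrew case", "remove fan", "background"]
print(foreground_ratio(frames))                      # 0.6
print(missing_steps_ratio(set(frames), task_steps))  # 0.333... ('detach drive' missing)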

Downloads

We recommend referring to the README before downloading the videos. Mirror link.

Videos

Link: pc-assembly

Link: pc-disassembly

Annotations

Link: Google Drive

CnC framework for Procedure Learning

We present a novel self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. Our experiments show that CnC outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets by 5.2% and 6.3%, respectively.

[Figure: Overview of the Correspond and Cut (CnC) framework]

CnC takes in multiple videos from the same task and passes them through the embedder network trained using the proposed TC3I loss. The goal of the embedder network is to learn similar embeddings for corresponding key-steps from multiple videos and for temporally close frames. The ProCut Module (PCM) localizes the key-steps required for performing the task. PCM converts the clustering problem to a multi-label graph cut problem. The output provides the assignment of frames to the respective key-steps and their ordering.
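
The sketch below captures the spirit of the embedder objective: pull temporally close frames of a video together and align corresponding frames across two videos of the same task via soft nearest neighbours. It is a simplified cycle-consistency-style stand-in for illustration, not the exact TC3I loss or the ProCut Module from the paper.

# Simplified stand-in for the embedder training objective (NOT the exact TC3I loss).
import torch
import torch.nn.functional as F

def soft_nn(query, keys):
    """Soft nearest neighbour of each query embedding among the key embeddings."""
    sim = -torch.cdist(query, keys)        # higher = closer
    weights = F.softmax(sim, dim=1)        # (Tq, Tk)
    return weights @ keys                  # (Tq, D)

def alignment_loss(emb_a, emb_b):
    """Cycle: frame in A -> soft NN in B -> should land back near the original frame."""
    nn_in_b = soft_nn(emb_a, emb_b)
    cycled = soft_nn(nn_in_b, emb_a)
    return F.mse_loss(cycled, emb_a)

def temporal_coherence_loss(emb, window=1):
    """Temporally adjacent frames of the same video should have similar embeddings."""
    return F.mse_loss(emb[:-window], emb[window:])

emb_a, emb_b = torch.randn(30, 128), torch.randn(40, 128)   # frame embeddings of two videos
loss = alignment_loss(emb_a, emb_b) + 0.1 * temporal_coherence_loss(emb_a)
print(float(loss))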

Paper

  • PDF: Paper; Supplementary
  • arXiv: Paper; Abstract
  • ECCV: Coming soon!

Code

The code for this work is available on GitHub!

Link: Sid2697/EgoProceL-egocentric-procedure-learning

Acknowledgements

This work was supported in part by the Department of Science and Technology, Government of India, under DST/ICPS/Data-Science project ID T-138. A portion of the data used in this paper was obtained from kitchen.cs.cmu.edu and the data collection was funded in part by the National Science Foundation under Grant No. EEEC-0540865. We acknowledge Pravin Nagar and Sagar Verma for recording and sharing the PC Assembly and Disassembly videos at IIIT Delhi. We also acknowledge Jehlum Vitasta Pandit and Astha Bansal for their help with annotating a portion of EgoProceL.

 

Please consider citing if you make use of the EgoProceL dataset and/or the corresponding code:

 
@InProceedings{EgoProceLECCV2022,
author="Bansal, Siddhant
and Arora, Chetan
and Jawahar, C.V.",
title="My View is the Best View: Procedure Learning from Egocentric Videos",
booktitle = "European Conference on Computer Vision (ECCV)", 
year="2022"
}

@InProceedings{CMU_Kitchens,
author = "De La Torre, F. and Hodgins, J. and Bargteil, A. and Martin, X. and Macey, J. and Collado, A. and Beltran, P.",
title = "Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database.",
booktitle = "Robotics Institute",
year = "2008"
}

@InProceedings{egtea_gaze_p,
author = "Li, Yin and Liu, Miao and Rehg, James M.",
title =  "In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video",
booktitle = "European Conference on Computer Vision (ECCV)",
year = "2018"
}

@InProceedings{meccano,
    author    = "Ragusa, Francesco and Furnari, Antonino and Livatino, Salvatore and Farinella, Giovanni Maria",
    title     = "The MECCANO Dataset: Understanding Human-Object Interactions From Egocentric Videos in an Industrial-Like Domain",
    booktitle = "Winter Conference on Applications of Computer Vision (WACV)",
    year      = "2021"
}

@InProceedings{tent,
author = "Jang, Youngkyoon and Sullivan, Brian and Ludwig, Casimir and Gilchrist, Iain and Damen, Dima and Mayol-Cuevas, Walterio",
title = "EPIC-Tent: An Egocentric Video Dataset for Camping Tent Assembly",
booktitle = "International Conference on Computer Vision (ICCV) Workshops",
year = "2019"
}

Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors


Sindhu B Hegde*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       University of Oxford       Univ. of Bath

ACM-MM, 2022

[ Code ]   | [ Paper ] | [ Demo Video ]

[Figure: Overview of extreme-scale talking-face video upsampling]

We solve the problem of upsampling extremely low-resolution (LR) talking-face videos to generate high-resolution (HR) outputs. Our approach exploits the LR frames (8x8 pixels), the corresponding audio signal, and a single HR target identity image to synthesize realistic, high-quality talking-face videos (256x256 pixels).

Abstract

In this paper, we explore an interesting question of what can be obtained from an 8x8 pixel video sequence. Surprisingly, it turns out to be quite a lot. We show that when we process this 8x8 video with the right set of audio and image priors, we can obtain a full-length, 256x256 video. We achieve this 32x scaling of an extremely low-resolution input using our novel audio-visual upsampling network. The audio prior helps to recover the elemental facial details and precise lip shapes, and a single high-resolution target identity image prior provides us with rich appearance details. Our approach is an end-to-end multi-stage framework. The first stage produces a coarse intermediate output video that can then be used to animate the single target identity image and generate realistic, accurate and high-quality outputs. Our approach is simple and performs exceedingly well (an 8x improvement in FID score) compared to previous super-resolution methods. We also extend our model to talking-face video compression, and show that we obtain a 3.5x improvement in terms of bits/pixel over the previous state-of-the-art. The results from our network are thoroughly analyzed through extensive ablation and comparative analysis and demonstration videos (in the paper and supplementary material).
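
The two-stage idea from the abstract can be sketched as follows. This is an illustrative stand-in, not the authors' network: all module names and shapes are assumptions, and the real model is considerably richer.

# Stage 1 fuses the 8x8 frames with audio features into a coarse face; stage 2 uses
# that coarse output to drive a single high-resolution identity image.
import torch
import torch.nn as nn

class CoarseStage(nn.Module):
    def __init__(self, audio_dim=80, ch=64):
        super().__init__()
        self.audio_fc = nn.Linear(audio_dim, ch)
        self.up = nn.Sequential(                      # 8x8 -> 64x64 coarse output
            nn.Conv2d(3 + ch, ch, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, lr_frame, audio):
        a = self.audio_fc(audio)[:, :, None, None].expand(-1, -1, 8, 8)
        return self.up(torch.cat([lr_frame, a], dim=1))

class IdentityStage(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(                     # coarse + HR identity -> 256x256
            nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, coarse, identity_hr):
        coarse_up = nn.functional.interpolate(coarse, size=256, mode="bilinear",
                                              align_corners=False)
        return self.net(torch.cat([coarse_up, identity_hr], dim=1))

lr = torch.rand(1, 3, 8, 8)                # extremely low-resolution input frame
audio = torch.rand(1, 80)                  # audio features for the same time step
identity = torch.rand(1, 3, 256, 256)      # single HR target identity image
coarse = CoarseStage()(lr, audio)
hr = IdentityStage()(coarse, identity)
print(hr.shape)                            # torch.Size([1, 3, 256, 256])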

Paper

  • Paper
    Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors

    Sindhu B Hegde*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    ACM International Conference on Multimedia (ACM-MM), 2022.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Contact

  1. Sindhu Hegde
  2. Rudrabha Mukhopadhyay

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild


Sindhu B Hegde*, K R Prajwal*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       University of Oxford       Univ. of Bath

ACM-MM, 2022

[ Code ]   | [ Paper ] | [ Demo Video ]

[Figure: Lip-to-speech synthesis for arbitrary speakers in the wild]

We address the problem of generating speech from silent lip videos for any speaker in the wild. Previous works train either on large amounts of data of isolated speakers or in laboratory settings with a limited vocabulary. In contrast, our model can generate speech for the lip movements of arbitrary identities in any voice, without any speaker-specific fine-tuning. Our new VAE-GAN approach allows us to learn strong audio-visual associations despite the ambiguous nature of the task.

Abstract

In this work, we address the problem of generating speech from silent lip videos for any speaker in the wild. In stark contrast to previous works in lip-to-speech synthesis, our work (i) is not restricted to a fixed number of speakers, (ii) does not explicitly impose constraints on the domain or the vocabulary and (iii) deals with videos that are recorded in the wild as opposed to within laboratory settings. The task presents a host of challenges with the key one being that many features of the desired target speech like voice, pitch and linguistic content cannot be entirely inferred from the silent face video. In order to handle these stochastic variations, we propose a new VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations. With the help of multiple powerful discriminators that guide the training process, our generator learns to synthesize speech sequences in any voice for the lip movements of any person. Extensive experiments on multiple datasets show that we outperform all baseline methods by a large margin. Further, our network can be fine-tuned on videos of specific identities to achieve a performance comparable to single-speaker models that are trained on 4x more data. We also conduct numerous ablation studies to analyze the effect of different modules of our architecture. A demo video in supplementary material demonstrates several qualitative results and comparisons.
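
The following sketch shows only the VAE part of such a lip-to-speech model: a variational latent captures speech factors (e.g. voice and pitch) that cannot be inferred from the silent video, while a decoder produces a mel-spectrogram conditioned on lip features. It is an illustrative stand-in, not the authors' architecture, and the GAN discriminators mentioned above are omitted.

# Simplified VAE-style lip-to-speech sketch (illustrative stand-in only).
import torch
import torch.nn as nn

class LipToSpeechVAE(nn.Module):
    def __init__(self, lip_feat_dim=512, latent_dim=64, n_mels=80):
        super().__init__()
        self.to_mu = nn.Linear(lip_feat_dim, latent_dim)
        self.to_logvar = nn.Linear(lip_feat_dim, latent_dim)
        self.decoder = nn.GRU(lip_feat_dim + latent_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, lip_feats):                      # (B, T, lip_feat_dim)
        pooled = lip_feats.mean(dim=1)                 # summarise the video
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        z_seq = z[:, None, :].expand(-1, lip_feats.size(1), -1)
        hidden, _ = self.decoder(torch.cat([lip_feats, z_seq], dim=-1))
        return self.to_mel(hidden), mu, logvar         # mel-spectrogram + KL terms

model = LipToSpeechVAE()
mel, mu, logvar = model(torch.randn(2, 25, 512))       # 25 video frames of lip features
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
print(mel.shape, float(kl))                            # torch.Size([2, 25, 80]) ...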

Paper

  • Paper
    Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

    Sindhu B Hegde*, K R Prajwal*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    ACM International Conference on Multimedia (ACM-MM), 2022.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Contact

  1. Sindhu Hegde
  2. K R Prajwal
  3. Rudrabha Mukhopadhyay

ETL: Efficient Transfer Learning for Face Tasks


Thrupthi Ann John[1], Isha Dua[1], Vineeth N Balasubramanian[2] and C.V. Jawahar[1]

IIIT Hyderabad[1] IIT Hyderabad[2]

[ Video ]   | [ PDF ]

[Figure: ETL pipeline for transferring a face-recognition model to secondary face tasks]

Pipeline for efficient transfer of parameters from a model trained on a primary task, such as face recognition, to models for secondary tasks, including gender, emotion, head pose and age, in one pass. The ETL technique identifies and preserves only the task-related filters, which in turn results in a highly sparse network for efficient training of face-related tasks.

 

Abstract

Transfer learning is a popular method for obtaining deep trained models for data-scarce face tasks such as head pose and emotion. However, current transfer learning methods are inefficient and time-consuming as they do not fully account for the relationships between related tasks. Moreover, the transferred model is large and computationally expensive. As an alternative, we propose ETL: a technique that efficiently transfers a pre-trained model to a new task by retaining only cross-task aware filters, resulting in a sparse transferred model. We demonstrate the effectiveness of ETL by transferring VGGFace, a popular face recognition model, to four diverse face tasks. Our experiments show that we attain a size reduction of up to 97% and an inference time reduction of up to 94% while retaining 99.5% of the baseline transfer learning accuracy.
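
The idea of retaining only task-relevant filters can be sketched as follows. The relevance criterion used here (mean absolute activation on secondary-task data) is an illustrative stand-in, not the exact cross-task-aware criterion from the paper.

# Keep only the filters of a pre-trained layer that matter for the secondary task
# and zero out the rest, yielding a structured-sparse transferred model.
import torch
import torch.nn as nn

def prune_conv_filters(conv, data, keep_ratio=0.25):
    """Zero out the conv filters with the lowest mean |activation| on `data`."""
    with torch.no_grad():
        acts = conv(data).abs().mean(dim=(0, 2, 3))            # one score per filter
        n_keep = max(1, int(keep_ratio * acts.numel()))
        keep = torch.topk(acts, n_keep).indices
        mask = torch.zeros_like(acts)
        mask[keep] = 1.0
        conv.weight.mul_(mask[:, None, None, None])             # structured sparsity
        if conv.bias is not None:
            conv.bias.mul_(mask)
    return mask

conv = nn.Conv2d(3, 64, 3, padding=1)        # stands in for a pre-trained VGGFace layer
secondary_task_batch = torch.rand(8, 3, 64, 64)
mask = prune_conv_filters(conv, secondary_task_batch, keep_ratio=0.25)
print(f"kept {int(mask.sum())} of {mask.numel()} filters")      # kept 16 of 64 filters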

Demo


Related Publications

ETL: Efficient Transfer Learning for Face Tasks

Thrupthi Ann John, Isha Dua, Vineeth N Balasubramanian and C. V. Jawahar
ETL: Efficient Transfer Learning for Face Tasks, 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2022. [ PDF ] | [ BibTeX ]

Contact

For any queries about the work, please contact the authors below:

  1. Thrupthi Ann John - thrupthi [dot] ann [at] research [dot] iiit [dot] ac [dot] in
  2. Isha Dua: duaisha1994 [at] gmail [dot] com
