
Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors


Sindhu B Hegde*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       University of Oxford       Univ. of Bath

ACM-MM, 2022

[ Code ]   | [ Paper ] | [ Demo Video ]


We solve the problem of upsampling extremely low-resolution (LR) talking-face videos to generate high-resolution (HR) outputs. Our approach exploits the LR frames (8x8 pixels), the corresponding audio signal, and a single HR target identity image to synthesize realistic, high-quality talking-face videos (256x256 pixels).

Abstract

In this paper, we explore the interesting question of what can be obtained from an 8x8 pixel video sequence. Surprisingly, it turns out to be quite a lot. We show that when we process this 8x8 video with the right set of audio and image priors, we can obtain a full-length, 256x256 video. We achieve this 32x scaling of an extremely low-resolution input using our novel audio-visual upsampling network. The audio prior helps to recover the elemental facial details and precise lip shapes, and a single high-resolution target identity image prior provides us with rich appearance details. Our approach is an end-to-end multi-stage framework. The first stage produces a coarse intermediate output video that can then be used to animate the single target identity image and generate realistic, accurate and high-quality outputs. Our approach is simple and performs exceedingly well (an 8x improvement in FID score) compared to previous super-resolution methods. We also extend our model to talking-face video compression, and show that we obtain a 3.5x improvement in terms of bits/pixel over the previous state-of-the-art. The results from our network are thoroughly analyzed through extensive ablation and comparative analysis and demonstration videos (in the paper and supplementary material).
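To make the two-stage idea concrete, the following PyTorch sketch shows one way a coarse audio-visual upsampler and an identity-driven refinement stage could be wired together. The module names, feature sizes and fusion strategy are illustrative assumptions for exposition, not the released architecture.

# Minimal sketch (assumed shapes): stage 1 fuses 8x8 frames with audio
# features and decodes a coarse face; stage 2 conditions on one HR identity
# image to produce the 256x256 output.
import torch
import torch.nn as nn

class CoarseUpsampler(nn.Module):
    """Stage 1: fuse an 8x8 LR frame with audio features, predict a coarse face."""
    def __init__(self, audio_dim=80, feat=64):
        super().__init__()
        self.visual_enc = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        self.audio_proj = nn.Linear(audio_dim, feat)
        # 8 -> 16 -> 32 -> 64 via learned upsampling
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, lr_frame, audio_feat):
        v = self.visual_enc(lr_frame)                     # (B, feat, 8, 8)
        a = self.audio_proj(audio_feat)[..., None, None]  # (B, feat, 1, 1)
        return self.decoder(v + a)                        # (B, 3, 64, 64)

class IdentityRefiner(nn.Module):
    """Stage 2: animate a single HR identity image using the coarse output."""
    def __init__(self, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, coarse, identity_hr):
        # Upscale the coarse prediction to 256x256, then condition on the identity.
        coarse_up = nn.functional.interpolate(
            coarse, size=256, mode="bilinear", align_corners=False)
        return self.net(torch.cat([coarse_up, identity_hr], dim=1))

lr = torch.randn(2, 3, 8, 8)            # extremely low-resolution input frames
audio = torch.randn(2, 80)              # per-frame audio features (e.g. a mel slice)
identity = torch.randn(2, 3, 256, 256)  # single HR target identity image
hr = IdentityRefiner()(CoarseUpsampler()(lr, audio), identity)
print(hr.shape)                         # torch.Size([2, 3, 256, 256])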

Paper

  • Paper
    Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors

    Sindhu B Hegde*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors, ACM-MM, 2022.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Contact

  1. Sindhu Hegde
  2. Rudrabha Mukhopadhyay

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild


Sindhu B Hegde*, K R Prajwal*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       University of Oxford       Univ. of Bath

ACM-MM, 2022

[ Code ]   | [ Paper ] | [ Demo Video ]


We address the problem of generating speech from silent lip videos for any speaker in the wild. Previous works train either on large amounts of data of isolated speakers or in laboratory settings with a limited vocabulary. In contrast, our approach generates speech for the lip movements of arbitrary identities in any voice without any additional speaker-specific fine-tuning. Our new VAE-GAN approach allows us to learn strong audio-visual associations despite the ambiguous nature of the task.

Abstract

In this work, we address the problem of generating speech from silent lip videos for any speaker in the wild. In stark contrast to previous works in lip-to-speech synthesis, our work (i) is not restricted to a fixed number of speakers, (ii) does not explicitly impose constraints on the domain or the vocabulary and (iii) deals with videos that are recorded in the wild as opposed to within laboratory settings. The task presents a host of challenges with the key one being that many features of the desired target speech like voice, pitch and linguistic content cannot be entirely inferred from the silent face video. In order to handle these stochastic variations, we propose a new VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations. With the help of multiple powerful discriminators that guide the training process, our generator learns to synthesize speech sequences in any voice for the lip movements of any person. Extensive experiments on multiple datasets show that we outperform all baseline methods by a large margin. Further, our network can be fine-tuned on videos of specific identities to achieve a performance comparable to single-speaker models that are trained on 4x more data. We also conduct numerous ablation studies to analyze the effect of different modules of our architecture. A demo video in supplementary material demonstrates several qualitative results and comparisons.
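For readers who want a concrete picture of the VAE-GAN formulation described above, the sketch below shows one possible wiring in PyTorch: a lip encoder that predicts a latent distribution per frame, a decoder that maps sampled latents to mel-spectrogram frames, and a discriminator that scores the generated spectrograms. All shapes, module choices and the use of a single discriminator are simplifying assumptions for exposition, not the paper's exact architecture.

# Assumed shapes: lips (B, 3, T, H, W); mel-spectrogram output (B, T, 80).
import torch
import torch.nn as nn

class LipEncoder(nn.Module):
    """Encode a silent lip sequence into per-frame latent distributions."""
    def __init__(self, latent=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv3d(3, 32, (3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # pool space, keep time
        )
        self.to_mu = nn.Linear(32, latent)
        self.to_logvar = nn.Linear(32, latent)

    def forward(self, lips):                      # (B, 3, T, H, W)
        h = self.cnn(lips).squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 32)
        return self.to_mu(h), self.to_logvar(h)

class SpeechDecoder(nn.Module):
    """Map sampled latents to a mel-spectrogram sequence."""
    def __init__(self, latent=128, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(latent, 256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, z):                         # (B, T, latent)
        h, _ = self.rnn(z)
        return self.out(h)                        # (B, T, n_mels)

class MelDiscriminator(nn.Module):
    """Score mel frames as real or generated (one of possibly several critics)."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, 128), nn.LeakyReLU(0.2),
                                 nn.Linear(128, 1))

    def forward(self, mel):
        return self.net(mel)

def reparameterize(mu, logvar):
    # VAE trick: sample z ~ N(mu, sigma^2) in a differentiable way.
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

lips = torch.randn(2, 3, 25, 48, 96)              # a short silent lip clip
mu, logvar = LipEncoder()(lips)
mel = SpeechDecoder()(reparameterize(mu, logvar))
score = MelDiscriminator()(mel)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
print(mel.shape, score.shape, kl.item())

In a full training loop the generator would be optimized with a reconstruction term on the mel-spectrogram, the KL term above, and adversarial feedback from the discriminator(s).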

Paper

  • Paper
    Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

    Sindhu B Hegde*, K R Prajwal*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild, ACM-MM, 2022.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Contact

  1. Sindhu Hegde
  2. K R Prajwal
  3. Rudrabha Mukhopadhyay

ETL: Efficient Transfer Learning for Face Tasks


Thrupthi Ann John[1], Isha Dua[1], Vineeth N Balasubramanian[2] and C.V. Jawahar[1]

IIIT Hyderabad[1] IIT Hyderabad[2]

[ Video ]   | [ PDF ]


Pipeline for efficient, one-pass transfer of parameters from a model trained on a primary task such as face recognition to models for secondary tasks such as gender, emotion, head-pose and age estimation. The ETL technique identifies and preserves only the task-related filters, which in turn yields a highly sparse network for efficient training of face-related tasks.

 

Abstract

Transfer learning is a popular method for obtaining deep trained models for data-scarce face tasks such as head pose and emotion. However, current transfer learning methods are inefficient and time-consuming as they do not fully account for the relationships between related tasks. Moreover, the transferred model is large and computationally expensive. As an alternative, we propose ETL: a technique that efficiently transfers a pre-trained model to a new task by retaining only 'cross-task aware filters', resulting in a sparse transferred model. We demonstrate the effectiveness of ETL by transferring VGGFace, a popular face recognition model, to four diverse face tasks. Our experiments show that we attain a size reduction of up to 97% and an inference-time reduction of up to 94% while retaining 99.5% of the baseline transfer learning accuracy.
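The sketch below illustrates the general filter-selection idea in PyTorch: score each convolutional filter of a pre-trained face model by its response on the secondary task's data, then zero out the low-scoring filters to obtain a sparse transferred model. The activation-based scoring rule and the keep ratio are illustrative assumptions rather than the exact ETL criterion.

import torch
import torch.nn as nn

@torch.no_grad()
def filter_activation_scores(conv_model, loader, device="cpu"):
    """Mean absolute activation per conv filter, accumulated over a dataset."""
    scores, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # output: (B, C, H, W) -> one score per output channel
            scores[name] = scores.get(name, 0) + output.abs().mean(dim=(0, 2, 3))
        return hook

    for name, m in conv_model.named_modules():
        if isinstance(m, nn.Conv2d):
            hooks.append(m.register_forward_hook(make_hook(name)))
    for x, _ in loader:                 # images from the secondary (target) task
        conv_model(x.to(device))
    for h in hooks:
        h.remove()
    return scores

@torch.no_grad()
def zero_unimportant_filters(conv_model, scores, keep_ratio=0.1):
    """Keep only the top-scoring filters in each conv layer; zero the rest."""
    for name, m in conv_model.named_modules():
        if isinstance(m, nn.Conv2d) and name in scores:
            k = max(1, int(keep_ratio * m.out_channels))
            keep = torch.topk(scores[name], k).indices
            mask = torch.zeros(m.out_channels, dtype=torch.bool)
            mask[keep] = True
            m.weight[~mask] = 0.0
            if m.bias is not None:
                m.bias[~mask] = 0.0

After zeroing, a small task-specific head would typically be fine-tuned on the secondary task; storing and executing only the retained filters is what makes size and inference-time savings possible.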

Demo


Related Publications

ETL: Efficient Transfer Learning for Face Tasks

Thrupthi Ann John, Isha Dua, Vineeth N Balasubramanian and C. V. Jawahar
ETL: Efficient Transfer Learning for Face Tasks, 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2022. [ PDF ] , [ BibTeX ]

Contact

For any queries about the work, please contact the authors below:

  1. Thrupthi Ann John - thrupthi [dot] ann [at] research [dot] iiit [dot] ac [dot] in
  2. Isha Dua: duaisha1994 [at] gmail [dot] com

Canonical Saliency Maps: Decoding Deep Face Models


Thrupthi Ann John[1], Vineeth N Balasubramanian[2] and C.V. Jawahar[1]

IIIT Hyderabad[1] IIT Hyderabad[2]

[ Code ]   | [ Demo Video ]


Abstract

As Deep Neural Network models for face processing tasks approach human-like performance, their deployment in critical applications such as law enforcement and access control has seen an upswing, where any failure may have far-reaching consequences. We need methods to build trust in deployed systems by making their working as transparent as possible. Existing visualization algorithms are designed for object recognition and do not give insightful results when applied to the face domain. In this work, we present 'Canonical Saliency Maps', a new method which highlights relevant facial areas by projecting saliency maps onto a canonical face model. We present two kinds of Canonical Saliency Maps: image-level maps and model-level maps. Image-level maps highlight the facial features responsible for the decision made by a deep face model on a given image, thus helping to understand how a DNN made a prediction on that image. Model-level maps provide an understanding of what the entire DNN model focuses on in each task, and thus can be used to detect biases in the model. Our qualitative and quantitative results show the usefulness of the proposed canonical saliency maps, which can be used on any deep face model regardless of the architecture.
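As a rough illustration of the projection step, the sketch below computes a simple gradient-based saliency map for a face image and warps it into canonical face coordinates using facial landmarks; averaging many such warped maps gives a model-level map. The saliency method, landmark source and similarity warp are stand-in assumptions, not the paper's exact procedure.

import cv2
import numpy as np
import torch

def gradient_saliency(model, image, target_idx):
    """|d score / d pixel|, reduced over channels -> (H, W) saliency map."""
    x = image.clone().requires_grad_(True)          # (1, 3, H, W)
    model(x)[0, target_idx].backward()
    return x.grad.abs().max(dim=1)[0].squeeze(0).numpy()

def to_canonical(saliency, landmarks, canonical_landmarks, size=(112, 112)):
    """Warp an image-space saliency map into canonical face coordinates."""
    M, _ = cv2.estimateAffinePartial2D(landmarks.astype(np.float32),
                                       canonical_landmarks.astype(np.float32))
    return cv2.warpAffine(saliency.astype(np.float32), M, size)

def model_level_map(canonical_maps):
    """Average many canonical image-level maps into one model-level map."""
    return np.mean(np.stack(canonical_maps, axis=0), axis=0)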


Demo


Related Publications

Canonical Saliency Maps: Decoding Deep Face Models
Thrupthi Ann John, Vineeth N Balasubramanian and C. V. Jawahar
Canonical Saliency Maps: Decoding Deep Face Models, IEEE Transactions on Biometrics, Behavior, and Identity Science, 2021, Volume 3, Issue 4. [ PDF ] , [ BibTeX ]

Contact

For any queries about the work, please contact the authors below:

  1. Thrupthi Ann John - thrupthi [dot] ann [at] research [dot] iiit [dot] ac [dot] in

3DHumans

High-Fidelity 3D Scans of People in Diverse Clothing Styles

 


 

About

The 3DHumans dataset provides around 180 meshes of people with diverse body shapes in various garment styles and sizes. We cover a wide variety of clothing styles, ranging from loose, robed clothing such as the saree (a typical South-Asian dress) to relatively tight-fitting clothing such as shirts and trousers. Along with the high-quality geometry (mesh) and texture map, we also provide registered SMPL parameters. The faces of the subjects are blurred and smoothed to preserve privacy. You can watch the demo video here.
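Once access is granted, a scan can be inspected with standard mesh tooling; the minimal Python sketch below uses trimesh. The file names and the SMPL parameter format are assumptions about the download layout and should be adjusted to match the files you actually receive.

import numpy as np
import trimesh

# Load one scan: high-resolution geometry with its texture map.
mesh = trimesh.load("subject_001/mesh.obj", force="mesh", process=False)
print(mesh.vertices.shape, mesh.faces.shape)

# Load the registered SMPL parameters (assumed here to ship as an .npz archive).
smpl = np.load("subject_001/smpl_params.npz")
print({k: smpl[k].shape for k in smpl.files})

mesh.show()   # opens an interactive viewer if a display backend is available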

 

Quality

The dataset was collected using an Artec Eva handheld structured-light scanner. The scanner has a 3D point accuracy of up to 0.1 mm and a 3D resolution of 0.5 mm, enabling the capture of high-frequency geometric details along with high-resolution texture maps. The subjects were scanned in a studio environment with controlled lighting and uniform illumination.


Download Sample

Please click here to download a sample from the full dataset.

 

Request Full Dataset

To get access to the dataset, please fill in and sign the agreement document and send it via email to manager.rnd[AT]iiit.ac.in and asharma[AT]iiit.ac.in with the subject line "Requesting access to 3DHumans (IIITH) dataset". Upon acceptance of your request, you will receive an expiring, password-protected link from which you can download the dataset. If you find our dataset useful, please cite our technical paper as given below.

 

Technical Paper

The 3DHumans dataset was first introduced in our technical paper: SHARP: Shape-Aware Reconstruction of People in Loose Clothing (IJCV, 2022)

 

Citation

If you use our dataset, kindly cite the corresponding technical paper as follows:

@article{Jinka2022,
		doi = {10.1007/s11263-022-01736-z},
		url = {https://doi.org/10.1007/s11263-022-01736-z},
		year = {2022},
		month = dec,
		publisher = {Springer Science and Business Media {LLC}},
		author = {Sai Sagar Jinka and Astitva Srivastava and Chandradeep Pokhariya and Avinash Sharma and P. J. Narayanan},
		title = {SHARP: Shape-Aware Reconstruction of People in Loose Clothing},
		journal = {International Journal of Computer Vision}
		}

Acknowledgements

Dataset collection was financially supported by a DST grant (DST/ICPS/IHDS/2018) and partially facilitated with manpower support from IHub, IIIT Hyderabad.
