
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis


Prajwal Renukanand*   Rudrabha Mukhopadhyay*   Vinay Namboodiri   C.V. Jawahar

IIIT Hyderabad       IIT Kanpur

CVPR 2020

[Code]   [Data]

Please click here to watch our video on YouTube.

In this work, we propose a sequence-to-sequence architecture for accurate speech generation from silent lip videos in unconstrained settings for the first time. The text in the bubble is manually transcribed and is shown for presentation purposes.

Abstract

Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip reading, we take a different path from existing works. We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically to train and evaluate the single-speaker lip to speech task in natural settings. We propose an approach to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation using quantitative, qualitative metrics and human evaluation shows that our method is almost twice as intelligible as previous works in this space.


Paper

  • Paper
    Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

    Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis, CVPR, 2020 (Accepted).
    [PDF] | [BibTeX]

    @InProceedings{Prajwal_2020_CVPR,
    author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
    title = {Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis},
    booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2020}
    }

Live Demo

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis (CVPR 2020): please click here to watch the demo on YouTube.


Dataset

Our dataset currently contains lectures and chess commentary.

We introduce a new benchmark dataset for unconstrained lip to speech synthesis that is tailored towards exploring the following line of thought: How accurately can we infer an individual’s speech style and content from his/her lip movements? To create this dataset, we collect a total of about 175 hours of talking face videos across 6 speakers. Our dataset is far more unconstrained and natural than older datasets like the GRID corpus and the TIMIT dataset. All the corpora are compared in the table given below.

Comparison of our dataset with other datasets that have been used earlier for video-to-speech generation

To access the dataset, please click this link or the link given near the top of the page. We release the YouTube IDs of the videos used. In case a video is no longer available on YouTube, please contact us for an alternate link.
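If you want to fetch the videos locally, the sketch below shows one possible way to do it from the released IDs. It assumes a plain-text file with one YouTube ID per line and the third-party yt-dlp tool; neither is part of the official release instructions.

# Minimal sketch: fetch the released videos from their YouTube IDs.
# Assumes a plain-text file with one video ID per line (the actual release
# format may differ) and that the yt-dlp command-line tool is installed.
import subprocess
from pathlib import Path

def download_videos(id_file: str, out_dir: str = "lip2wav_videos") -> None:
    Path(out_dir).mkdir(exist_ok=True)
    for video_id in Path(id_file).read_text().split():
        url = f"https://www.youtube.com/watch?v={video_id}"
        # Skip videos that have been taken down instead of aborting the run.
        result = subprocess.run(["yt-dlp", "-o", f"{out_dir}/%(id)s.%(ext)s", url])
        if result.returncode != 0:
            print(f"Could not fetch {video_id}; contact the authors for an alternate link.")

if __name__ == "__main__":
    download_videos("video_ids.txt")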

Architecture

Architecture for generating speech from lip movements

Our network consists of a spatio-temporal encoder and an attention-based decoder. The spatio-temporal encoder takes a window of T frames as input and passes them through a 3D CNN. We feed the output of the 3D CNN encoder to an attention-based speech decoder, which generates melspectrograms following the sequence-to-sequence paradigm. For more information about our model and the different design choices we make, please go through our paper.
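As a rough illustration of this pipeline, the following PyTorch sketch wires a small 3D-CNN encoder to an attention-based decoder that emits melspectrogram frames. The layer sizes, the simplified non-autoregressive decoder, and the mel dimension of 80 are our own assumptions, not the exact configuration used in the paper.

# Minimal PyTorch sketch of the described pipeline: a 3D-CNN encoder over a
# window of T lip frames, followed by an attention-based decoder that emits
# melspectrogram frames.
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(64, hidden_dim, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
        )

    def forward(self, frames):                      # (B, 3, T, H, W)
        feats = self.conv(frames)                   # (B, C, T, H', W')
        feats = feats.mean(dim=(3, 4))              # pool spatial dims -> (B, C, T)
        return feats.transpose(1, 2)                # (B, T, C) encoder states

class AttentionSpeechDecoder(nn.Module):
    def __init__(self, hidden_dim=256, n_mels=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, n_mels)

    def forward(self, encoder_states, n_steps):
        # A full seq-to-seq decoder would condition each step on the previously
        # generated mel frames; fixed zero queries keep this sketch short.
        B, _, C = encoder_states.shape
        query = torch.zeros(B, n_steps, C, device=encoder_states.device)
        context, _ = self.attn(query, encoder_states, encoder_states)
        out, _ = self.rnn(context)
        return self.proj(out)                       # (B, n_steps, n_mels)

if __name__ == "__main__":
    frames = torch.randn(2, 3, 25, 48, 96)          # batch of 25-frame lip crops
    mels = AttentionSpeechDecoder()(SpatioTemporalEncoder()(frames), n_steps=100)
    print(mels.shape)                               # torch.Size([2, 100, 80])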

Contact

  1. Prajwal K R (email address protected)
  2. Rudrabha Mukhopadhyay (email address protected)

IndicSpeech: Text-to-Speech Corpus for Indian Languages

 

  [Dataset]

Word clouds of the collected corpus for 3 languages

Abstract

India is a country where several tens of languages are spoken by over a billion-strong population. Text-to-speech systems for such languages will thus be extremely beneficial for wide-spread content creation and accessibility. Despite this, the current TTS systems for even the most popular Indian languages fall short of the contemporary state-of-the-art systems for English, Chinese, etc. We believe that one of the major reasons for this is the lack of large, publicly available text-to-speech corpora in these languages that are suitable for training neural text-to-speech systems. To mitigate this, we release a large text-to-speech corpus for 3 major Indian languages, namely Hindi, Malayalam and Bengali. In this work, we also train a state-of-the-art TTS system for each of these languages and report their performances.


Paper

  • IndicSpeech: Text-to-Speech Corpus for Indian Languages

    Nimisha Srivastava, Rudrabha Mukhopadhyay*, Prajwal K R*, C.V. Jawahar
    IndicSpeech: Text-to-Speech Corpus for Indian Languages, LREC, 2020 
    [PDF] | [BibTeX]

    @inproceedings{srivastava-etal-2020-indicspeech,
    title = "{I}ndic{S}peech: Text-to-Speech Corpus for {I}ndian Languages",
    author = "Srivastava, Nimisha and
    Mukhopadhyay, Rudrabha and
    K R, Prajwal and
    Jawahar, C V",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.789",
    pages = "6417--6422",
    abstract = "India is a country where several tens of languages are spoken by over a billion strong population. Text-to-speech systems for such languages will thus be extremely beneficial for wide-spread content creation and accessibility. Despite this, the current TTS systems for even the most popular Indian languages fall short of the contemporary state-of-the-art systems for English, Chinese, etc. We believe that one of the major reasons for this is the lack of large, publicly available text-to-speech corpora in these languages that are suitable for training neural text-to-speech systems. To mitigate this, we release a 24 hour text-to-speech corpus for 3 major Indian languages namely Hindi, Malayalam and Bengali. In this work, we also train a state-of-the-art TTS system for each of these languages and report their performances. The collected corpus, code, and trained models are made publicly available.",
    language = "English",
    ISBN = "979-10-95546-34-4",
    }

Live Demo

Please click here for the demo video: https://bhaasha.iiit.ac.in/indic-tts/


Contact

  1. Prajwal K R (email address protected)
  2. Rudrabha Mukhopadhyay (email address protected)

RoadText-1K: Text Detection & Recognition Dataset for Driving Videos


Abstract

Perceiving text is crucial to understand semantics of outdoor scenes and hence is a critical requirement to build intelligent systems for driver assistance and self-driving. Most of the existing datasets for text detection and recognition comprise still images and are mostly compiled keeping text in mind. This paper introduces a new "RoadText-1K" dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. Our dataset comprises 1000 video clips of driving without any bias towards text and with annotations for text bounding boxes and transcriptions in every frame. State of the art methods for text detection, recognition and tracking are evaluated on the new dataset and the results signify the challenges in unconstrained driving videos compared to existing datasets. This suggests that RoadText-1K is suited for research and development of reading systems, robust enough to be incorporated into more complex downstream tasks like driver assistance and self-driving.
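For a concrete sense of how per-frame detection results are typically scored against ground-truth boxes, here is a minimal IoU-matching sketch. The (x1, y1, x2, y2) box format and the 0.5 threshold are common conventions assumed for illustration; they are not necessarily the exact RoadText-1K evaluation protocol.

# Minimal sketch of IoU-based matching between predicted and ground-truth
# text boxes for one frame.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def match_frame(preds: List[Box], gts: List[Box], thresh: float = 0.5):
    """Greedy one-to-one matching; returns (true pos., false pos., false neg.)."""
    used = set()
    tp = 0
    for p in preds:
        best, best_iou = None, thresh
        for j, g in enumerate(gts):
            if j not in used and iou(p, g) >= best_iou:
                best, best_iou = j, iou(p, g)
        if best is not None:
            used.add(best)
            tp += 1
    return tp, len(preds) - tp, len(gts) - tp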


Related Publications

  • Sangeeth Reddy, Minesh Mathew, Lluis Gomez, Marçal Rusinol, Dimosthenis Karatzas and C. V. Jawahar, RoadText-1K: Text Detection & Recognition Dataset for Driving Videos, International Conference on Robotics and Automation (ICRA 2020). [pdf]

Dataset

RoadText-1K Video: Link

RoadText-1K Dataset: Link

For any queries, please contact us (email address protected).


Munich to Dubai: How far is it for Semantic Segmentation?


Abstract

Hot weather conditions in cities cause geometrical distortions in captured images, thereby adversely affecting the performance of semantic segmentation models. In this work, we study the problem of adapting semantic segmentation models to such hot-climate cities. This issue can be circumvented by collecting and annotating images in such weather conditions and training segmentation models on those images, but semantically annotating images for every environment is painstaking and expensive. Hence, we propose a framework that improves the performance of semantic segmentation models without explicitly creating an annotated dataset for such adverse weather variations. Our framework consists of two parts: a restoration network that removes the geometrical distortions caused by hot weather, and an adaptive segmentation network that is trained with an additional loss to adapt to the statistics of the ground-truth segmentation maps. Training our framework on the Cityscapes dataset gives a total IoU gain of 12.707 over standard segmentation models. We also observe that the segmentation results obtained by our framework show a significant improvement for small classes such as pole, person, and rider, which are essential and valuable for autonomous navigation based applications.
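Since the reported gains are measured in IoU, the short sketch below shows how per-class IoU is typically computed from predicted and ground-truth label maps. The 19-class, Cityscapes-style setup in the example is only illustrative.

# Minimal sketch: per-class IoU for semantic segmentation, the metric behind
# the reported gains. Label maps are integer class IDs per pixel.
import numpy as np

def per_class_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union > 0:
            ious[c] = np.logical_and(pred_c, gt_c).sum() / union
    return ious  # NaN where a class is absent from both maps

if __name__ == "__main__":
    # Example with random label maps for a 19-class (Cityscapes-style) setup.
    rng = np.random.default_rng(0)
    pred = rng.integers(0, 19, size=(512, 1024))
    gt = rng.integers(0, 19, size=(512, 1024))
    print("mean IoU:", np.nanmean(per_class_iou(pred, gt, 19)))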

Overview

(overview figure)

Results

(results figure)

Related Publications

  • Shyam Nandan Rai, Vineeth N Balasubramanian, Anbumani Subramanian and C. V. Jawahar, Munich to Dubai: How far is it for Semantic Segmentation?, Winter Conference on Applications of Computer Vision (WACV 2020). [pdf], [Supp] and [code]

For any queries, please contact us (email address protected).


DeepHuMS: Deep Human Action Signature for 3D Skeletal Sequences


[Video ]

Abstract:

3D Human Action Indexing and Retrieval is an interesting problem due to the rise of several data-driven applications aimed at analyzing and/or reutilizing 3D human skeletal data, such as data-driven animation, analysis of sports bio-mechanics, human surveillance etc. Spatio-temporal articulations of humans, noisy/missing data, different speeds of the same action etc. make it challenging, and several of the existing state-of-the-art methods use hand-crafted features along with optimization-based or histogram-based comparison in order to perform retrieval. Further, they demonstrate it only for very small datasets and few classes. We make a case for using a learned representation that should recognize the action as well as enforce a discriminative ranking. To that end, we propose a 3D human action descriptor learned using a deep network. Our learned embedding is generalizable and applicable to real-world data - addressing the aforementioned challenges and further enabling sub-action searching in its embedding space using another network. Our model exploits the inter-class similarity using trajectory cues, and performs far superior in a self-supervised setting. State-of-the-art results on all these fronts are shown on two large-scale 3D human action datasets - NTU RGB+D and HDM05.

Method

In this paper, we attempt to solve the problem of indexing and retrieval of 3D skeletal videos. In order to build a 3D human action descriptor, we need to exploit the spatio-temporal features in the skeletal action data. Briefly, we have three key components - (i) the input skeletal locations and joint-level action trajectories to the next frame, (ii) an RNN to model this temporal data and (iii) a novel trajectory-based similarity metric (explained below) to project similar content together using a Siamese architecture. We use two setups to train our model - (a) self-supervised, with a "contrastive loss" to train our Siamese model, and (b) supervised, with a cross-entropy loss on our embedding in addition to the self-supervision.

Overview of our model - DeepHuMS. Given two skeleton sequences (a), we first extract the 3D joint locations and the action field between consecutive frames to represent the spatio-temporal data (b). The two are concatenated together and given to an RNN to model the 4D data (c). The resulting embeddings (d) are compared using (e) a contrastive loss (and optionally a classification loss) to make them "discriminative" and "recognition-robust". Similarity is enforced based on the full sequence's action distance and action field. At retrieval time, given a 3D sequence to the network (g), a nearest-neighbour search with the resulting embedding is done in the embedding space (f) generated from the training data.
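As an illustration of this setup, the sketch below encodes a skeleton sequence (3D joint locations concatenated with per-frame joint displacements) with a GRU and trains a Siamese pair with a contrastive loss. The network sizes, the 25-joint layout, and the margin value are assumptions made for the example, not the paper's exact configuration.

# Minimal sketch: GRU skeleton encoder + Siamese contrastive training.
import torch
import torch.nn as nn

class SkeletonEncoder(nn.Module):
    def __init__(self, n_joints=25, embed_dim=128):
        super().__init__()
        # Per frame: 3D joint positions + 3D displacement to the next frame.
        self.rnn = nn.GRU(input_size=n_joints * 6, hidden_size=embed_dim, batch_first=True)

    def forward(self, joints):                        # (B, T, n_joints, 3)
        # Displacement ("action field") per joint; the last frame wraps around,
        # which is adequate for a sketch.
        disp = joints.roll(-1, dims=1) - joints
        x = torch.cat([joints, disp], dim=-1).flatten(2)   # (B, T, n_joints*6)
        _, h = self.rnn(x)
        return h[-1]                                  # (B, embed_dim) sequence embedding

def contrastive_loss(z1, z2, same, margin=1.0):
    """same = 1 for matching pairs, 0 otherwise."""
    d = torch.norm(z1 - z2, dim=1)
    return (same * d.pow(2) + (1 - same) * torch.clamp(margin - d, min=0).pow(2)).mean()

if __name__ == "__main__":
    enc = SkeletonEncoder()
    a, b = torch.randn(4, 60, 25, 3), torch.randn(4, 60, 25, 3)
    same = torch.tensor([1., 0., 1., 0.])
    loss = contrastive_loss(enc(a), enc(b), same)
    loss.backward()

At retrieval time, the embeddings of the training sequences would be indexed once and a nearest-neighbour search would be run against the query embedding.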

Contributions:

  • We propose a novel deep learning model that makes use of trajectory cues, and optionally class labels, in order to build a discriminative and robust 3D human action descriptor for retrieval.
  • Further, we perform sub-action search by learning a mapping from sub-sequences to longer sequences in the dataset by means of another network.
  • Experiments are performed both with and without class label supervision. We demonstrate our model's ability to exploit the inter-class action similarity better in the unsupervised setting, thus resulting in a more generalized solution.
  • Our model is learned on noisy/missing data as well as actions of different speeds, and its robustness in such scenarios indicates its applicability to real-world data.
  • A comparison of our retrieval performance with the publicly available state of the art in 3D action recognition as well as 3D action retrieval is done on two large-scale publicly available datasets to demonstrate the state-of-the-art results of the proposed model.

Related Publication:

  • Neeraj Battan, Abbhinav Venkat, Avinash Sharma - DeepHuMS: Deep Human Action Signature for 3D Skeletal Sequences, ACPR 2019 (Oral). [pdf], [Video], [Code]
