CVIT Projects

IndicSpeech: Text-to-Speech Corpus for Indian Languages

[Dataset]

Word clouds of the collected corpus for 3 languages

Abstract

India is a country where several tens of languages are spoken by over a billion strong population. Text-to-speech systems for such languages will thus be extremely beneficial for wide-spread content creation and accessibility. Despite this, the current TTS systems for even the most popular Indian languages fall short of the contemporary state-of-the-art systems for English, Chinese, etc. We believe that one of the major reasons for this is the lack of large, publicly available text-to-speech corpora in these languages that are suitable for training neural text-to-speech systems. To mitigate this, we release a large text-to-speech corpus for 3 major Indian languages, namely Hindi, Malayalam and Bengali. In this work, we also train a state-of-the-art TTS system for each of these languages and report their performances.

Paper

IndicSpeech: Text-to-Speech Corpus for Indian Languages

Nimisha Srivastava, Rudrabha Mukhopadhyay*, Prajwal K R*, C.V. Jawahar
IndicSpeech: Text-to-Speech Corpus for Indian Languages, LREC, 2020
[PDF] | [BibTeX]

@inproceedings{srivastava-etal-2020-indicspeech,
title = "{I}ndic{S}peech: Text-to-Speech Corpus for {I}ndian Languages",
author = "Srivastava, Nimisha and
Mukhopadhyay, Rudrabha and
K R, Prajwal and
Jawahar, C V",
booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.789",
pages = "6417--6422",
abstract = "India is a country where several tens of languages are spoken by over a billion strong population. Text-to-speech systems for such languages will thus be extremely beneficial for wide-spread content creation and accessibility. Despite this, the current TTS systems for even the most popular Indian languages fall short of the contemporary state-of-the-art systems for English, Chinese, etc. We believe that one of the major reasons for this is the lack of large, publicly available text-to-speech corpora in these languages that are suitable for training neural text-to-speech systems. To mitigate this, we release a 24 hour text-to-speech corpus for 3 major Indian languages namely Hindi, Malayalam and Bengali. In this work, we also train a state-of-the-art TTS system for each of these languages and report their performances. The collected corpus, code, and trained models are made publicly available.",
language = "English",
ISBN = "979-10-95546-34-4",
}

Live Demo

Demo video: https://bhaasha.iiit.ac.in/indic-tts/

Contact

Prajwal K R
Rudrabha Mukhopadhyay

RoadText-1K: Text Detection & Recognition Dataset for Driving Videos

Abstract

Perceiving text is crucial to understand semantics of outdoor scenes and hence is a critical requirement to build intelligent systems for driver assistance and self-driving. Most of the existing datasets for text detection and recognition comprise still images and are mostly compiled keeping text in mind. This paper introduces a new "RoadText-1K" dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. Our dataset comprises 1000 video clips of driving without any bias towards text and with annotations for text bounding boxes and transcriptions in every frame. State of the art methods for text detection, recognition and tracking are evaluated on the new dataset and the results signify the challenges in unconstrained driving videos compared to existing datasets. This suggests that RoadText-1K is suited for research and development of reading systems, robust enough to be incorporated into more complex downstream tasks like driver assistance and self-driving.


Related Publications

Sangeeth Reddy, Minesh Mathew, Lluis Gomez, Marçal Rusinol, Dimosthenis Karatzas and C. V. Jawahar, RoadText-1K: Text Detection & Recognition Dataset for Driving Videos, International Conference on Robotics and Automation (ICRA 2020). [pdf]

Dataset

RoadText-1K Video: Link

RoadText-1K Dataset: Link


Munich to Dubai: How far is it for Semantic Segmentation?

Abstract

Hot weather conditions in cities result in geometrical distortion, thereby adversely affecting the performance of semantic segmentation models. In this work, we study the problem of adapting semantic segmentation models to such hot-climate cities. This issue can be circumvented by collecting and annotating images in such weather conditions and training segmentation models on those images. But the task of semantically annotating images for every environment is painstaking and expensive. Hence, we propose a framework that improves the performance of semantic segmentation models without explicitly creating an annotated dataset for such adverse weather variations. Our framework consists of two parts: a restoration network that removes the geometrical distortions caused by hot weather, and an adaptive segmentation network that is trained with an additional loss to adapt to the statistics of the ground-truth segmentation map. We train our framework on the Cityscapes dataset and observe a total IoU gain of 12.707 over standard segmentation models. We also observe that the segmentation results obtained by our framework give a significant improvement for small classes such as pole, person, and rider, which are essential and valuable for autonomous navigation based applications.
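For readers unfamiliar with the metric behind the reported gain, the sketch below computes per-class and mean intersection-over-union (IoU) on flattened label maps. It follows the standard IoU definition and is not the authors' evaluation code; the function names and the toy labels are hypothetical.

```python
def per_class_iou(pred, target, num_classes):
    """Return the IoU of each class, or None for a class absent from both maps."""
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either map predicts class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(intersection / union if union else None)
    return ious


def mean_iou(pred, target, num_classes):
    """Average the IoU over the classes that occur in pred or target."""
    valid = [v for v in per_class_iou(pred, target, num_classes) if v is not None]
    return sum(valid) / len(valid) if valid else 0.0


# Toy flattened label maps with 3 classes (hypothetical data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(per_class_iou(pred, target, 3))
print(mean_iou(pred, target, 3))
```

Small classes such as pole or rider cover few pixels, so a handful of corrected pixels changes their per-class IoU considerably, which is why per-class figures are usually reported alongside the mean.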

Overview

Results

Related Publications

Shyam Nandan Rai, Vineeth N Balasubramanian, Anbumani Subramanian and C. V. Jawahar, Munich to Dubai: How far is it for Semantic Segmentation?, Winter Conference on Applications of Computer Vision (WACV 2020). [pdf], [Supp] and [code]


DeepHuMS: Deep Human Action Signature for 3D Skeletal Sequences

[Video]

Abstract:

3D human action indexing and retrieval is an interesting problem due to the rise of several data-driven applications aimed at analyzing and/or reutilizing 3D human skeletal data, such as data-driven animation, analysis of sports bio-mechanics, human surveillance, etc. Spatio-temporal articulations of humans, noisy/missing data, different speeds of the same action, etc. make it challenging, and several of the existing state-of-the-art methods use hand-crafted features along with optimization-based or histogram-based comparison in order to perform retrieval. Further, they demonstrate it only for very small datasets and few classes. We make a case for using a learned representation that should recognize the action as well as enforce a discriminative ranking. To that end, we propose a 3D human action descriptor learned using a deep network. Our learned embedding is generalizable and applicable to real-world data, addressing the aforementioned challenges, and further enables sub-action searching in its embedding space using another network. Our model exploits the inter-class similarity using trajectory cues, and performs far better in a self-supervised setting. State-of-the-art results on all these fronts are shown on two large-scale 3D human action datasets - NTU RGB+D and HDM05.

Method

In this paper, we attempt to solve the problem of indexing and retrieval of 3D skeletal videos. In order to build a 3D human action descriptor, we need to exploit the spatio-temporal features in the skeletal action data. Briefly, we have three key components - (i) the input skeletal locations and joint-level action trajectories to the next frame, (ii) an RNN to model this temporal data and (iii) a novel trajectory-based similarity metric (explained below) to project similar content together using a Siamese architecture. We use two setups to train our model - (a) a self-supervised setup, with a "contrastive loss" to train our Siamese model and (b) a supervised setup, with a cross-entropy loss on our embedding, in addition to the self-supervision.
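As a rough illustration of two of these components, the sketch below computes a per-joint displacement between consecutive frames (one plausible reading of the joint-level action trajectories) and a standard contrastive loss on a pair of embeddings. It assumes a binary similar/dissimilar label and a fixed margin, whereas the paper derives similarity from its trajectory-based metric, which is not reproduced here. All function names and the `margin` default are hypothetical.

```python
import math


def action_field(sequence):
    """Per-joint displacement between consecutive frames.

    `sequence` is a list of frames; each frame is a list of (x, y, z) joints.
    """
    field = []
    for current, following in zip(sequence, sequence[1:]):
        field.append([
            tuple(b - a for a, b in zip(joint_now, joint_next))
            for joint_now, joint_next in zip(current, following)
        ])
    return field


def contrastive_loss(emb_a, emb_b, similar, margin=1.0):
    """Standard contrastive loss on one pair of embeddings.

    Similar pairs are pulled together; dissimilar pairs are pushed apart
    until they are at least `margin` away (an assumed hyperparameter).
    """
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    if similar:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2


# Two frames of a one-joint skeleton (hypothetical data).
print(action_field([[(0, 0, 0)], [(1, 2, 3)]]))
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=True))
```

In a Siamese setup, both sequences of a pair pass through the same network, and a loss of this shape is applied to the two resulting embeddings.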

Overview of our model - DeepHuMS. Given two skeleton sequences (a), we first extract the 3D joint locations and the action field between consecutive frames to represent the spatio-temporal data (b). The two are concatenated together and given to an RNN to model the 4D data (c). The resulting embeddings (d) are compared using (e) a contrastive loss (and optionally a classification loss) to make them "discriminative" and "recognition-robust". Similarity is enforced based on the full sequence's action distance and action field. At the time of retrieval, a 3D sequence is given to the network (g), and with the resultant embedding, a nearest neighbour search is done in the embedding space (f) generated from the training data.
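The retrieval step can be sketched as a brute-force nearest neighbour search over stored embeddings. This is a minimal illustration assuming Euclidean distance and an in-memory gallery; the `retrieve` function, the gallery layout, and the toy embeddings are hypothetical and not taken from the released code.

```python
import math


def retrieve(query, gallery, k=1):
    """Return the labels of the k gallery embeddings closest to `query`.

    `gallery` is a list of (label, embedding) pairs built from training data.
    """
    ranked = sorted(gallery, key=lambda item: math.dist(query, item[1]))
    return [label for label, _ in ranked[:k]]


# Toy 2D embedding space (hypothetical data).
gallery = [
    ("walk", [0.0, 0.0]),
    ("jump", [1.0, 1.0]),
    ("sit", [5.0, 5.0]),
]
print(retrieve([0.9, 0.8], gallery, k=2))
```

For large galleries, the linear scan would normally be replaced by an approximate nearest neighbour index, but the ranking principle stays the same.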

Contributions:

We propose a novel deep learning model that makes use of trajectory cues, and optionally class labels, in order to build a discriminative and robust 3D human action descriptor for retrieval.
Further, we perform sub-action search by learning a mapping from sub-sequences to longer sequences in the dataset by means of another network.
Experiments are performed both with and without class label supervision. We demonstrate our model's ability to exploit the inter-class action similarity better in the unsupervised setting, thus resulting in a more generalized solution.
Our model is learned on noisy/missing data as well as actions of different speeds, and its robustness in such scenarios indicates its applicability to real-world data.
A comparison of our retrieval performance with the publicly available state of the art in 3D action recognition as well as 3D action retrieval on two large-scale publicly available datasets is done to demonstrate the state-of-the-art results of the proposed model.

Related Publication:

Neeraj Battan, Abbhinav Venkat, Avinash Sharma - DeepHuMS: Deep Human Action Signature for 3D Skeletal Sequences, (ACPR 2019, Oral). [pdf], [Video], [Code]

Surface Reconstruction

Recovering a 3D human body shape from a monocular image is an ill-posed problem in computer vision with great practical importance for many applications, including virtual and augmented reality platforms, animation industry, e-commerce domain, etc.

[Video]

HumanMeshNet: Polygonal Mesh Recovery of Humans

Abstract:

3D human body reconstruction from a monocular image is an important problem in computer vision with applications in virtual and augmented reality platforms, animation industry, e-commerce domain, etc. While several of the existing works formulate it as volumetric or parametric learning with complex and indirect reliance on re-projections of the mesh, we would like to focus on implicitly learning the mesh representation. To that end, we propose a novel model, HumanMeshNet, that regresses a template mesh's vertices, as well as receives a regularization by the 3D skeletal locations in a multi-branch, multi-task setup. The image to mesh vertex regression is further regularized by the neighborhood constraint imposed by mesh topology, ensuring smooth surface reconstruction. The proposed paradigm can theoretically learn local surface deformations induced by body shape variations and can therefore learn high-resolution meshes going ahead. We show comparable performance with SoA (in terms of surface and joint error) with far lower computational complexity and modeling cost, and therefore real-time reconstructions, on three publicly available datasets. We also show the generalizability of the proposed paradigm for a similar task of predicting hand mesh models. Given these initial results, we would like to exploit the mesh topology in an explicit manner going ahead.

Method

In this paper, we attempt to work in between a generic point cloud and a mesh - i.e., we learn an "implicitly structured" point cloud. Attempting to produce high-resolution meshes is a natural extension that is easier in 3D space than in the parametric one. We present an initial solution in that direction - HumanMeshNet, which simultaneously performs shape estimation by regressing to template mesh vertices (by minimizing surface loss) and receives a body pose regularisation from a parallel branch in a multi-task setup. Ours is a relatively simpler model as compared to the majority of the existing methods for volumetric and parametric model prediction (e.g., BodyNet). This makes it efficient in terms of network size as well as feed-forward time, yielding significantly high frame-rate reconstructions. At the same time, our simpler network achieves comparable accuracy in terms of surface and joint error w.r.t. the majority of state-of-the-art techniques on three publicly available datasets. The proposed paradigm can theoretically learn local surface deformations induced by body shape variations which the PCA space of parametric body models can't capture. In addition to predicting the body model, we also show the generalizability of our proposed idea for solving a similar task with a different structure - non-rigid hand mesh reconstruction from a monocular image.
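The multi-task objective described here can be sketched as a surface term on mesh vertices plus a weighted term on 3D joints. The squared-error form, the `joint_weight` parameter, and the function names are assumptions made for illustration; the exact loss terms and weights used by the authors are not given in this text.

```python
def mean_squared_distance(pred_points, true_points):
    """Mean squared Euclidean distance between corresponding 3D points."""
    total = 0.0
    for p, t in zip(pred_points, true_points):
        total += sum((a - b) ** 2 for a, b in zip(p, t))
    return total / len(pred_points)


def multitask_loss(pred_vertices, true_vertices, pred_joints, true_joints,
                   joint_weight=1.0):
    """Surface loss on mesh vertices plus a weighted joint loss (assumed form)."""
    surface = mean_squared_distance(pred_vertices, true_vertices)
    joints = mean_squared_distance(pred_joints, true_joints)
    return surface + joint_weight * joints


# Two mesh vertices and one joint (hypothetical data).
loss = multitask_loss(
    pred_vertices=[(0, 0, 0), (1, 1, 1)],
    true_vertices=[(0, 0, 1), (1, 1, 1)],
    pred_joints=[(0, 0, 0)],
    true_joints=[(0, 2, 0)],
    joint_weight=0.5,
)
print(loss)
```

Because both branches share the image encoder, the joint term acts as a pose regulariser on the features used for vertex regression.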

Pipeline

We propose a novel model, HumanMeshNet, which is a multi-task 3D human mesh reconstruction pipeline. Given a monocular RGB image (a), we first extract a body part-wise segmentation mask using DensePose (b). Then, using a joint embedding of both the RGB image and the segmentation mask (c), we predict the 3D joint locations (d) and the 3D mesh (e), in a multi-task setup. The 3D mesh is predicted by first applying a mesh regularizer on the predicted point cloud. Finally, the loss is minimized on both the branches (d) and (e).
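One common way to impose a neighbourhood constraint from a fixed mesh topology is Laplacian smoothing, sketched below. This is a swapped-in illustrative technique, not necessarily the regularizer used in HumanMeshNet; the `laplacian_smooth` function, the `strength` parameter, and the toy triangle are hypothetical.

```python
def laplacian_smooth(vertices, neighbors, strength=0.5):
    """Move each vertex toward the centroid of its mesh neighbours.

    `vertices` is a list of (x, y, z) tuples; `neighbors` maps a vertex index
    to the indices adjacent to it in the template mesh topology.
    """
    smoothed = []
    for index, vertex in enumerate(vertices):
        adjacent = neighbors.get(index, [])
        if not adjacent:
            # Isolated vertices are left where they are.
            smoothed.append(tuple(vertex))
            continue
        centroid = [
            sum(vertices[j][axis] for j in adjacent) / len(adjacent)
            for axis in range(3)
        ]
        smoothed.append(tuple(
            (1 - strength) * v + strength * c for v, c in zip(vertex, centroid)
        ))
    return smoothed


# A single triangle: every vertex neighbours the other two (hypothetical data).
vertices = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0)]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(laplacian_smooth(vertices, neighbors))
```

Since the template mesh has a fixed topology, the adjacency map can be computed once and reused for every predicted point cloud.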

Contributions:

We propose a simple end-to-end multi-branch, multi-task deep network that exploits a "structured point cloud" to recover a smooth and fixed topology mesh model from a monocular image.
The proposed paradigm can theoretically learn local surface deformations induced by body shape variations which the PCA space of parametric body models can't capture.
The simplicity of the model makes it efficient in terms of network size as well as feed forward time yielding significantly high frame-rate reconstructions, while simultaneously achieving comparable accuracy in terms of surface and joint error, as shown on three publicly available datasets.
We also show the generalizability of our proposed paradigm for a similar task of reconstructing the hand mesh models from a monocular image.

Related Publication:

Abbhinav Venkat, Chaitanya Patel, Yudhik Agrawal, Avinash Sharma - HumanMeshNet: Polygonal Mesh Recovery of Humans, (ICCV-3DRW 2019). [pdf], [Video], [Code]
