RoadText-1K: Text Detection & Recognition Dataset for Driving Videos


Perceiving text is crucial to understanding the semantics of outdoor scenes, and is hence a critical requirement for building intelligent systems for driver assistance and self-driving. Most of the existing datasets for text detection and recognition comprise still images and are mostly compiled with text in mind. This paper introduces a new "RoadText-1K" dataset for text in driving videos. The dataset is 20 times larger than the existing largest dataset for text in videos. Our dataset comprises 1000 video clips of driving without any bias towards text, with annotations for text bounding boxes and transcriptions in every frame. State-of-the-art methods for text detection, recognition and tracking are evaluated on the new dataset, and the results highlight the challenges posed by unconstrained driving videos compared to existing datasets. This suggests that RoadText-1K is suited for research and development of reading systems robust enough to be incorporated into more complex downstream tasks like driver assistance and self-driving.
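Each frame's annotations pair a bounding box with a transcription, and text instances persist across frames. A minimal sketch of how one such per-frame record could be represented in code (the field names and layout here are illustrative assumptions, not the dataset's actual schema):

```python
from dataclasses import dataclass

@dataclass
class TextAnnotation:
    """One annotated text instance in a single video frame (illustrative schema)."""
    frame_id: int          # index of the frame within the clip
    track_id: int          # identity of the text instance across frames
    bbox: tuple            # (x_min, y_min, x_max, y_max) in pixels
    transcription: str     # ground-truth text, e.g. a road sign's content

# A toy example: the same sign tracked over two consecutive frames.
annotations = [
    TextAnnotation(frame_id=0, track_id=1, bbox=(120, 40, 220, 80), transcription="STOP"),
    TextAnnotation(frame_id=1, track_id=1, bbox=(118, 42, 218, 82), transcription="STOP"),
]
```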


Related Publications

  • Sangeeth Reddy, Minesh Mathew, Lluis Gomez, Marçal Rusinol, Dimosthenis Karatzas and C. V. Jawahar, RoadText-1K: Text Detection & Recognition Dataset for Driving Videos, International Conference on Robotics and Automation (ICRA 2020). [pdf]


RoadText-1K Video: Link

RoadText-1K Dataset: Link

Please contact the authors for further information.

Munich to Dubai: How far is it for Semantic Segmentation?


Hot weather conditions in cities introduce geometric distortions in captured images, which adversely affect the performance of semantic segmentation models. In this work, we study the problem of adapting semantic segmentation models to such hot-climate cities. This issue could be circumvented by collecting and annotating images in such weather conditions and training segmentation models on them, but semantically annotating images for every environment is painstaking and expensive. Hence, we propose a framework that improves the performance of semantic segmentation models without explicitly creating an annotated dataset for such adverse weather variations. Our framework consists of two parts: a restoration network that removes the geometric distortions caused by hot weather, and an adaptive segmentation network trained with an additional loss to adapt to the statistics of the ground-truth segmentation map. Trained on the Cityscapes dataset, our framework shows a total IoU gain of 12.707 over standard segmentation models. We also observe that our framework gives a significant improvement for small classes such as pole, person and rider, which are essential and valuable for autonomous-navigation applications.
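The reported gain is measured in intersection-over-union (IoU). As a reference for how per-class IoU is computed from predicted and ground-truth label maps (the standard definition, not code from the paper):

```python
def per_class_iou(pred, gt, num_classes):
    """Intersection-over-Union for each class, given flat lists of integer labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        ious.append(inter / union if union else float("nan"))
    return ious

# Toy 2x2 label maps, flattened: class 0 = road, class 1 = pole.
pred = [0, 0, 1, 1]
gt   = [0, 1, 1, 1]
print(per_class_iou(pred, gt, 2))  # → [0.5, 0.6666666666666666]
```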





Related Publications

  • Shyam Nandan Rai, Vineeth N Balasubramanian, Anbumani Subramanian and C. V. Jawahar, Munich to Dubai: How far is it for Semantic Segmentation?, Winter Conference on Applications of Computer Vision (WACV 2020). [pdf], [Supp] and [code]

Please contact the authors for further information.

Surface Reconstruction

Recovering a 3D human body shape from a monocular image is an ill-posed problem in computer vision with great practical importance for many applications, including virtual and augmented reality platforms, animation industry, e-commerce domain, etc.


HumanMeshNet: Polygonal Mesh Recovery of Humans


3D human body reconstruction from a monocular image is an important problem in computer vision with applications in virtual and augmented reality platforms, the animation industry, the e-commerce domain, etc. While several of the existing works formulate it as volumetric or parametric learning with complex and indirect reliance on re-projections of the mesh, we focus on implicitly learning the mesh representation. To that end, we propose a novel model, HumanMeshNet, that regresses a template mesh's vertices and is regularized by the 3D skeletal locations in a multi-branch, multi-task setup. The image-to-mesh vertex regression is further regularized by the neighborhood constraint imposed by the mesh topology, ensuring smooth surface reconstruction. The proposed paradigm can theoretically learn local surface deformations induced by body shape variations and can therefore learn high-resolution meshes going ahead. We show performance comparable to the state of the art (in terms of surface and joint error) with far lower computational complexity and modeling cost, enabling real-time reconstructions, on three publicly available datasets. We also show the generalizability of the proposed paradigm for the similar task of predicting hand mesh models. Given these initial results, we would like to exploit the mesh topology in an explicit manner going ahead.


In this paper, we attempt to work in between a generic point cloud and a mesh: we learn an "implicitly structured" point cloud. Producing high-resolution meshes is a natural extension that is easier in 3D space than in the parametric one. We present an initial solution in that direction, HumanMeshNet, which simultaneously performs shape estimation by regressing to template mesh vertices (by minimizing a surface loss) and receives a body-pose regularization from a parallel branch in a multi-task setup. Ours is a relatively simple model compared to the majority of the existing methods for volumetric and parametric model prediction (e.g., BodyNet). This makes it efficient in terms of network size as well as feed-forward time, yielding significantly higher frame-rate reconstructions. At the same time, our simpler network achieves comparable accuracy in terms of surface and joint error w.r.t. the majority of state-of-the-art techniques on three publicly available datasets. The proposed paradigm can theoretically learn local surface deformations induced by body shape variations, which the PCA space of parametric body models cannot capture. In addition to predicting the body model, we also show the generalizability of our proposed idea for solving a similar task with a different structure: non-rigid hand mesh reconstruction from a monocular image.


We propose a novel model, HumanMeshNet, a multi-task 3D human mesh reconstruction pipeline. Given a monocular RGB image (a), we first extract a body part-wise segmentation mask using DensePose (b). Then, using a joint embedding of both the RGB image and the segmentation mask (c), we predict the 3D joint locations (d) and the 3D mesh (e) in a multi-task setup. The 3D mesh is predicted by first applying a mesh regularizer on the predicted point cloud. Finally, the loss is minimized on both branches (d) and (e).


  • We propose a simple end-to-end multi-branch, multi-task deep network that exploits a "structured point cloud" to recover a smooth and fixed topology mesh model from a monocular image.
  • The proposed paradigm can theoretically learn local surface deformations induced by body shape variations which the PCA space of parametric body models can't capture.
  • The simplicity of the model makes it efficient in terms of network size as well as feed forward time yielding significantly high frame-rate reconstructions, while simultaneously achieving comparable accuracy in terms of surface and joint error, as shown on three publicly available datasets.
  • We also show the generalizability of our proposed paradigm for a similar task of reconstructing the hand mesh models from a monocular image.
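The neighborhood constraint above can be thought of as a smoothness penalty over the mesh graph. A minimal sketch of one such Laplacian-style penalty (an assumed form for illustration, not the paper's exact loss):

```python
def smoothness_penalty(vertices, edges):
    """Sum of squared distances between each vertex and the mean of its neighbors.

    vertices: list of (x, y, z) tuples; edges: list of (i, j) index pairs.
    This is an illustrative regularizer, not the loss used in HumanMeshNet.
    """
    neighbors = {i: [] for i in range(len(vertices))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    penalty = 0.0
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue
        mean = [sum(vertices[j][k] for j in nbrs) / len(nbrs) for k in range(3)]
        penalty += sum((vertices[i][k] - mean[k]) ** 2 for k in range(3))
    return penalty

# Three collinear vertices on a chain: the end vertices are each one unit
# away from their single neighbor's position, contributing 1.0 apiece.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
edges = [(0, 1), (1, 2)]
print(smoothness_penalty(verts, edges))  # → 2.0
```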

Related Publication:

  • Abbhinav Venkat, Chaitanya Patel, Yudhik Agrawal and Avinash Sharma, HumanMeshNet: Polygonal Mesh Recovery of Humans, ICCV Workshops (3DRW 2019). [pdf], [Video], [Code]

DeepHuMS: Deep Human Action Signature for 3D Skeletal Sequences

[Video]


3D human action indexing and retrieval is an interesting problem due to the rise of several data-driven applications aimed at analyzing and/or reusing 3D human skeletal data, such as data-driven animation, analysis of sports bio-mechanics, human surveillance, etc. Spatio-temporal articulations of humans, noisy/missing data, different speeds of the same action, etc. make it challenging, and several of the existing state-of-the-art methods use hand-crafted features along with optimization-based or histogram-based comparison in order to perform retrieval. Further, they demonstrate it only on very small datasets with few classes. We make a case for using a learned representation that should recognize the action as well as enforce a discriminative ranking. To that end, we propose a 3D human action descriptor learned using a deep network. Our learned embedding is generalizable and applicable to real-world data, addressing the aforementioned challenges, and further enables sub-action searching in its embedding space using another network. Our model exploits the inter-class similarity using trajectory cues, and performs far superior in a self-supervised setting. State-of-the-art results on all these fronts are shown on two large-scale 3D human action datasets: NTU RGB+D and HDM05.


In this paper, we attempt to solve the problem of indexing and retrieval of 3D skeletal videos. In order to build a 3D human action descriptor, we need to exploit the spatio-temporal features in the skeletal action data. Briefly, we have three key components: (i) the input skeletal locations and joint-level action trajectories to the next frame, (ii) an RNN to model this temporal data, and (iii) a novel trajectory-based similarity metric (explained below) to project similar content together using a Siamese architecture. We use two setups to train our model: (a) self-supervised, with a "contrastive loss" to train our Siamese model, and (b) supervised, with a cross-entropy loss on our embedding, in addition to the self-supervision.

Overview of our model, DeepHuMS. Given two skeleton sequences (a), we first extract the 3D joint locations and the action field between consecutive frames to represent the spatio-temporal data (b). The two are concatenated and given to an RNN to model the 4D data (c). The resulting embeddings (d) are compared using (e) a contrastive loss (and optionally a classification loss) to make them "discriminative" and "recognition-robust". Similarity is enforced based on the full sequence's action distance and action field. At retrieval time, given a 3D sequence as input to the network (g), a nearest-neighbour search with the resultant embedding is done in the embedding space (f) generated from the training data.
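The contrastive loss used to train the Siamese branches has a standard margin-based form: pull same-action embeddings together, push different-action embeddings at least a margin apart. A minimal sketch (the margin value and toy embeddings are illustrative assumptions):

```python
import math

def contrastive_loss(emb_a, emb_b, same_action, margin=1.0):
    """Margin-based contrastive loss over a pair of embedding vectors.

    same_action=True penalizes distance; same_action=False penalizes pairs
    that fall inside the margin. Standard form, not the paper's exact code.
    """
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    if same_action:
        return d ** 2
    return max(0.0, margin - d) ** 2

# The pair below is distance 5 apart (a 3-4-5 triangle in 2-D):
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_action=True))   # → 25.0
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_action=False))  # → 0.0 (beyond margin)
```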


  • We propose a novel deep learning model that makes use of trajectory cues, and optionally class labels, in order to build a discriminative and robust 3D human action descriptor for retrieval.
  • Further, we perform sub-action search by learning a mapping from sub-sequences to longer sequences in the dataset by means of another network.
  • Experiments are performed both with and without class label supervision. We demonstrate our model's ability to exploit the inter-class action similarity better in the unsupervised setting, resulting in a more generalized solution.
  • Our model is learned on noisy/missing data as well as actions of different speeds and its robustness in such scenarios indicates its applicability to real world data.
  • A comparison of our retrieval performance with the publicly available state of the art in 3D action recognition as well as 3D action retrieval is done on two large-scale publicly available datasets to demonstrate the state-of-the-art results of the proposed model.
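Once the embedding is learned, retrieval reduces to a nearest-neighbour search in the embedding space. A minimal sketch of that step (plain Euclidean ranking over toy 2-D embeddings; a real system would use the network's learned embeddings):

```python
def retrieve(query_emb, gallery, k=2):
    """Return indices of the k gallery embeddings nearest to the query (Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    ranked = sorted(range(len(gallery)), key=lambda i: sq_dist(query_emb, gallery[i]))
    return ranked[:k]

# Toy gallery of 2-D action embeddings; index 2 is closest to the query.
gallery = [[0.0, 0.0], [5.0, 5.0], [1.0, 1.1], [9.0, 0.0]]
print(retrieve([1.0, 1.0], gallery))  # → [2, 0]
```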

Related Publication:

  • Neeraj Battan, Abbhinav Venkat and Avinash Sharma, DeepHuMS: Deep Human Action Signature for 3D Skeletal Sequences, (ACPR 2019, Oral). [pdf], [Video], [Code]

Towards Automatic Face-to-Face Translation

ACM Multimedia 2019


F2FT banner

Given a speaker speaking in a language A (Hindi in this case), our fully-automated system generates a video of the speaker speaking in language B (English). Here, we illustrate a potential real-world application of such a system, where two people can engage in a natural conversation in their own respective languages.


In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach to what we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact in multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language processing. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages.


  • Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Jerin Philip, Abhishek Jha, Vinay Namboodiri and C. V. Jawahar, Towards Automatic Face-to-Face Translation, ACM Multimedia (MM 2019). [PDF]

    @inproceedings{10.1145/3343031.3351066,
      author = {K R, Prajwal and Mukhopadhyay, Rudrabha and Philip, Jerin and Jha, Abhishek and Namboodiri, Vinay and Jawahar, C V},
      title = {Towards Automatic Face-to-Face Translation},
      booktitle = {Proceedings of the 27th ACM International Conference on Multimedia},
      series = {MM '19},
      year = {2019},
      isbn = {978-1-4503-6889-6},
      location = {Nice, France},
      pages = {1428--1436},
      numpages = {9},
      url = {},
      doi = {10.1145/3343031.3351066},
      acmid = {3351066},
      publisher = {ACM},
      address = {New York, NY, USA},
      keywords = {cross-language talking face generation, lip synthesis, neural machine translation, speech to speech translation, translation systems, voice transfer},
    }


Video: Link

Speech-to-Speech Translation


Pipeline for Speech-to-Speech Translation

Our system can be broadly divided into two sub-systems: (a) speech-to-speech translation and (b) lip synthesis. We perform speech-to-speech translation by combining ASR, NMT and TTS. We first use a publicly available ASR to obtain the text transcript; for English we use DeepSpeech, and for other languages such as Hindi and French we use suitable publicly available ASR systems. We train our own NMT system for different Indian languages using Facebook AI Research's publicly available codebase. Finally, we train a TTS for each language of our choice, which generates speech in the target language.
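The cascade described above can be sketched as three pluggable stages. The stand-in functions below are purely illustrative (a real pipeline would call DeepSpeech, the trained NMT model, and the TTS):

```python
def speech_to_speech(audio, asr, nmt, tts):
    """Cascade pipeline: transcribe source speech, translate the text,
    then synthesize speech in the target language."""
    source_text = asr(audio)        # e.g. DeepSpeech for English input
    target_text = nmt(source_text)  # e.g. an NMT model for the language pair
    return tts(target_text)         # TTS trained for the target language

# Hypothetical stand-in stages, not the actual models:
toy_asr = lambda audio: "hello"
toy_nmt = lambda text: {"hello": "namaste"}[text]
toy_tts = lambda text: f"<audio:{text}>"

print(speech_to_speech(b"raw-audio-bytes", toy_asr, toy_nmt, toy_tts))  # → <audio:namaste>
```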

Synthesizing Talking Faces from Speech


Pipeline for generating talking faces of any identity given a speech segment

We create a novel model called LipGAN which can generate talking faces of any person given a speech segment. The model comprises two encoders: (a) a face encoder and (b) a speech encoder. The face encoder encodes information about the identity of the talking face. The speech encoder takes a very small speech segment (350 ms of audio at a time) and encodes the audio information. The outputs from both encoders are then fed to a decoder, which generates a face image of the given identity whose lip shape matches the given audio segment. For more information about our model, please go through the paper.
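Since the speech encoder consumes 350 ms of audio per generated face, each video frame needs a corresponding audio window. A small sketch of that bookkeeping (the 25 fps frame rate and 16 kHz sampling rate are assumed values, not stated above):

```python
def audio_window(frame_idx, fps=25, window_ms=350, sample_rate=16000):
    """Sample-index range of the window_ms audio segment centred on a video frame.

    fps and sample_rate are illustrative assumptions; only the 350 ms window
    size comes from the description above.
    """
    frame_center_s = (frame_idx + 0.5) / fps
    half_s = (window_ms / 1000.0) / 2.0
    start = max(0, round((frame_center_s - half_s) * sample_rate))
    end = round((frame_center_s + half_s) * sample_rate)
    return start, end

# Frame 10 at 25 fps is centred at 0.42 s; a 350 ms window spans 0.245 s to 0.595 s.
print(audio_window(10))  # → (3920, 9520)
```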