Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval


Siddhant Bansal   Praveen Krishnan   C.V. Jawahar

DAS 2020


Recognition and retrieval of textual content from large document collections have been a powerful use case for the document image analysis community. Often the word is the basic unit for recognition as well as retrieval. Systems that rely only on the output of the text recogniser (OCR) are not robust enough in many situations, especially when word recognition rates are poor, as in the case of historic documents or digital libraries. An alternative has been word-spotting methods that retrieve or match words based on a holistic representation of the word. In this paper, we fuse the noisy output of the text recogniser with a deep embedding representation derived from the entire word. We use average and max fusion to improve the ranked results in the case of retrieval. We validate our methods on a collection of Hindi documents. We improve the word recognition rate by 1.4% and retrieval mAP by 11.13%.
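To make the idea of average and max fusion concrete, here is a minimal sketch of how two scores for a candidate word could be combined before ranking. It is an illustration only and not the paper's implementation: the choice of a normalised edit-distance similarity for the recogniser output, cosine similarity for the embeddings, and the helper names (text_similarity, cosine_similarity, fuse, rank_words) are all assumptions made for this example.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def text_similarity(query, ocr_text):
    """Assumed text score: 1 - normalised edit distance, in [0, 1]."""
    longest = max(len(query), len(ocr_text), 1)
    return 1.0 - edit_distance(query, ocr_text) / longest


def cosine_similarity(u, v):
    """Assumed embedding score: cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def fuse(text_score, embed_score, mode="average"):
    """Combine the two scores by average or max fusion."""
    if mode == "average":
        return (text_score + embed_score) / 2.0
    if mode == "max":
        return max(text_score, embed_score)
    raise ValueError("mode must be 'average' or 'max'")


def rank_words(query_text, query_embedding, candidates, mode="average"):
    """Rank candidates, given as (word_id, ocr_text, embedding), by fused score."""
    scored = []
    for word_id, ocr_text, embedding in candidates:
        t = text_similarity(query_text, ocr_text)
        e = cosine_similarity(query_embedding, embedding)
        scored.append((fuse(t, e, mode), word_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word_id for _, word_id in scored]
```

In this sketch, a candidate whose recognised text contains an error can still rank highly if its embedding is close to the query, which is the intuition behind combining the two sources.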


Paper

  • Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval

    Siddhant Bansal, Praveen Krishnan and C.V. Jawahar
    DAS, 2020




    [Paper]       [Code]       [Demo]      


    Word Recognition in a nutshell

    Word Retrieval in a nutshell

    Results

    Word Recognition

    Word Retrieval

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis


Prajwal Renukanand*   Rudrabha Mukhopadhyay*   Vinay Namboodiri   C.V. Jawahar

IIIT Hyderabad       IIT Kanpur

CVPR 2020

[Code]   [Data]

Please click here to watch our video on YouTube.

In this work, we propose a sequence-to-sequence architecture for accurate speech generation from silent lip videos in unconstrained settings for the first time. The text in the bubble is manually transcribed and is shown for presentation purposes.

Abstract

Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip reading, we take a different path from existing works. We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically to train and evaluate the single-speaker lip to speech task in natural settings. We propose an approach to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation using quantitative, qualitative metrics and human evaluation shows that our method is almost twice as intelligible as previous works in this space.


Paper

  • Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

    Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis, CVPR, 2020 (Accepted).
    [PDF]

    @InProceedings{Prajwal_2020_CVPR,
    author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
    title = {Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis},
    booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2020}
    }

Live Demo

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis (CVPR, 2020): please click here to watch the demo on YouTube.


Dataset

Our dataset currently contains lectures and chess commentary.

We introduce a new benchmark dataset for unconstrained lip to speech synthesis that is tailored towards exploring the following question: how accurately can we infer an individual's speech style and content from their lip movements? To create this dataset, we collect a total of about 175 hours of talking-face videos across 6 speakers. Our dataset is far more unconstrained and natural than older datasets like the GRID corpus and the TIMIT dataset. The table below compares these corpora.

Comparison of our dataset with other datasets that have been used earlier for video-to-speech generation

To access the dataset, please click this link or the link given near the top of the page. We release the YouTube IDs of the videos used. In case the videos are no longer available on YouTube, please contact us for an alternate link.

Architecture


Architecture for generating speech from lip movements

Our network consists of a spatio-temporal encoder and an attention-based decoder. The spatio-temporal encoder takes T frames as input and passes them through a 3D CNN. We feed the output of the 3D CNN encoder to an attention-based speech decoder that generates melspectrograms, following the seq-to-seq paradigm. For more information about our model and the design choices we make, please see our paper.
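As a rough illustration of what a single step of an attention-based decoder does with the encoder output, the sketch below computes attention weights over T encoder feature vectors and the resulting context vector. The page does not specify the attention mechanism used, so plain dot-product attention is an assumption here, as are the function names softmax and attend; the actual model is described in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_outputs):
    """One attention step (assumed dot-product scoring).

    decoder_state: the current decoder query vector.
    encoder_outputs: list of T feature vectors, one per encoded time step.
    Returns (weights, context), where context is the weighted sum of
    the encoder outputs.
    """
    scores = [sum(q * k for q, k in zip(decoder_state, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, encoder_outputs))
               for d in range(dim)]
    return weights, context
```

In a full seq-to-seq decoder, a context vector of this kind would be combined with the decoder state at every step to predict the next melspectrogram frame.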

Contact

  1. Prajwal K R
  2. Rudrabha Mukhopadhyay

RoadText-1K: Text Detection & Recognition Dataset for Driving Videos


Abstract

Perceiving text is crucial to understanding the semantics of outdoor scenes and hence is a critical requirement for building intelligent systems for driver assistance and self-driving. Most of the existing datasets for text detection and recognition comprise still images and are mostly compiled with text in mind. This paper introduces a new "RoadText-1K" dataset for text in driving videos. The dataset is 20 times larger than the largest existing dataset for text in videos. Our dataset comprises 1000 video clips of driving, collected without any bias towards text, with annotations for text bounding boxes and transcriptions in every frame. State-of-the-art methods for text detection, recognition and tracking are evaluated on the new dataset, and the results highlight the challenges of unconstrained driving videos compared to existing datasets. This suggests that RoadText-1K is suited for research and development of reading systems robust enough to be incorporated into more complex downstream tasks like driver assistance and self-driving.



Related Publications

  • Sangeeth Reddy, Minesh Mathew, Lluis Gomez, Marçal Rusinol, Dimosthenis Karatzas and C. V. Jawahar, RoadText-1K: Text Detection & Recognition Dataset for Driving Videos, International Conference on Robotics and Automation (ICRA 2020). [pdf]

Dataset

RoadText-1K Video: Link

RoadText-1K Dataset: Link



Text-to-Speech Dataset for Indian Languages


[Code]

Word clouds of the collected corpus for 3 languages

Abstract

India is a country where several tens of languages are spoken by a population of over a billion. Text-to-speech systems for such languages will thus be extremely beneficial for widespread content creation and accessibility. Despite this, the current TTS systems for even the most popular Indian languages fall short of the contemporary state-of-the-art systems for English, Chinese, etc. We believe that one of the major reasons for this is the lack of large, publicly available text-to-speech corpora in these languages that are suitable for training neural text-to-speech systems. To mitigate this, we release a large text-to-speech corpus for three major Indian languages, namely Hindi, Malayalam and Bengali. In this work, we also train a state-of-the-art TTS system for each of these languages and report their performance.


Paper

  • IndicSpeech: Text-to-Speech Corpus for Indian Languages

    Nimisha Srivastava, Rudrabha Mukhopadhyay*, Prajwal K R*, C.V. Jawahar
    IndicSpeech: Text-to-Speech Corpus for Indian Languages, LREC, 2020 (Accepted).
    [PDF]

    WILL BE UPDATED AFTER PUBLICATION

Live Demo

--- We will update the link ---


Dataset Statistics

--- We will update the details ---

Contact

  1. Prajwal K R
  2. Rudrabha Mukhopadhyay

Munich to Dubai: How far is it for Semantic Segmentation?


Abstract

Hot weather conditions in cities result in geometrical distortion in captured images, thereby adversely affecting the performance of semantic segmentation models. In this work, we study the problem of adapting semantic segmentation models to such hot-climate cities. This issue can be circumvented by collecting and annotating images in such weather conditions and training segmentation models on those images. But the task of semantically annotating images for every environment is painstaking and expensive. Hence, we propose a framework that improves the performance of semantic segmentation models without explicitly creating an annotated dataset for such adverse weather variations. Our framework consists of two parts: a restoration network that removes the geometrical distortions caused by hot weather, and an adaptive segmentation network that is trained with an additional loss to adapt to the statistics of the ground-truth segmentation map. Trained on the Cityscapes dataset, our framework shows a total IoU gain of 12.707 over standard segmentation models. We also observe that the segmentation results obtained by our framework give a significant improvement for small classes such as pole, person, and rider, which are essential and valuable for autonomous navigation applications.
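For readers unfamiliar with the metric, the following sketch shows how intersection-over-union (IoU) is commonly computed per class from a predicted and a ground-truth label map, and how the per-class values can be averaged. It is a generic illustration: the abstract does not state how its "total IoU gain" is aggregated, and the function names per_class_iou and mean_iou are chosen for this example only.

```python
def per_class_iou(pred, truth, num_classes):
    """Per-class IoU from two flat lists of per-pixel class ids.

    Returns a list with one IoU per class, or None for classes that
    appear in neither the prediction nor the ground truth.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, truth):
        if p == t:
            # Pixel counts once towards both intersection and union.
            intersection[p] += 1
            union[p] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    return [intersection[c] / union[c] if union[c] else None
            for c in range(num_classes)]


def mean_iou(ious):
    """Average IoU over the classes that are present."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid) if valid else 0.0
```

Small classes such as pole or rider cover few pixels, so a handful of mislabelled pixels changes their IoU considerably, which is one reason per-class values are usually reported alongside the average.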

Overview


Results


Related Publications

  • Shyam Nandan Rai, Vineeth N Balasubramanian, Anbumani Subramanian and C. V. Jawahar, Munich to Dubai: How far is it for Semantic Segmentation?, Winter Conference on Applications of Computer Vision (WACV 2020). [pdf], [Supp] and [code]
