
An OCR for Classical Indic Documents Containing Arbitrarily Long Words


Abstract

OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. It involves complexities such as image degradation, lack of datasets and long-length words. Due to these challenges, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for application of OCRs on large corpora of classic Sanskrit texts containing arbitrarily long and highly conjoined words.

To access the code and paper, click here.
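The reported error rates are standard edit-distance metrics. As a quick reference, here is a minimal sketch in plain Python (not the authors' evaluation code) of how character and word error rates are typically computed from the Levenshtein distance between a predicted line and its ground truth.

# Minimal sketch, not the authors' evaluation code: CER/WER from Levenshtein distance.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (free if equal)
    return dp[-1]

def cer(reference, prediction):
    """Character error rate: character-level edits normalised by reference length."""
    return edit_distance(reference, prediction) / max(len(reference), 1)

def wer(reference, prediction):
    """Word error rate: word-level edits normalised by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

# Example: one wrong word out of two gives a WER of 0.5.
print(wer("धर्मक्षेत्रे कुरुक्षेत्रे", "धर्मक्षेत्रे कुरक्षेत्रे"))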

Bibtex

If you find our work useful in your research, please consider citing:

 
@InProceedings{Dwivedi_2020_CVPR_Workshops,
author = {Dwivedi, Agam and Saluja, Rohit and Kiran Sarvadevabhatla, Ravi},
title = {An OCR for Classical Indic Documents Containing Arbitrarily Long Words},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2020}
}

Topological Mapping for Manhattan-like Repetitive Environments


Abstract

We showcase a topological mapping framework for a challenging indoor warehouse setting. At the most abstract level, the warehouse is represented as a Topological Graph where the nodes of the graph represent a particular warehouse topological construct (e.g. rackspace, corridor) and the edges denote the existence of a path between two neighbouring nodes or topologies. At the intermediate level, the map is represented as a Manhattan Graph where the nodes and edges are characterized by Manhattan properties, and as a Pose Graph at the lower-most level of detail. The topological constructs are learned via a Deep Convolutional Network while the relational properties between topological instances are learned via a Siamese-style Neural Network. In the paper, we show that maintaining abstractions such as the Topological Graph and Manhattan Graph helps in recovering an accurate Pose Graph starting from a highly erroneous and unoptimized Pose Graph. We show how this is achieved by embedding topological and Manhattan relations as well as Manhattan Graph aided loop closure relations as constraints in the backend Pose Graph optimization framework. The recovery of a near ground-truth Pose Graph on real-world indoor warehouse scenes vindicates the efficacy of the proposed framework.

 

Introduction

We showcase a topological mapping framework for a challenging indoor warehouse setting. At the most abstract level, the warehouse is represented as a Topological Graph where the nodes of the graph represent a particular warehouse topological construct (e.g. rackspace, corridor) and the edges denote the existence of a path between two neighbouring nodes or topologies. At the intermediate level, the map is represented as a Manhattan Graph where the nodes and edges are characterized by Manhattan properties, and as a Pose Graph at the lower-most level of detail. The topological constructs are learned via a Deep Convolutional Network while the relational properties between topological instances are learned via a Siamese-style Neural Network. In the paper, we show that maintaining abstractions such as the Topological Graph and Manhattan Graph helps in recovering an accurate Pose Graph starting from a highly erroneous and unoptimized Pose Graph. We show how this is achieved by embedding topological and Manhattan relations as well as Manhattan Graph aided loop closure relations as constraints in the backend Pose Graph optimization framework. The recovery of a near ground-truth Pose Graph on real-world indoor warehouse scenes vindicates the efficacy of the proposed framework.
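As a concrete illustration of the learning components mentioned above, the snippet below sketches a Siamese-style comparator over CNN features of two topological instances. It is a minimal PyTorch sketch with assumed names and dimensions (InstanceComparator, feat_dim=512), not the code released with the paper.

import torch
import torch.nn as nn

# Illustrative sketch only: a Siamese-style comparator that scores whether two
# topological instances (e.g. two rackspace views) correspond to the same place,
# the kind of relational prediction described above. Names and sizes are assumptions.
class InstanceComparator(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        # Shared trunk applied to both instances' CNN features.
        self.trunk = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Head over the combined embeddings -> probability of "same place".
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, feat_a, feat_b):
        ea, eb = self.trunk(feat_a), self.trunk(feat_b)
        pair = torch.cat([torch.abs(ea - eb), ea * eb], dim=-1)
        return torch.sigmoid(self.head(pair)).squeeze(-1)

# Usage: the features could come from the topological classifier's penultimate layer.
cmp = InstanceComparator()
a, b = torch.randn(4, 512), torch.randn(4, 512)
same_place_prob = cmp(a, b)   # shape (4,), values in (0, 1)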

Qualitative Results:

1) RTABMAP SLAM
Fig. a shows the registered map generated by RTABMAP SLAM. Fig. b shows the RTABMAP trajectory with topological labels. Fig. c compares the RTABMAP trajectory with the ground-truth trajectory. Fig. d compares the trajectory generated using our topological SLAM pipeline with the ground truth.
2) RTABMAP as a Visual Odometry pipeline
Fig. a shows the trajectory obtained using RTABMAP with loop closure turned off; wheel odometry is used as the odometry source. Fig. b compares the RTABMAP trajectory with the ground truth. Fig. c compares the trajectory obtained using our topological SLAM pipeline with the ground truth.

Code:

Our pipeline consists of three parts; each sub-folder in this repo contains the code for one of them:

  • Topological categorization using a convolutional neural network classifier -> Topological Classifier
  • Predicting loop closure constraints using Multi-Layer Perceptron -> Instance Comparator
  • Graph construction and pose graph optimization using obtained Manhattan and Loop Closure Constraints -> Pose Graph Optimizer

How to use each part is explained in the corresponding sub-folder; please see the GitHub Project Page. A sketch of how the resulting constraints can be written out for a backend optimizer is given below.
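For illustration only, the sketch below shows one way Manhattan and loop-closure constraints could be encoded as an SE(2) pose graph in the standard g2o text format for a backend optimizer. The function names and information-matrix values are assumptions for this sketch, not the repository's actual I/O code.

import math

# Sketch only (assumes the common g2o SE(2) text format, not the repo's actual
# pose-graph code): odometry edges have their relative rotation snapped to the
# nearest multiple of 90 degrees (the Manhattan assumption), and predicted loop
# closures are added as extra EDGE_SE2 constraints.

def snap_to_manhattan(theta):
    """Snap a relative rotation (radians) to the nearest multiple of pi/2."""
    return round(theta / (math.pi / 2)) * (math.pi / 2)

def write_g2o(path, poses, odom_edges, loop_edges, info=(500, 0, 0, 500, 0, 500)):
    """poses: list of (x, y, theta); edges: list of (i, j, dx, dy, dtheta)."""
    with open(path, "w") as f:
        for idx, (x, y, th) in enumerate(poses):
            f.write(f"VERTEX_SE2 {idx} {x} {y} {th}\n")
        for i, j, dx, dy, dth in odom_edges:
            dth = snap_to_manhattan(dth)   # enforce Manhattan relative orientation
            f.write(f"EDGE_SE2 {i} {j} {dx} {dy} {dth} " + " ".join(map(str, info)) + "\n")
        for i, j, dx, dy, dth in loop_edges:   # e.g. pairs accepted by the instance comparator
            f.write(f"EDGE_SE2 {i} {j} {dx} {dy} {dth} " + " ".join(map(str, info)) + "\n")

# The resulting file can then be handed to a standard backend (g2o, GTSAM, etc.)
# for pose graph optimization.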


Bibtex

If you find our work useful in your research, please consider citing:

@article{puligilla2020topo,
author = {Puligilla, Sai Shubodh and Tourani, Satyajit and Vaidya, Tushar and Singh Parihar, Udit and Sarvadevabhatla, Ravi Kiran and Krishna, Madhava},
title = {Topological Mapping for Manhattan-Like Repetitive Environments},
journal = {ICRA},
year = {2020},
}

A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild


Prajwal Renukanand*   Rudrabha Mukhopadhyay*   Vinay Namboodiri   C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

[Code]   [Interactive Demo]   [Demo Video]   [ReSyncED]


We propose a novel approach that achieves significantly more accurate lip-synchronization (A) in dynamic, unconstrained talking face videos. In contrast, the corresponding lip shapes generated by the current best model (B) are out of sync with the spoken utterances (shown at the bottom).

Abstract

In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or on videos of specific people seen during the training phase. However, they fail to accurately morph the actual lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the newly chosen audio. We identify key reasons pertaining to this and hence resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics to specifically measure the accuracy of lip synchronization in unconstrained videos. Extensive quantitative and human evaluations on our challenging benchmarks show that the lip-sync accuracy of the videos generated using our Wav2Lip model is almost as good as real synced videos. We clearly demonstrate the substantial impact of our Wav2Lip model in our publicly available demo video. We also open-source our code, models, and evaluation benchmarks to promote future research efforts in this space.


Paper

  • Paper
    A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild

    Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild, ACM Multimedia, 2020.
    [PDF] | [BibTeX]

    @misc{prajwal2020lip,
    title={A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild},
    author={K R Prajwal and Rudrabha Mukhopadhyay and Vinay Namboodiri and C V Jawahar},
    year={2020},
    eprint={2008.10010},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
    }

Live Demo

Please click here for the live demo: https://www.youtube.com/embed/0fXaDCZNOJc


Architecture


Architecture for generating lip movements from speech

Our approach generates accurate lip-sync by learning from an "already well-trained lip-sync expert". Unlike previous works that employ only a reconstruction loss or train a discriminator in a GAN setup, we use a pre-trained discriminator that is already quite accurate at detecting lip-sync errors. We show that fine-tuning it further on the noisy generated faces hampers the discriminator's ability to measure lip-sync, thus also affecting the generated lip shapes.
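The snippet below is a rough PyTorch sketch of that training idea, with simplified stand-in modules (generator, sync_expert) rather than the released Wav2Lip code: the pre-trained sync discriminator stays frozen, and its score on the generated frames is added to the reconstruction loss.

import torch
import torch.nn.functional as F

# Rough sketch of the training objective described above, not the released
# Wav2Lip code. `generator` and `sync_expert` are stand-in modules; the expert
# is assumed to return a cosine-style similarity in [-1, 1] between a window of
# mouth frames and the corresponding mel-spectrogram chunk.

def lip_sync_expert_loss(generator, sync_expert, frames, mel_chunk, target_frames,
                         sync_weight=0.1):          # weighting is an illustrative choice
    for p in sync_expert.parameters():              # the expert stays frozen: it is never
        p.requires_grad_(False)                     # fine-tuned on noisy generated faces

    generated = generator(frames, mel_chunk)        # assumed shape (B, C, T, H, W)
    recon = F.l1_loss(generated, target_frames)     # pixel-level reconstruction loss

    sim = sync_expert(generated, mel_chunk)         # (B,), higher means better sync
    probs = sim.clamp(1e-7, 1.0)                    # interpret as probability of being in sync
    sync = F.binary_cross_entropy(probs, torch.ones_like(probs))

    return recon + sync_weight * sync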


Ethical Use

To ensure fair use, we strongly require that any result created using our algorithm must unambiguously present itself as synthetic and state that it was generated using the Wav2Lip model. In addition to the strong positive applications of this work, our intention in completely open-sourcing it is to simultaneously encourage efforts in detecting manipulated video content and its misuse. We believe that Wav2Lip can enable several positive applications and also encourage productive discussions and research efforts regarding the fair use of synthetic content.


Contact

  1. Prajwal K R
  2. Rudrabha Mukhopadhyay

Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval


Siddhant Bansal   Praveen Krishnan   C.V. Jawahar

DAS 2020


Recognition and retrieval of textual content from large document collections have been a powerful use case for the document image analysis community. Often the word is the basic unit for recognition as well as retrieval. Systems that rely only on the text recogniser's (OCR) output are not robust enough in many situations, especially when the word recognition rates are poor, as in the case of historic documents or digital libraries. An alternative has been word spotting based methods that retrieve/match words based on a holistic representation of the word. In this paper, we fuse the noisy output of a text recogniser with a deep embedding representation derived from the entire word. We use average and max fusion for improving the ranked results in the case of retrieval. We validate our methods on a collection of Hindi documents. We improve the word recognition rate by 1.4% and retrieval mAP by 11.13%.
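To make the fusion step concrete, here is a small illustrative sketch (names and scoring choices are assumptions, not the paper's released code): each word image gets a textual score from the recogniser's output and an embedding score from deep features, and the two are combined by average or max fusion before ranking.

import numpy as np

# Illustrative sketch of average/max fusion for retrieval; function and variable
# names are assumptions, not the paper's code.

def edit_similarity(a, b):
    """1 minus the normalized Levenshtein distance, in [0, 1]."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return 1.0 - dp[-1] / max(len(a), len(b), 1)

def fused_ranking(query_text, query_emb, ocr_texts, word_embs, mode="average"):
    # Embedding score: cosine similarity between the query and each word embedding.
    emb_scores = word_embs @ query_emb / (
        np.linalg.norm(word_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    # Textual score: similarity between the query string and each OCR output.
    txt_scores = np.array([edit_similarity(query_text, t) for t in ocr_texts])
    if mode == "average":
        scores = (emb_scores + txt_scores) / 2.0
    else:  # "max" fusion
        scores = np.maximum(emb_scores, txt_scores)
    return np.argsort(-scores)   # indices of word images, best match first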


Paper

  • Paper
    Fused Text Recogniser and Deep Embeddings Improve Word Recognition and Retrieval

    Siddhant Bansal, Praveen Krishnan and C.V. Jawahar
    DAS, 2020



    [Paper]       [Code]       [Demo]      


    Word Recognition in a nutshell


    Word Retrieval in a nutshell


    Results

    Word Recognition


    Word Retrieval


Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis


Prajwal Renukanand*   Rudrabha Mukhopadhyay*   Vinay Namboodiri   C.V. Jawahar

IIIT Hyderabad       IIT Kanpur

CVPR 2020

[Code]   [Data]

Please click here to watch our video on YouTube.


In this work, we propose a sequence-to-sequence architecture for accurate speech generation from silent lip videos in unconstrained settings for the first time. The text in the bubble is manually transcribed and is shown for presentation purposes.

Abstract

Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip reading, we take a different path from existing works. We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically to train and evaluate the single-speaker lip to speech task in natural settings. We propose an approach to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation using quantitative, qualitative metrics and human evaluation shows that our method is almost twice as intelligible as previous works in this space.


Paper

  • Paper
    Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

    Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis, CVPR, 2020 (Accepted).
    [PDF] | [BibTeX]

    @InProceedings{Prajwal_2020_CVPR,
    author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
    title = {Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis},
    booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month = {June},
    year = {2020}
    }

Live Demo

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis (CVPR, 2020): Please click here to watch the demo on YouTube.


Dataset


Our dataset contains lectures and chess commentary as of now.

We introduce a new benchmark dataset for unconstrained lip to speech synthesis that is tailored towards exploring the following line of thought: how accurately can we infer an individual's speech style and content from his/her lip movements? To create this dataset, we collect a total of about 175 hours of talking face videos across 6 speakers. Our dataset is far more unconstrained and natural than older datasets like the GRID corpus and the TIMIT dataset. All the corpora are compared in the table given below.


Comparison of our dataset with other datasets that have been used earlier for video-to-speech generation

To access the dataset, please click this link or the link given near the top of the page. We release the YouTube IDs of the videos used. In case the videos are no longer available on YouTube, please contact us for an alternate link.

Architecture


Architecture for generating speech from lip movements

Our network consists of a spatio-temporal encoder and an attention-based decoder. The spatio-temporal encoder takes T frames as input and passes them through a 3D CNN. The output of the 3D CNN encoder is fed to an attention-based speech decoder that generates melspectrograms following the seq-to-seq paradigm. For more information about our model and the different design choices we make, please go through our paper.
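A minimal PyTorch sketch of this kind of pipeline is given below, with illustrative layer sizes and names (Lip2SpeechSketch) that are assumptions rather than the released model: a 3D CNN encodes T lip frames into a feature sequence, and an attention-based recurrent decoder emits melspectrogram frames step by step.

import torch
import torch.nn as nn

# Illustrative sketch only (layer sizes and names are assumptions, not the
# released model): a 3D CNN spatio-temporal encoder followed by an
# attention-based recurrent decoder that generates melspectrogram frames.
class Lip2SpeechSketch(nn.Module):
    def __init__(self, feat=256, n_mels=80):
        super().__init__()
        self.n_mels = n_mels
        self.encoder = nn.Sequential(                       # (B, 3, T, H, W) -> (B, 64, T, 1, 1)
            nn.Conv3d(3, 32, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),             # keep the time axis, pool space
        )
        self.proj = nn.Linear(64, feat)
        self.attn = nn.MultiheadAttention(feat, num_heads=4, batch_first=True)
        self.decoder = nn.GRUCell(feat + n_mels, feat)
        self.to_mel = nn.Linear(feat, n_mels)

    def forward(self, frames, n_steps):
        enc = self.encoder(frames).squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        enc = self.proj(enc)                                                # (B, T, feat)
        h = enc.mean(dim=1)                                 # simple initial decoder state
        mel_prev = enc.new_zeros(enc.size(0), self.n_mels)
        mels = []
        for _ in range(n_steps):                            # autoregressive generation
            ctx, _ = self.attn(h.unsqueeze(1), enc, enc)    # attend over encoder features
            h = self.decoder(torch.cat([ctx.squeeze(1), mel_prev], dim=-1), h)
            mel_prev = self.to_mel(h)
            mels.append(mel_prev)
        return torch.stack(mels, dim=1)                     # (B, n_steps, n_mels)

# Example: 25 lip frames of size 48x96 -> 80 melspectrogram frames.
mel = Lip2SpeechSketch()(torch.randn(2, 3, 25, 48, 96), n_steps=80)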

Contact

  1. Prajwal K R
  2. Rudrabha Mukhopadhyay
