Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks


Abstract

Building robust text recognition systems for languages with cursive scripts like Urdu has always been challenging. The intricacies of the script and the absence of ample annotated data further compound the difficulty of this task. We demonstrate the effectiveness of an end-to-end trainable hybrid CNN-RNN architecture in recognizing Urdu text from printed documents, a task typically known as Urdu OCR. The proposed solution is not bound by any language-specific lexicon, and the model follows a segmentation-free, sequence-to-sequence transcription approach. The network transcribes a sequence of convolutional features from an input image to a sequence of target labels. This discards the need to segment the input image into its constituent characters/glyphs, which is often arduous for scripts like Urdu. Furthermore, the past and future contexts modelled by bidirectional recurrent layers aid the transcription. We outperform previous state-of-the-art techniques on the synthetic UPTI dataset. Additionally, we publish a new dataset curated by scanning printed Urdu publications in various writing styles and fonts, annotated at the line level. We also provide benchmark results of our model on this dataset.
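A minimal sketch of such a hybrid CNN-RNN (CRNN-style) line recognizer is given below, assuming PyTorch; the layer sizes, pooling schedule and alphabet size are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal CRNN-style sketch in PyTorch: convolutional features of a line image
# are read column-by-column by bidirectional LSTMs and transcribed with CTC.
# All layer sizes here are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, img_height=32, channels=1, hidden=256):
        super().__init__()
        # Convolutional feature extractor; height is progressively collapsed
        # so that each column of the feature map becomes one time step.
        self.cnn = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                       # H/2,  W/2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                       # H/4,  W/4
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                     # H/8,  W/4
        )
        feat_h = img_height // 8
        # Bidirectional LSTMs model both past and future context along the width.
        self.rnn = nn.LSTM(256 * feat_h, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes + 1)   # +1 for the CTC blank

    def forward(self, x):                    # x: (batch, channels, H, W)
        f = self.cnn(x)                      # (batch, 256, H/8, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)     # (batch, T, features)
        seq, _ = self.rnn(f)
        return self.fc(seq).log_softmax(-1)                # per-step label scores

# Training pairs these outputs with nn.CTCLoss, so the line image never has to
# be segmented into individual characters or glyphs.
```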

 


Major Contributions

  • Establish new state-of-the-art for Urdu OCR.
  • Release the IIIT-Urdu OCR Dataset and ascertain its superiority in terms of complexity compared to previous benchmark datasets.
  • Provide benchmark results of our model on the IIIT-Urdu OCR Dataset.

Related Publications

  • Mohit Jain, Minesh Mathew and C.V. Jawahar, Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks, 4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China, 2017. [PDF]

Downloads

  • IIIT-Urdu OCR Dataset
    • Curated by scanning multiple Urdu print books and magazines.
    • Contains 1610 Urdu OCR line images.
    • Images are annotated at the line level.

Download: Dataset (49.2 MB) || Readme


Bibtex

If you use this work or dataset, please cite:

 @inproceedings{jain2017unconstrained,
    title={Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks},
    author={Jain, Mohit and Mathew, Minesh and Jawahar, C.~V.},
    booktitle={4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China},
    pages={6},
    year={2017}
  }

Associated People


Unconstrained Scene Text and Video Text Recognition for Arabic Script


Abstract

Building robust recognizers for Arabic has always been challenging. We demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. We outperform previous state-of-the-art on two publicly available video text datasets - ALIF and AcTiV. For the scene text recognition task, we introduce a new Arabic scene text dataset and establish baseline results. For scripts like Arabic, a major challenge in developing robust recognizers is the lack of a large quantity of annotated data. We overcome this by synthesizing millions of Arabic text images from a large vocabulary of Arabic words and phrases. Our implementation is built on top of the CRNN model, which has proven quite effective for English scene text recognition. The model follows a segmentation-free, sequence-to-sequence transcription approach. The network transcribes a sequence of convolutional features from the input image to a sequence of target labels. This does away with the need for segmenting the input image into its constituent characters/glyphs, which is often difficult for Arabic script. Further, the ability of RNNs to model contextual dependencies yields superior recognition results.
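The synthetic-data step mentioned above can be illustrated with a short rendering script. This is only a hedged sketch using Pillow with an assumed font path (fonts/arabic_font.ttf); the authors' actual rendering pipeline, fonts and augmentations are not reproduced here.

```python
# Hedged sketch of synthesizing Arabic text-line images with Pillow.
# The font path, canvas size and sampling scheme are illustrative assumptions;
# the authors' actual rendering pipeline and augmentations are not reproduced.
# For correct glyph joining, the text should also be shaped (e.g. with
# arabic_reshaper and python-bidi) before rendering.
import random
from PIL import Image, ImageDraw, ImageFont

FONT_PATH = "fonts/arabic_font.ttf"   # hypothetical path to an Arabic TTF font

def render_line(text, height=48, pad=10):
    """Render one text line as a grayscale image (black text on white)."""
    font = ImageFont.truetype(FONT_PATH, size=height - 2 * pad)
    # Measure the rendered text to size the canvas.
    probe = ImageDraw.Draw(Image.new("L", (1, 1), 255))
    x0, y0, x1, y1 = probe.textbbox((0, 0), text, font=font)
    img = Image.new("L", (x1 - x0 + 2 * pad, height), 255)
    ImageDraw.Draw(img).text((pad - x0, pad - y0), text, font=font, fill=0)
    return img

def synthesize(vocab, n_samples):
    """Sample short phrases from a word/phrase vocabulary and render them."""
    for _ in range(n_samples):
        text = " ".join(random.sample(vocab, k=random.randint(1, 5)))
        yield text, render_line(text)
```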

 

Fig 1: Examples of Arabic Scene Text and Video Text recognized by our model.
Our work deals only with the recognition of cropped words/lines. The bounding boxes were provided manually.

Major Contributions

  • Establish new state-of-the-art for Arabic Video Text Recognition.
  • Provide benchmark for Arabic Scene Text Recognition.
  • Release new dataset (IIIT-Arabic) for Arabic Scene Text Recognition task.

Related Publications

  • Mohit Jain, Minesh Mathew and C.V. Jawahar, Unconstrained Scene Text and Video Text Recognition for Arabic Script, Proceedings of 1st International Workshop on Arabic Script Analysis and Recognition (ASAR), Nancy, France, 2017. [PDF]

Downloads

  • IIIT-Arabic Dataset
    • Curated by downloading freely available images containing Arabic script from Google Images.
    • Contains 306 full images containing Arabic and English script.
    • The full images are annotated at the word level, yielding 2198 Arabic and 2139 English word images.

Download: Dataset (108.2 MB) || Readme


Bibtex

If you use this work or dataset, please cite:

@InProceedings{JainASAR17,
  author    = "Jain, M., Mathew, M. and Jawahar, C.~V.",
  title     = "Unconstrained Scene Text and Video Text Recognition for Arabic Script",
  booktitle = "1st International Workshop on Arabic Script Analysis and Recognition, Nancy, France",
  year      = "2017",
}

Associated People


From Traditional to Modern: Domain Adaptation for Action Classification in Short Social Video Clips


Abstract

Short internet video clips like vines present a significantly wilder distribution compared to traditional video datasets. In this paper, we focus on the problem of unsupervised action classification in wild vines using traditionally labelled datasets. To this end, we use a simple domain adaptation strategy based on data augmentation. We utilise the semantic word2vec space as a common subspace to embed video features from both the labelled source domain and the unlabelled target domain. Our method incrementally augments the labelled source with target samples and iteratively modifies the embedding function to bring the source and target distributions together. Additionally, we utilise a multi-modal representation that incorporates the noisy semantic information available in the form of hash-tags. We show the effectiveness of this simple adaptation technique on a test set of vines and achieve notable improvements in performance.
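A rough sketch of this incremental augmentation loop is given below; it approximates the embedding function with ridge regression into the word2vec label space, and the iteration count and selection size are illustrative assumptions rather than the paper's exact settings.

```python
# Hedged sketch of the incremental source-augmentation loop described above.
# Ridge regression stands in for the learnt embedding function; iteration
# count and per-round selection size are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import normalize

def adapt(src_feats, src_label_vecs, tgt_feats, class_vecs, iters=5, top_k=50):
    """Incrementally augment the labelled source set with confident target samples."""
    X, Y = src_feats.copy(), src_label_vecs.copy()
    for _ in range(iters):
        # Fit the current embedding: video features -> word2vec label space.
        emb = Ridge(alpha=1.0).fit(X, Y)
        proj = normalize(emb.predict(tgt_feats))
        # Score unlabelled target videos by cosine similarity to class vectors.
        sims = proj @ normalize(class_vecs).T          # (n_target, n_classes)
        conf, pseudo = sims.max(axis=1), sims.argmax(axis=1)
        # Augment the source set with the most confident target samples.
        chosen = np.argsort(-conf)[:top_k]
        X = np.vstack([X, tgt_feats[chosen]])
        Y = np.vstack([Y, class_vecs[pseudo[chosen]]])
    # Refit once on the fully augmented set and return the final embedding.
    return Ridge(alpha=1.0).fit(X, Y)
```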


Challenges

The distribution of videos we are targeting is vine.co. These videos are recorded by users in unconstrained environments and contain significant camera shake, lighting variability, abrupt shots, etc. The challenge is to utilise an existing labelled action-video dataset to gather relevant vines without investing manual labour.


Another challenge is to merge the visual, textual and hash-tag information of a vine to perform the above-stated segregation.


Contribution

  • We attempt to solve the problem of classifying actions in vines by adapting classifiers trained on a source dataset to the target dataset. We do this by iteratively selecting high-confidence videos and modifying the learnt embedding function.
  • We also provide the 3000 vine videos used in our work, along with the hash-tags provided by their uploaders.

 

[Figure: Flowchart of the proposed domain adaptation approach]


Dataset

 

Name          | Link | Description
Code.tar      | Link | Code for running the main classification along with other utility programs
Vine.tar      | Link | Download link for vines
Paper.pdf     | Link | GCPR submission
Supplementary | Link | Supplementary submission

Results


 

[Table: Performance results]


Associated People

Pose-Aware Person Recognition
Vijay Kumar¹, Anoop Namboodiri¹, Manohar Paluri² and C. V. Jawahar¹
¹Center for Visual Information Technology, IIIT Hyderabad
²Facebook AI Research
CVPR 2017

Abstract

Person recognition methods that use multiple body regions have shown significant improvements over traditional face-based recognition. One of the primary challenges in full-body person recognition is the extreme variation in pose and viewpoint. In this work, (i) we present an approach that tackles pose variations utilizing multiple models that are trained on specific poses and combined using pose-aware weights during testing. (ii) For learning a person representation, we propose a network that jointly optimizes a single loss over multiple body regions. (iii) Finally, we introduce new benchmarks to evaluate person recognition in diverse scenarios and show significant improvements over previously proposed approaches on all the benchmarks, including the photo album setting of PIPA.
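The pose-aware fusion at test time can be sketched as a weighted combination of per-pose model scores; the sketch below assumes hypothetical pose_models and pose_estimator objects and is not the exact formulation used in the paper.

```python
# Hedged sketch of pose-aware fusion at test time: scores from per-pose
# models are combined with weights from a pose estimator. `pose_models`
# and `pose_estimator` are hypothetical objects, not the paper's exact API.
import numpy as np

def recognize(sample, pose_models, pose_estimator):
    """
    pose_models    : list of per-pose classifiers, each exposing
                     predict_proba(sample) -> (n_identities,) scores
    pose_estimator : callable returning a probability over poses, shape (n_poses,)
    """
    pose_weights = pose_estimator(sample)                      # pose-aware weights
    scores = np.stack([m.predict_proba(sample) for m in pose_models])
    combined = (pose_weights[:, None] * scores).sum(axis=0)    # weighted fusion
    return int(np.argmax(combined))                            # predicted identity index
```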

Links

Paper

Datasets

PIPA: Dataset   Pose Annotations

IMDB: Dataset

Hannah: Dataset

Soccer: Dataset

Software

code   models

Citation

@InProceedings{vijaycvpr15,
  author    = "Vijay Kumar and Anoop Namboodiri and and Manohar Paluri and Jawahar, C.~V.",
  title     = "Pose-Aware Person Recognition",
  booktitle = "Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition",
  year      = "2017"
}

References

1. N. Zhang et al., Beyond Frontal Faces: Improving Person Recognition using Multiple Cues, CVPR 2015.

2. Oh et al., Person Recognition in Personal Photo Collections, ICCV 2015.

3. Li et al., A Multi-level Contextual Model for Person Recognition in Photo Albums, CVPR 2016.

4. Ozerov et al., On Evaluating Face Tracks in Movies, ICIP 2013.

Acknowledgements

Vijay Kumar is partly supported by TCS PhD Fellowship 2012.

Contact

For any comments and suggestions, please email Vijay.

Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.

Learning to Hash-tag Videos with Tag2Vec

 



Abstract

User-given tags or labels are valuable resources for semantic understanding of visual media such as images and videos. Recently, a new type of labelling mechanism known as hash-tags has become increasingly popular on social media sites. In this paper, we study the problem of generating relevant and useful hash-tags for short video clips. Traditional data-driven approaches for tag enrichment and recommendation use direct visual similarity for label transfer and propagation. We attempt to learn a direct low-cost mapping from video to hash-tags using a two-step training. We first employ a natural language processing (NLP) technique, skip-gram models with neural network training, to learn a low-dimensional vector representation of hash-tags (Tag2Vec) using a corpus of ∼10 million hash-tags. We then train an embedding function to map video features to the low-dimensional Tag2Vec space. We learn this embedding for 29 categories of short video clips with hash-tags. A query video without any tag information can then be directly mapped to the vector space of tags using the learned embedding, and relevant tags can be found by performing a simple nearest-neighbour retrieval in the Tag2Vec space. We validate the relevance of the tags suggested by our system qualitatively and quantitatively with a user study.
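A compact sketch of this two-step pipeline is given below, assuming gensim for the skip-gram (Tag2Vec) step and ridge regression for the video-to-tag-space embedding; the toy data, dimensions and regressor choice are illustrative assumptions.

```python
# Hedged sketch of the two-step Tag2Vec pipeline: (1) train a skip-gram model
# on hash-tag "sentences" with gensim, (2) regress video features into that
# space and suggest tags by nearest-neighbour search. The toy sentences,
# vector size and regressor are illustrative assumptions.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import Ridge

# 1) Tag2Vec: each training "sentence" is the hash-tag list of one video.
tag_sentences = [["#dog", "#puppy", "#cute"], ["#soccer", "#goal", "#win"]]
tag2vec = Word2Vec(tag_sentences, vector_size=100, sg=1, window=5, min_count=1)

def tag_target(tags):
    """Mean Tag2Vec vector of a video's hash-tags (the regression target)."""
    return np.mean([tag2vec.wv[t] for t in tags if t in tag2vec.wv], axis=0)

# 2) Embedding: map video features (e.g. pooled CNN/trajectory features)
#    into the Tag2Vec space with a simple regressor.
def train_embedding(video_feats, video_tags):
    targets = np.stack([tag_target(tags) for tags in video_tags])
    return Ridge(alpha=1.0).fit(video_feats, targets)

def suggest_tags(embedding, query_feat, topn=15):
    vec = embedding.predict(query_feat[None, :])[0]
    # Nearest-neighbour retrieval in the Tag2Vec space.
    return [t for t, _ in tag2vec.wv.similar_by_vector(vec, topn=topn)]
```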


Aim

The distribution of videos we are targeting is vine.co. These videos are recorded by users in unconstrained environments and contain significant camera shake, lighting variability, abrupt shots, etc. We use the folksonomies associated with the videos by their uploaders to create a vector space. This Tag2Vec space is trained using ~2 million hash-tags downloaded for ~15000 categories. The main motivation is to create a plug-and-use system for categorising vines.



Contribution

  • We create a system for easily tagging vines.
  • We provide training sentences comprising hash-tags, as well as vines and their hash-tags for the test categories.

Dataset

Name          | Link | Description
Code.tar      | Link | Code for running the main classification along with other utility programs
Vine.tar      | Link | Download link for vines
HashTags.tar  | Link | Training hash-tags
Paper.pdf     | Link | GCPR submission
Supplementary | Link | Supplementary submission

 


Approach

[Figure: Overview of the two-step Tag2Vec training approach]


Qualitative Results

We conduct a user study in which a user is presented with a video and 15 tags suggested by our system. The user marks the relevant tags, and at the end we compute the average number of relevant tags across classes.
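A tiny sketch of how this per-class relevance score could be computed from the study's raw marks; the (class, count) layout of the data is a hypothetical assumption.

```python
# Hedged sketch of the user-study metric: the number of tags marked relevant
# (out of 15 suggestions) is averaged per class. The (class, count) layout of
# the raw study data is a hypothetical assumption.
from collections import defaultdict

def average_relevant_tags(marks):
    """marks: iterable of (class_name, num_relevant_tags) pairs from the study."""
    per_class = defaultdict(list)
    for cls, n in marks:
        per_class[cls].append(n)
    return {cls: sum(v) / len(v) for cls, v in per_class.items()}
```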



Associated People