Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks


Abstract

Building robust text recognition systems for languages with cursive scripts like Urdu has always been challenging. The intricacies of the script and the scarcity of annotated data make the task harder still. We demonstrate the effectiveness of an end-to-end trainable hybrid CNN-RNN architecture in recognizing Urdu text from printed documents, a task typically known as Urdu OCR. The proposed solution is not bound to any language-specific lexicon, and the model follows a segmentation-free, sequence-to-sequence transcription approach. The network transcribes a sequence of convolutional features from an input image to a sequence of target labels. This removes the need to segment the input image into its constituent characters/glyphs, which is often arduous for scripts like Urdu. Furthermore, past and future contexts modelled by bidirectional recurrent layers aid the transcription. We outperform previous state-of-the-art techniques on the synthetic UPTI dataset. Additionally, we publish a new dataset curated by scanning printed Urdu publications in various writing styles and fonts, annotated at the line level. We also provide benchmark results of our model on this dataset.
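To illustrate what segmentation-free transcription means in practice, the sketch below shows one common way to turn per-frame label scores into a character sequence without ever locating character boundaries. It assumes a CTC-style scheme with a reserved blank label, which this page does not state explicitly. The function name `greedy_decode`, the `BLANK` index, and the toy alphabet are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from itertools import groupby

BLANK = 0  # assumed index reserved for the "no character" label


def greedy_decode(frame_scores, alphabet):
    """Collapse per-frame predictions into a label sequence.

    frame_scores: one list of scores per time step (image column),
                  where index BLANK is the blank label.
    alphabet:     maps label index -> character (BLANK is not mapped).
    """
    # Pick the best-scoring label at every time step.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    # Merge consecutive repeats, then drop the blank labels.
    collapsed = [label for label, _ in groupby(best_path)]
    return "".join(alphabet[label] for label in collapsed if label != BLANK)


# Toy example with a two-character alphabet (placeholder characters).
toy_alphabet = {1: "a", 2: "b"}
toy_scores = [
    [0.1, 0.8, 0.1],    # "a"
    [0.1, 0.7, 0.2],    # "a" again (repeat, merged)
    [0.9, 0.05, 0.05],  # blank separates the two "a"s
    [0.1, 0.8, 0.1],    # "a"
    [0.1, 0.1, 0.8],    # "b"
]
decoded = greedy_decode(toy_scores, toy_alphabet)
```

The blank label is what lets a repeated character survive the merge step: two runs of the same label separated by a blank decode to two characters, while an unbroken run decodes to one.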

 


Major Contributions

  • Establish a new state of the art for Urdu OCR.
  • Release the IIIT-Urdu OCR Dataset and show that it is more complex than previous benchmark datasets.
  • Provide a benchmark dataset and results for the IIIT-Urdu OCR Dataset.

Related Publications

  • Mohit Jain, Minesh Mathew and C.V. Jawahar, Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks, 4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China, 2017. [PDF]

Downloads

  • IIIT-Urdu OCR Dataset
    • Curated by scanning multiple Urdu print books and magazines.
    • Contains 1610 Urdu OCR line images.
    • Images are annotated at the line level.

Download: Dataset (49.2 MB) || Readme


Bibtex

If you use this work or dataset, please cite:

@inproceedings{jain2017unconstrained,
    title={Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks},
    author={Jain, Mohit and Mathew, Minesh and Jawahar, C.~V.},
    booktitle={4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China},
    pages={6},
    year={2017}
}

Associated People