Unconstrained Arabic & Urdu Text Recognition using Deep CNN-RNN Hybrid Networks
Mohit Jain
We demonstrate the effectiveness of an end-to-end trainable hybrid CNN-RNN architecture in recognizing Urdu text from printed documents, a task typically known as Urdu OCR, and Arabic text embedded in videos and natural scenes. When dealing with low-resource languages like Arabic and Urdu, a major obstacle to developing a robust recognizer is the lack of large quantities of annotated data. We overcome this problem by synthesizing millions of images from a large vocabulary of words and phrases scraped from the Arabic and Urdu versions of Wikipedia, using a wide variety of fonts downloaded from various online resources.

Building robust recognizers for Arabic and Urdu text has always been a challenging task. Though a lot of research has been done in the field of text recognition, the focus of the vision community has been primarily on English. While Arabic script has started to receive some attention as far as text recognition is concerned, work on other languages that use the Nabataean family of scripts, such as Urdu and Persian, is very limited. Moreover, work in this field generally lacks a standardized structure, making it hard to reproduce and verify claims and results. This is quite surprising considering that Arabic is the fifth most spoken language in the world after Chinese, English, Spanish and Hindi, spoken by 4.7% of the world's population, while Urdu has over 100 million speakers and is spoken widely in Pakistan, where it is the national language, and in India, where it is recognized as one of the 22 official languages.

In this thesis, we introduce the problems associated with text recognition for low-resource languages, namely Arabic and Urdu, in various scenarios. We propose a language-independent hybrid CNN-RNN architecture which can be trained in an end-to-end fashion, and we show its superiority over plain RNN-based methods.
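As a rough illustration of such a hybrid network, the sketch below (assuming PyTorch) stacks convolutional feature extraction, a bidirectional LSTM over the resulting frame sequence, and a per-frame classifier. The layer sizes and class count here are illustrative, not the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal hybrid CNN-RNN recognizer: conv features -> BiLSTM -> per-frame logits."""
    def __init__(self, num_classes):
        super().__init__()
        # Convolutional feature extractor; two poolings shrink the image,
        # and a final adaptive pool collapses the remaining height to 1.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),       # -> (B, 256, 1, W')
        )
        self.rnn = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes)      # num_classes includes the CTC blank

    def forward(self, x):                          # x: (B, 1, H, W) grayscale line images
        f = self.cnn(x).squeeze(2)                 # (B, 256, W')
        f = f.permute(0, 2, 1)                     # (B, W', 256): one feature frame per column
        h, _ = self.rnn(f)                         # (B, W', 256) contextual features
        return self.fc(h)                          # (B, W', num_classes) per-frame logits

model = CRNN(num_classes=40)
logits = model(torch.zeros(2, 1, 32, 128))         # two 32x128 line images
print(logits.shape)                                # torch.Size([2, 32, 40])
```

At training time the per-frame logits would feed a CTC loss (`nn.CTCLoss`), which is what allows end-to-end training from unsegmented line images and their transcriptions.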
Moreover, we dive deeper into the workings of its convolutional layers and verify the robustness of the convolutional features through layer visualizations. We also propose a method to synthesize artificial text images, doing away with the need to annotate large amounts of training data. We outperform previous state-of-the-art methods on existing benchmarks by a significant margin, and we release two new benchmark datasets, for Arabic Scene Text and Urdu Printed Text Recognition, to encourage further research in the field.
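As a toy illustration of the synthetic-image idea, the sketch below (assuming Pillow; the helper `render_word` and its size constants are hypothetical) renders a word onto a clean grayscale canvas. The actual pipeline additionally samples from many Arabic and Urdu fonts and relies on the complex text shaping those scripts require, which Pillow's built-in default font does not provide.

```python
import random
from PIL import Image, ImageDraw

def render_word(word, height=32, pad=8):
    # Rough fixed per-character width for Pillow's built-in bitmap font.
    width = 8 * len(word) + 2 * pad
    img = Image.new("L", (width, height), color=255)   # white background
    draw = ImageDraw.Draw(img)
    # Vary ink intensity slightly, as a stand-in for the richer augmentations
    # (font choice, blur, noise, perspective) a real synthesis pipeline uses.
    draw.text((pad, height // 4), word, fill=random.randint(0, 80))
    return img

sample = render_word("hello")
print(sample.size, sample.mode)                        # (56, 32) L
```

Rendering from a scraped vocabulary in this way makes the amount of labeled training data limited only by rendering time, since the label of each image is the word it was rendered from.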
Year of completion:
Advisor: Prof. C.V. Jawahar
Mohit Jain, Minesh Mathew and C. V. Jawahar, Unconstrained Scene Text and Video Text Recognition for Arabic Script, 1st International Workshop on Arabic Script Analysis and Recognition (ASAR 2017), Nancy, France, 2017. [PDF]