
Unconstrained Arabic & Urdu Text Recognition using Deep CNN-RNN Hybrid Networks


Mohit Jain (Home Page)

Abstract

We demonstrate the effectiveness of an end-to-end trainable hybrid CNN-RNN architecture in recognizing Urdu text from printed documents, a task typically known as Urdu OCR, and Arabic text embedded in videos and natural scenes. When dealing with low-resource languages like Arabic and Urdu, a major obstacle to developing a robust recognizer is the lack of large quantities of annotated data. We overcome this problem by synthesizing millions of images from a large vocabulary of words and phrases scraped from the Arabic and Urdu versions of Wikipedia, rendered in a wide variety of fonts downloaded from various online resources.

Building robust recognizers for Arabic and Urdu text has always been a challenging task. Although a lot of research has been done in the field of text recognition, the vision community has focused primarily on English. While Arabic script has started to receive some attention as far as text recognition is concerned, work on other languages which use the Nabataean family of scripts, such as Urdu and Persian, is very limited. Moreover, work in this field generally lacks a standardized structure, making it hard to reproduce and verify claims or results. This is quite surprising considering that Arabic is the fifth most spoken language in the world after Chinese, English, Spanish and Hindi, catering to 4.7% of the world's population, while Urdu has over 100 million speakers and is spoken widely in Pakistan, where it is the national language, and in India, where it is recognized as one of the 22 official languages.

In this thesis, we introduce the problems related to text recognition of low-resource languages, namely Arabic and Urdu, in various scenarios. We propose a language-independent hybrid CNN-RNN architecture which can be trained in an end-to-end fashion and demonstrate its superiority over simple RNN-based methods. Moreover, we dive deeper into the working of its convolutional layers and verify the robustness of convolutional features through layer visualizations. We also propose a method to synthesize artificial text images to do away with the need to annotate large amounts of training data. We outperform previous state-of-the-art methods on existing benchmarks by a significant margin and release two new benchmark datasets, for Arabic scene text and Urdu printed text recognition, to instill interest among fellow researchers in the field.
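Hybrid CNN-RNN recognizers of this kind typically emit one label distribution per image column and collapse the frame-wise output into a transcription. The abstract does not spell out the decoding step, so as a hedged illustration (assuming the common CTC-style greedy decoding used by such architectures, not necessarily the exact procedure in the thesis), the collapse of repeated labels and removal of blanks might look like:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame label sequence CTC-style:
    merge consecutive repeats, then drop blank symbols."""
    decoded = []
    prev = None
    for label in frame_labels:
        if label != prev:          # a new run starts here
            if label != blank:     # blanks only separate repeats, never emitted
                decoded.append(label)
            prev = label
    return decoded

# Frames [a, a, blank, a, b, b, blank] collapse to [a, a, b]:
# the blank between the two runs of 'a' keeps them distinct.
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

The blank symbol is what lets the network output genuinely doubled characters, which matters for cursive scripts with frequent repeated letters.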

 

Year of completion: June 2018
Advisor: Prof. C.V. Jawahar

Related Publications

  • Mohit Jain, Minesh Mathew and C. V. Jawahar - Unconstrained OCR for Urdu using Deep CNN-RNN Hybrid Networks, 4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China, 2017. [PDF]

  • Minesh Mathew, Mohit Jain and C. V. Jawahar - Benchmarking Scene Text Recognition in Devanagari, Telugu and Malayalam, 6th International Workshop on Multilingual OCR, Kyoto, Japan, 2017. [PDF]

  • Mohit Jain, Minesh Mathew and C. V. Jawahar - Unconstrained Scene Text and Video Text Recognition for Arabic Script, 1st International Workshop on Arabic Script Analysis and Recognition (ASAR 2017), Nancy, France, 2017. [PDF]


Downloads

thesis

Studies in Recognition of Telugu Document Images


Venkat Rasagna (homepage)

Abstract

The rapid evolution of information technology (IT) has prompted massive growth in the digitization of books. Accessing these huge digital collections requires solutions which make the archived materials searchable, and such solutions can only come from research in document image understanding. In the last three decades, many significant developments have been made in the recognition of Latin-based scripts, but recognition systems for Indian languages lag far behind recognizers for European languages such as English. The diversity of archived printed documents poses an additional challenge to document analysis and understanding. In this work, we explore the recognition of printed text in Telugu, a south Indian language. We begin by building the Telugu script model for recognition and adapting an existing optical character recognition system for it. A comprehensive study of all the modules of the optical recognizer is done, with the focus mainly on the recognition module. We then evaluate the recognition module by testing it on synthetic and real datasets. We achieve an accuracy of 98% on the synthetic dataset, but the accuracy drops to 91% on 200 pages from scanned books (the real dataset). To analyze the drop in accuracy and the modules propagating errors, we create datasets of different qualities, namely a laser-print dataset, a good real dataset and a challenging real dataset. Analysis of these experiments revealed the major problems in the character recognition module. We observed that the recognizer is not robust enough to tackle the multifont problem. The classifier's component accuracy varied significantly on pages from different books. Also, there was a huge difference between component and word accuracies: even with a component accuracy of 91%, the word accuracy was just 62%. This motivated us to solve the multifont problem and improve word accuracies. Solving these problems would boost the OCR accuracy of any language.
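The gap between component and word accuracy follows from simple probability: if component errors were independent, a word of n components would be fully correct with probability p^n. With p = 0.91 and an assumed average of five components per word (the average word length is an assumption for illustration; the thesis does not state it), this predicts roughly the observed 62%:

```python
def expected_word_accuracy(component_accuracy, components_per_word):
    """Word accuracy under the (idealized) assumption that each
    component of a word is recognized independently."""
    return component_accuracy ** components_per_word

# 91% component accuracy, ~5 components per word -> ~62% word accuracy
print(round(expected_word_accuracy(0.91, 5), 2))  # 0.62
```

This is why even modest gains in component accuracy translate into large gains at the word level.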

A major requirement in the design of robust OCRs is a feature extraction scheme that is invariant to the popular fonts used in print. Many statistical and structural features have been tried for character classification in the past. In this work, motivated by recent successes in the object category recognition literature, we use a spatial extension of the histogram of oriented gradients (HOG) for character classification. We conducted experiments on 1.46 million Telugu character samples spanning 359 classes and 15 fonts. On this dataset, we obtain an accuracy of 96-98% with an SVM classifier.

A typical optical character recognizer (OCR) uses only local information about a particular character or word to recognize it. In this thesis, we also propose a document-level OCR which exploits the fact that multiple occurrences of the same word image should be recognized as the same word: whenever the OCR output differs for the same word, it must be due to recognition errors. We propose a method to identify such recognition errors and automatically correct them. First, multiple instances of the same word image are clustered using a fast clustering algorithm based on locality sensitive hashing. Three different techniques are then proposed to correct the OCR errors by looking at differences in the OCR output for the words in a cluster: character majority voting, an alignment technique based on dynamic time warping, and one based on progressive alignment of multiple sequences. We demonstrate the approach over hundreds of document images from English and Telugu books by correcting the output of the best-performing OCRs for English and Telugu. The recognition accuracy at word level improves from 93% to 97% for English and from 58% to 66% for Telugu. Our approach is applicable to documents in any language or script.
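The simplest of the three correction techniques, character majority voting, can be sketched in a few lines. This is a hedged illustration rather than the thesis implementation: it assumes all OCR outputs in a cluster have equal length (the DTW and progressive-alignment variants exist precisely to handle length mismatches), and ties go to the first candidate seen:

```python
from collections import Counter

def majority_vote(ocr_outputs):
    """Correct OCR errors within a cluster of outputs for the same
    word image by taking the per-position majority character.
    Assumes all outputs have equal length."""
    assert len(set(map(len, ocr_outputs))) == 1, "outputs must align"
    voted = []
    for position_chars in zip(*ocr_outputs):
        # most_common breaks ties by first appearance in the input
        voted.append(Counter(position_chars).most_common(1)[0][0])
    return "".join(voted)

# Three noisy readings of the same word image vote their errors away.
print(majority_vote(["hallo", "hello", "hellp"]))  # hello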

 

Year of completion: May 2013
Advisor: Prof. C.V. Jawahar

Related Publications


Downloads

thesis

 ppt

Optical Character Recognition as Sequence Mapping


Devendra Kumar Sahu (Homepage)

Abstract

Digitization can preserve the content of fragile originals, such as out-of-print books, by creating an accessible facsimile that puts less strain on the originals. The document analysis community formed to address this by digitizing content, thus making it easily shareable over the Internet, searchable, and amenable to language translation. In this thesis, we view optical character recognition as a sequence mapping problem. We propose extensions to two methods and reduce their limitations. First, we propose an application of a sequence-to-sequence learning architecture which removes two limitations of the previous state-of-the-art method based on a connectionist temporal classification output layer; this method also yields representations which can be used for efficient retrieval. Second, we propose an extension of profile features which retains the same idea while learning the features from data.

In the first work, we propose an application of the sequence-to-sequence learning approach to printed-text optical character recognition. In contrast to present state-of-the-art OCR solutions, which use a Connectionist Temporal Classification (CTC) output layer, our approach makes minimal assumptions about the structure and length of the sequence. We use a two-step encoder-decoder approach: (a) a recurrent encoder reads a variable-length printed text word image and encodes it into a fixed-dimensional embedding; (b) this fixed-dimensional embedding is subsequently decoded into a variable-length text output. The deep word image embedding learnt by the encoder can be used for printed-text retrieval systems. An expressive fixed-dimensional embedding for any variable-length input makes retrieval faster and more efficient, which is not possible with other recurrent neural network architectures. Thus a single model can make predictions, and the features learnt with supervision can be used for efficient retrieval.
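Retrieval over such fixed-dimensional embeddings reduces to nearest-neighbour search in vector space. As a hedged sketch (the vectors below are hypothetical toy values; in the thesis they would come from the trained recurrent encoder), ranking a database of word-image embeddings by cosine similarity to a query might look like:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve(query, database):
    """Return database keys ranked by cosine similarity to the query."""
    return sorted(database, key=lambda k: cosine(query, database[k]), reverse=True)

# Hypothetical 3-d word-image embeddings keyed by transcription.
db = {"cat": [0.9, 0.1, 0.0], "dog": [0.0, 1.0, 0.1], "car": [0.8, 0.2, 0.1]}
print(retrieve([1.0, 0.0, 0.0], db)[0])  # cat
```

Because every word image maps to the same dimensionality regardless of its width, a single index serves all queries, which is the efficiency argument made above.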

In the second work, we investigate the possibility of learning an appropriate set of features for designing an OCR for a specific language. We learn the language-specific features from the data with no supervision: we use unsupervised feature learning with stacked Restricted Boltzmann Machines (RBMs) and combine the learnt features with an RNN-based recognition solution. These features can be interpreted as a deep extension of projection profiles, and can be used as plug-and-play replacements wherever profile features are used. We validate them on five different languages. In addition, these novel features also result in a better convergence rate for the RNNs.
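The classical projection profile that these learnt features extend is simply a count of ink pixels per column (or row) of a binarized word image. A minimal sketch of the baseline feature:

```python
def vertical_profile(binary_image):
    """Column-wise ink counts of a binary image given as a list of rows
    (1 = ink, 0 = background) -- the classical projection profile."""
    return [sum(column) for column in zip(*binary_image)]

# A tiny 3x4 'image' with ink concentrated in the middle columns.
img = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]
print(vertical_profile(img))  # [0, 3, 2, 0]
```

The RBM-learnt features replace this hand-crafted column sum with representations learnt from data, while keeping the same per-column, sequence-friendly layout that RNN recognizers consume.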

 

Year of completion: January 2017
Advisor: Prof. C.V. Jawahar

Related Publications

  • Devendra Sahu, C. V. Jawahar - Unsupervised Feature Learning For Optical Character Recognition, Proceedings of the 13th IAPR International Conference on Document Analysis and Recognition, 23-26 Aug 2015, Nancy, France. [PDF]


Downloads

thesis

 ppt

Human Pose Estimation: Extension and Application


Digvijay Singh (Homepage)

Abstract

Understanding human appearance in images and videos is one of the most fundamental and most explored areas in the field. Describing human appearance can be interpreted as a combination of smaller, more fundamental aspects such as posture, gesture and outlook; by composing them we try to extract holistic meaning from semantically lower-level information. The utility of understanding human appearance lies in the industrial demand for applications that analyze humans and their interaction with their surroundings. This thesis tackles two such aspects: (i) the more fundamental problem of human pose estimation, and (ii) a deeper level of understanding, cloth parsing based on pose estimation.

Determining the locations and configuration of human body joints is known as the human pose estimation problem. In this work we address human pose estimation for video sequences, exploiting the redundant information that video naturally provides. The proposed iterative method uses a generic base model to parse the data in its first iteration. Each following iteration runs a three-step pipeline: grabbing confidently positive detections from the previous iteration using our novel selection criteria, fine-tuning the external-to-base parameters to the local distribution by synthesizing exemplars from the picked detections, and enforcing the learned information using an updated amalgamation model. The resulting pipeline propagates correctness through the temporal neighborhoods of a video sequence. Previous methods that use the same base models have relied more on tracking strategies; in the unbiased experiments we conducted, our approach proved much more robust and better performing overall.

In the second half, we pursue a deeper understanding of human appearance, namely cloth parsing: predicting the types of clothes worn by humans and their segmented regions in images. Our work focuses on adding robustness to a previously formulated method. Conceivably, determining the likely regions for each cloth type depends on the underlying pose skeleton of the human, e.g. a hat will be worn on the head. Hence, pose information is key to the cloth parsing problem, but incorrect body-part estimates can likewise lead to false cloth detections and segmentations. The previous method uses pose information from a pictorial-structures model, whereas we update the formulation to incorporate information from more robust part detectors. The changed model performs better in unconstrained outdoor settings. However, the performance of both the existing and the proposed methods remains below par and appears non-viable for practical applications; to examine why, we report a set of experiments exploring the problem from different angles and record our observations.

 

Year of completion: January 2017
Advisor: Prof. C.V. Jawahar

Related Publications

  • Digvijay Singh, Vineeth Balasubramanian, C. V. Jawahar - Fine-Tuning Human Pose Estimations in Videos, Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2016. [PDF]

  • Nataraj Jammalamadaka, Ayush Minocha, Digvijay Singh and C V Jawahar - Parsing Clothes in Unrestricted Images, Proceedings of the 24th British Machine Vision Conference (BMVC), 09-13 Sep. 2013, Bristol, UK. [PDF]


Downloads

thesis

 ppt

Script and Language Identification for Document Images and Scene Texts


Ajeet Kumar Singh

Abstract

In recent times, there has been an increase in optical character recognition (OCR) solutions for recognizing text from scanned document images and from scene text captured with mobile devices. Many of these solutions work very well for an individual script or language. But in a multilingual environment such as India, where a document image or scene image may contain more than one language, these individual OCRs fail significantly. Hence, to recognize the text in a multilingual document or scene image, we must manually specify the script or language of each text block, and only then apply the corresponding script- or language-specific OCR. This step prevents us from moving towards fully automated multilingual OCRs.

This thesis presents two effective solutions for automatically identifying the script and language of document images and scene text. Even though recognition of scene text has been researched heavily, the script identification problem in this area is relatively new. We present an approach which represents scene-text images using mid-level stroke-based features pooled from densely computed local features. These features are then classified into languages using an off-the-shelf classifier. The approach is efficient and requires very little labeled data for script identification. It has been evaluated on the recently introduced video script dataset (CVSI). We also introduce and benchmark a more challenging Indian Language Scene Text (ILST) dataset for evaluating the performance of our method.

For script and language identification in documents, we investigate the utility of recurrent neural networks (RNNs). These problems have been attempted in the past with representations computed from the distribution of connected components or characters (e.g. texture, n-grams) over a larger segment (a paragraph or a page). We argue that one can predict the script or language very accurately from minimal evidence (e.g. only a word or a line) with the help of a pre-trained RNN. We propose a simple and generic solution for the task of script and language identification without any special tuning, verified on a large corpus of more than 15.03M words across 55K documents comprising 15 scripts and languages.
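To see why script identification at the word level is tractable while language identification within a script needs a learned model, consider the trivial baseline below. It is emphatically not the RNN method of this thesis: it merely counts which Unicode block a word's characters fall into, which separates scripts like Latin, Devanagari and Telugu but can say nothing about, say, Hindi versus Marathi, both written in Devanagari. The block ranges are an illustrative subset:

```python
# Illustrative Unicode block ranges (start, end codepoint) for a few scripts.
SCRIPT_RANGES = {
    "Latin":      (0x0041, 0x024F),
    "Devanagari": (0x0900, 0x097F),
    "Telugu":     (0x0C00, 0x0C7F),
}

def guess_script(text):
    """Guess the dominant script of `text` by counting which Unicode
    block most of its characters fall into."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
    return max(counts, key=counts.get)

print(guess_script("నమస్కారం"))  # Telugu
```

Distinguishing languages that share a script, and doing so robustly on noisy word images rather than clean text, is exactly where the pre-trained RNN approach above earns its keep.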

This thesis aims to provide better recognition solutions for the document and scene-text domains through two simple but effective solutions for script and language identification. The proposed algorithms can be used in multilingual settings, where the identification module first identifies the script or language of an incoming document or scene text before passing it to the corresponding recognition module.

Year of completion: January 2017
Advisor: Prof. C.V. Jawahar

Related Publications

  • Minesh Mathew, Ajeet Kumar Singh and C V Jawahar - Multilingual OCR for Indic Scripts, Proceedings of 12th IAPR International Workshop on Document Analysis Systems (DAS'16), 11-14 April 2016, Santorini, Greece. [PDF]

  • Ajeet Kumar Singh, Anand Mishra, Pranav Dabral and C V Jawahar - A Simple and Effective Solution for Script Identification in the Wild, Proceedings of 12th IAPR International Workshop on Document Analysis Systems (DAS'16), 11-14 April 2016, Santorini, Greece. [PDF]

  • Ajeet Kumar Singh, C. V. Jawahar - Can RNNs Reliably Separate Script and Language at Word and Line Level?, Proceedings of the 13th IAPR International Conference on Document Analysis and Recognition, 23-26 Aug 2015, Nancy, France. [PDF]


Downloads

thesis

 ppt
