Word Recognition of Indic Scripts

Naveen TS

Optical Character Recognition (OCR) for most Indian scripts is often formulated as an isolated character (symbol) classification task followed by a post-classification stage (with modules such as Unicode generation and error correction) that generates the textual representation. Such approaches are prone to failure because of (i) the difficulty of designing a reliable word-to-symbol segmentation module that works robustly on degraded (cut or fused) images, and (ii) the difficulty of converting the classifier outputs into a valid Unicode sequence. In this work, we look at two important aspects of word recognition: converting word images to text strings, and detecting and correcting errors in words represented as Unicode.

In this thesis, we propose a formulation in which the expectations on the two critical modules of a traditional OCR (i.e., segmentation and isolated character recognition) are minimized, and the harder recognition task is modelled as learning an appropriate sequence-to-sequence transcription scheme. We thus formulate recognition as a direct transcription problem. Given many examples of feature sequences and their corresponding Unicode representations, our objective is to learn a mapping that converts a word directly into a Unicode sequence. This formulation has multiple practical advantages: (i) it significantly reduces the number of classes for Indian scripts; (ii) it removes the need for word-to-symbol segmentation; (iii) it does not require strong annotation of symbols to design the classifiers; and (iv) it directly generates a valid Unicode sequence. We test our method on more than 5000 pages of printed documents in multiple languages. We design a script-independent, segmentation-free architecture that works well for 7 Indian scripts. Our method is compared against other state-of-the-art OCR systems and evaluated on a large corpus.
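
As a rough illustration of what direct transcription can look like at the output end, the sketch below turns per-frame label scores into a Unicode string without any symbol segmentation. It assumes a CTC-style output layer with a reserved blank label, which this abstract does not name; the label set, the scores, and the function name greedy_decode are all made up for illustration and are not taken from the thesis.

```python
# Hypothetical sketch: greedy CTC-style decoding of per-frame scores into a
# Unicode string. The CTC-style blank label is an assumption, and the label
# set and scores below are toy values, not the thesis's actual model output.

BLANK = 0  # index reserved for "no label at this frame" (assumed convention)

# Toy label set: class index -> Unicode code point.
LABELS = {
    1: "\u0915",  # DEVANAGARI LETTER KA
    2: "\u093E",  # DEVANAGARI VOWEL SIGN AA
    3: "\u092E",  # DEVANAGARI LETTER MA
}


def greedy_decode(frame_scores):
    """Pick the best label per frame, merge adjacent repeats, drop blanks."""
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    decoded = []
    previous = None
    for label in best:
        # Emit a code point only when the label changes and is not blank.
        if label != previous and label != BLANK:
            decoded.append(LABELS[label])
        previous = label
    return "".join(decoded)


# Six frames sliding over a word image; each row scores [blank, KA, AA, MA].
scores = [
    [0.1, 0.8, 0.0, 0.1],  # KA
    [0.1, 0.7, 0.1, 0.1],  # KA again (adjacent repeat, merged)
    [0.2, 0.1, 0.6, 0.1],  # AA
    [0.9, 0.0, 0.1, 0.0],  # blank
    [0.1, 0.0, 0.1, 0.8],  # MA
    [0.6, 0.1, 0.0, 0.3],  # blank
]
word = greedy_decode(scores)
print(word)  # the code points U+0915 U+093E U+092E, in that order
```

Because the classes here are Unicode code points rather than rendered symbols, the output is already a Unicode string, and no per-symbol bounding boxes are needed to produce it.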

The second contribution of this thesis is an investigation of error detection and correction in highly inflectional languages, taking Malayalam and Telugu as examples. Detecting errors in OCR output with dictionaries and statistical language models (SLMs) has been common practice in post-processor design for some time, and multiple strategies have been used successfully for English. However, this success has not yet translated into better error detection for many inflectional languages, especially Indian languages. The reasons include large lists of unique words, a lack of linguistic resources, and a lack of reliable language models. In this thesis, we investigate the major challenges in developing error detection techniques for highly inflectional Indian languages. We compare and contrast several attributes of English with those of inflectional languages such as Telugu and Malayalam. We analyse statistics computed from popular corpora and relate the resulting observations to error detection schemes. We propose a method that detects errors in Telugu and Malayalam with an F-score comparable to that of less inflectional languages such as Hindi. Our method learns from error patterns and SLMs.
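
To make the dictionary-plus-SLM idea concrete, here is a minimal sketch of a word-level error detector and the F-score used to evaluate one. It is not the method proposed in the thesis: the character-bigram model, the add-one smoothing, the threshold value, and the rule for combining the two signals are assumptions, and the Latin-alphabet toy words stand in for Telugu or Malayalam text.

```python
import math
from collections import Counter


def train_char_bigrams(words):
    """Count character bigrams, with ^ and $ marking word boundaries."""
    pair_counts, context_counts = Counter(), Counter()
    symbols = set()
    for word in words:
        padded = "^" + word + "$"
        symbols.update(padded)
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts, len(symbols)


def avg_log_prob(word, model):
    """Average add-one-smoothed log probability per character transition."""
    pair_counts, context_counts, vocab_size = model
    padded = "^" + word + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((pair_counts[(a, b)] + 1)
                          / (context_counts[a] + vocab_size))
    return total / (len(padded) - 1)


def is_error(word, dictionary, model, threshold):
    """Flag a word only if it is out of dictionary AND unlikely under the model.

    With many unseen inflected forms, a dictionary miss alone is a weak
    signal, so this hypothetical rule also consults the language model.
    """
    if word in dictionary:
        return False
    return avg_log_prob(word, model) < threshold


def f_score(predicted, actual):
    """F-score of predicted error flags against ground-truth error flags."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy stand-in data; a real system would train on a large corpus.
dictionary = {"banana", "bandana", "cabana"}
model = train_char_bigrams(dictionary)
THRESHOLD = -1.6  # illustrative value; would be tuned on held-out data

for candidate in ["banana", "banan", "zqxj"]:
    print(candidate, is_error(candidate, dictionary, model, THRESHOLD))
```

In this toy setup the unseen but plausible form "banan" is not flagged, while "zqxj" is, which is the behaviour a dictionary-only check cannot give when the set of valid word forms is very large.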


Year of completion: January 2014
Advisor: Prof. C. V. Jawahar


Related Publications

  • Praveen Krishnan, Naveen Sankaran, Ajeet Kumar Singh and C. V. Jawahar - Towards a Robust OCR System for Indic Scripts. Proceedings of the 11th IAPR International Workshop on Document Analysis Systems, 7-10 April 2014, Tours-Loire Valley, France.

  • Naveen Sankaran, Aman Neelappa and C. V. Jawahar - Devanagari Text Recognition: A Transcription Based Formulation. Proceedings of the 12th International Conference on Document Analysis and Recognition, 25-28 Aug. 2013, Washington DC, USA.

  • Naveen Sankaran and C. V. Jawahar - Error Detection in Highly Inflectional Languages. Proceedings of the 12th International Conference on Document Analysis and Recognition, 25-28 Aug. 2013, Washington DC, USA.

  • Naveen Sankaran and C. V. Jawahar - Recognition of Printed Devanagari Text Using BLSTM Neural Network. Proceedings of the 21st International Conference on Pattern Recognition, 11-15 Nov. 2012, Japan, pp. 322-325, Vol. 21, ISBN 978-4-9906441-1-6.