Studies in Recognition of Telugu Document Images
Venkat Rasagna (homepage)
The rapid evolution of information technology (IT) has prompted a massive growth in digitizing books. Accessing these huge digital collections require solutions, which will enable the archived ma- terials to be searchable. These solutions can only be acquired through research in document image understanding. In the last three decades, many significant developments have been made in the recog- nition of Latin-based scripts. The recognition systems for Indian languages are very far behind the European language recognizers like English. The diversity of archived printed document poses an ad- ditional challenge to document analysis and understanding. In this work, we explore the recognition of printed text in Telugu, a south Indian language. We begin our work by building the Telugu script model for recognition and adopting an existing optical character recognition system for the same. A comprehensive study on all the modules of the optical recognizer is done, with the focus mainly on the recognition module. We then evaluate the recognition module by testing it on the synthetic and real datasets. We achieved an accuracy of 98% on synthetic dataset, but the accuracy drops to 91% on 200 pages from the scanned books (real dataset). To analyze the drop in accuracy and the modules propagating errors, we create datasets with different qualities namely laser print dataset, good real dataset and challenging real dataset. Analysis of these experiments revealed the major problems in the character recognition module. We observed that the recognizer is not robust enough to tackle the multifont problem. The classifier’s component accuracy varied significantly on pages from different books. Also, there was a huge difference in the component and word accuracies. Even with a component accuracy of 91%, the word accuracy was just 62%. This motivated us to solve the multifont problem and improve the word accuracies. Solving these problems would boost the OCR accuracy of any language.
A major requirement in the design of robust OCRs is the invariance of feature extraction scheme with the popular fonts used in the print. Many statistical and structural features have been tried for character classification in the past. In this work, we get motivated by the recent successes in object category recognition literature and use a spatial extension of the histogram of oriented gradients (HOG) for character classification. We conducted the experiments on 1.46 million Telugu character samples in 359 classes and 15 fonts. On this data set, we obtain an accuracy of 96-98% with an SVM classifier. Typical optical character recognizer (OCR) only uses local information about a particular character or word to recognize it. In this thesis, we also propose a document level OCR which exploits the fact that multiple occurrences of the same word image should be recognized as the same word. Whenever the OCR output differs for the same word, it must be due to recognition errors. We propose a method to identify such recognition errors and automatically correct them. First, multiple instances of the same word image are clustered using a fast clustering algorithm based on locality sensitive hashing. Three different techniques are proposed to correct the OCR errors by looking at differences in the OCR output for the words in the cluster. They are character majority voting, an alignment technique based on dynamic time warping and one based on Progressive Alignment of multiple sequences. In this work, we demonstrate the approach over hundreds of document images from English and Telugu books by correcting the output of the best performing OCRs for English and Telugu. The recognition accuracy at word level is improved from 93% to 97% for English and from 58% to 66% for Telugu. Our approach is applicable to documents in any language or script.
|Year of completion:
|Prof. C.V. Jawahar