Large Scale Character Classification

Large-scale pattern recognition systems are necessary in many real-life problems such as object recognition, bioinformatics, character recognition, biometrics, and data mining. This thesis focuses on pattern classification issues associated with character recognition, with special emphasis on Malayalam. We propose an architecture for character classification and demonstrate the utility of the proposed method by validating it on a large dataset. The challenges in this work include (i) classification in the presence of a large number of classes, (ii) efficient implementation of effective large-scale classification, and (iii) performance analysis and learning in large datasets (millions of examples).

Throughout this work, we use examples of characters (and symbols) extracted from real-life Malayalam document images. We first address the problem of developing a dataset annotated at the symbol level from coarsely annotated (say, word-level) data, using a dynamic programming based algorithm that aligns the text (in Unicode) with the image (in pixels). The algorithm is then generalized to handle common degradations in the form of cuts, merges, and other artifacts. As a byproduct, this algorithm allows us to quantitatively estimate the quality of books, documents, and words. The resulting large dataset makes large-scale character classification experiments possible.
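The alignment idea can be illustrated with a minimal sketch. The abstract does not specify the cost function, so the width-based cost, the fixed `penalty`, and the function name `align_symbols` below are assumptions made for illustration only. The sketch aligns the expected symbols of a word (from its text) with the connected components found in the word image, allowing one-to-one matches, merges (two symbols fused into one component), and cuts (one symbol broken into two components).

```python
from math import inf


def align_symbols(symbol_widths, component_widths, penalty=1.0):
    """Align expected symbols to image components with a DP over prefixes.

    Returns (total_cost, ops). Each op is (kind, symbol_indices,
    component_indices), where kind is "match", "merge" (two symbols
    fused into one component) or "cut" (one symbol in two components).
    The width-difference cost is a hypothetical stand-in.
    """
    s, c = symbol_widths, component_widths
    n, m = len(s), len(c)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # One symbol explained by one component.
            candidates = [
                (cost[i - 1][j - 1] + abs(s[i - 1] - c[j - 1]), "match", 1, 1)
            ]
            if i >= 2:  # two symbols explained by one component
                width_gap = abs(s[i - 2] + s[i - 1] - c[j - 1])
                candidates.append(
                    (cost[i - 2][j - 1] + width_gap + penalty, "merge", 2, 1)
                )
            if j >= 2:  # one symbol explained by two components
                width_gap = abs(s[i - 1] - c[j - 2] - c[j - 1])
                candidates.append(
                    (cost[i - 1][j - 2] + width_gap + penalty, "cut", 1, 2)
                )
            best = min(candidates)
            if best[0] < inf:
                cost[i][j] = best[0]
                back[i][j] = best[1:]

    if cost[n][m] == inf:
        raise ValueError("no alignment exists with these operations")

    # Walk the back-pointers from the end to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        kind, di, dj = back[i][j]
        ops.append((kind, list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    ops.reverse()
    return cost[n][m], ops


# Toy word: four expected symbols, four observed components.
symbol_widths = [10, 12, 8, 10]
component_widths = [10, 20, 4, 6]
total_cost, ops = align_symbols(symbol_widths, component_widths)
degradations = sum(1 for kind, _, _ in ops if kind != "match")
```

Counting the non-match operations, as in `degradations` above, gives a rough per-word degradation count in the spirit of the quality estimate mentioned in the text. A real system would presumably use a richer matching cost than widths alone.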

We then conduct an empirical study of classifier and feature combinations to assess their suitability for character classification. The scope of this study includes (a) the applicability of a spectrum of classifiers and features, (b) the scalability of classifiers, (c) the sensitivity of features to degradation, (d) generalization across fonts, and (e) applicability across scripts. All of these aspects are important for solving the character classification problem. Our empirical studies and theoretical results provide convincing evidence for the utility of multiple pairwise SVM classifiers for this problem. However, a direct use of multiple SVM classifiers has certain disadvantages: (i) since there are nC2 = n(n-1)/2 pairwise classifiers for n classes, the storage and computational complexity of the final classifier becomes too high for many practical applications, and (ii) the classifiers directly provide a class label and fail to provide an estimate of the posterior probability. We address these issues by efficiently designing a Decision Directed Acyclic Graph (DDAG) classifier and using an appropriate feature space. We also propose efficient methods to minimize the storage complexity of support vectors for classification, and we extend our algebraic simplification method to simplify hierarchical classifier solutions. We use pairwise SVM classifiers in a DDAG architecture for classification, with a linear kernel, considering that most of the classes in a large-class problem are linearly separable.
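A minimal sketch of DDAG evaluation over pairwise linear classifiers follows. It is not the thesis implementation: the pairwise classifiers here are hyperplanes that bisect the two class means, used as a simple stand-in for trained linear SVMs, and the class labels, data, and function names are hypothetical. What the sketch does show is the structural point of the paragraph above: n(n-1)/2 pairwise classifiers are stored, but a DDAG evaluates only n-1 of them per sample, eliminating one candidate class at each step.

```python
from itertools import combinations


def mean_hyperplane(xs_a, xs_b):
    """Stand-in for a trained linear SVM (assumption): the hyperplane
    bisecting the two class means. Positive side means class a."""
    dim = len(xs_a[0])
    mu_a = [sum(x[d] for x in xs_a) / len(xs_a) for d in range(dim)]
    mu_b = [sum(x[d] for x in xs_b) / len(xs_b) for d in range(dim)]
    w = [a - b for a, b in zip(mu_a, mu_b)]
    bias = -sum(wd * (a + b) / 2 for wd, a, b in zip(w, mu_a, mu_b))
    return w, bias


def train_pairwise(data):
    """data maps label -> list of feature vectors.
    Builds one linear classifier per unordered pair of labels."""
    return {
        (a, b): mean_hyperplane(data[a], data[b])
        for a, b in combinations(sorted(data), 2)
    }


def ddag_predict(x, labels, pairwise):
    """Evaluate the DDAG: test the first candidate against the last
    and drop the loser until a single class remains."""
    candidates = sorted(labels)
    evaluations = 0
    while len(candidates) > 1:
        a, b = candidates[0], candidates[-1]
        w, bias = pairwise[(a, b)]
        score = sum(wd * xd for wd, xd in zip(w, x)) + bias
        evaluations += 1
        if score >= 0:
            candidates.pop()    # class b is eliminated
        else:
            candidates.pop(0)   # class a is eliminated
    return candidates[0], evaluations


# Hypothetical toy data: four classes in a 2-D feature space.
data = {
    "ka": [[0, 0], [1, 0]],
    "kha": [[10, 0], [9, 1]],
    "ga": [[0, 10], [1, 9]],
    "gha": [[10, 10], [9, 9]],
}
pairwise = train_pairwise(data)
label, evaluations = ddag_predict([9, 0.5], data.keys(), pairwise)
```

With four classes, six pairwise classifiers are stored and three are evaluated for the query point. Because each classifier is linear, it is fully described by one weight vector and a bias, which is part of why a linear kernel keeps storage manageable when the number of classes is large.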

We carried out our classification experiments on a huge dataset, with more than 200 classes and 50 million examples, collected from 12 scanned Malayalam books. Quality levels are assigned to the document image pages based on the number of cuts and merges detected, and the experiments are conducted on pages of varying quality. We achieve reasonably high accuracy on all the data considered. We perform an extensive evaluation on this dataset, which spans more than 2000 pages.

In the presence of a large and diverse collection of examples, it becomes important to continuously learn and adapt. Such an approach could be especially significant when recognizing books. We extend our classifier system to continuously improve its performance through feedback and retraining.

To summarize, the major contributions of this work are:

1. A highly script-independent dynamic programming (DP) based method to build large datasets for training and testing character recognition systems.
2. Empirical studies on large datasets of various Indian languages to evaluate the performance of state-of-the-art classifiers and features.
3. A hierarchical method to reduce the computational complexity of the SVM classifier for large-class problems.
4. An efficient design and implementation of an SVM classifier that effectively handles large-class problems. The classifier module has been employed in an OCR system for Malayalam.
5. Performance evaluation of the above methods on a large dataset of twelve Malayalam books, comprising more than 2000 document pages.
6. A novel system for adapting a classifier to recognize symbols in a book.


Year of completion:	2010
Advisor:	C. V. Jawahar

Related Publications

Neeba N.V. and C. V. Jawahar - Empirical Evaluation of Character Classification Schemes, Proceedings of the 7th International Conference on Advances in Pattern Recognition (ICAPR 2009), Feb. 4-6, 2009, Kolkata, India.
Ilayaraja Prabhakaran, Neeba N.V., and C. V. Jawahar - Efficient Implementation of SVM for Large Class Problems, Proceedings of the 19th International Conference on Pattern Recognition (ICPR 08), Dec. 8-11, 2008, Florida, USA.
Neeba N.V. and C. V. Jawahar - Recognition of Books by Verification and Retraining, Proceedings of the 19th International Conference on Pattern Recognition (ICPR 08), Dec. 8-11, 2008, Florida, USA.

Book Chapter

N.V. Neeba, Anoop Namboodiri, C. V. Jawahar, and P. J. Narayanan - Recognition of Malayalam Documents, in Guide to OCR for Indic Scripts: Part 1: Document Recognition and Retrieval, 2010.