Optical Character Recognition as Sequence Mapping

Devendra Kumar Sahu (Homepage)


Digitization can provide a means of preserving the content of the materials by creating an accessible facsimile of the object in order to put less strain on already fragile originals such as out of print books. The document analysis community formed to address this by digitizing the content thus, making it easily shareable over Internet, making it searchable and, enabling language translation on them. In this thesis, we have tried to see optical character recognition as a mapping problem. We proposed extensions to two method and reduced its limitations. We proposed an application for sequence to sequence learning architecture which removed two limitations of previous state of art method based of connectionist temporal classification output layer. This method also gives representations which can be used for efficient retrieval. We also proposed an extension to profile features which enabled us to use same idea but by learning features from data.

In first work, we propose an application of sequence to sequence learning approach for printed text Optical Character Recognition. In contrast to present day existing state-of-art OCR solution which uses Connectionist Temporal Classification (CTC) output layer our approach makes minimalistic assumptions on the structure and length of the sequence. We use a two step encoder-decoder approach – (a) A recurrent encoder reads a variable length printed text word image and encodes it to a fixed dimensional embedding. (b) This fixed dimensional embedding is subsequently comprehended by decoder structure which converts it into a variable length text output. The learnt deep word image embedding from encoder can be used for printed text based retrieval systems. The expressive fixed dimensional embedding for any variable length input expedites the task of retrieval and makes it more efficient which is not possible with other recurrent neural network architectures. Thus single model can do predictions and features learnt with supervision can be used for efficient retrieval.

In the second work, we investigate the possibility of learning an appropriate set of features for designing OCR for a specific language. We learn the language specific features from the data with no supervision. In this work, we learn features using a unsupervised feature learning and use it with the RNN based recognition solution. We learn features using a stacked Restricted Boltzman Machines (RBM). These features can be interpreted as deep extension of projection profiles. These features can be used as a plug and play features to get improvements where profile feature are used. We validate these features on five different languages. In addition, these novel features also resulted in better convergence rate of the RNNs.


Year of completion:  January 2017
 Advisor : Prof. C.V. Jawahar

Related Publications

  • Devendra Sahu, C. V. Jawahar - Unsupervised Feature Learning For Optical Character Recognition Proceedings of the 13th IAPR International Conference on Document Analysis and Recognition, 23-26 Aug 2015 Nancy, France. [PDF]