Text Recognition and Retrieval in Natural Scene Images

Udit Roy (homepage)


In the past few years, text in natural scene images has emerged as a key feature for content-based retrieval. Once extracted, it can be used in search engines to provide relevant information about the images. Robust and efficient techniques from the document analysis and vision communities have been borrowed to address the challenge of digitizing text in such images in the wild. In this thesis, we address the common challenges of scene text analysis by proposing novel solutions in both the recognition and retrieval settings. We develop end-to-end pipelines that detect and recognize text, the two core challenges of scene text analysis.

For the detection task, we first study and categorize all major publications since 2000 based on their architecture. To broaden the scope of any single detection method, we propose a fusion of two complementary styles of detection. The first method evaluates MSER clusters as text or non-text using an AdaBoost classifier; it outperforms other publicly available implementations on the standard ICDAR 2011 and MRRC datasets. The second method generates text region proposals with high recall using a CNN-based text/non-text classifier. We compare this method with other object region proposal algorithms on the ICDAR datasets and analyse our results. Leveraging the high recall of the proposals, we fuse the two detection methods to obtain a flexible detection scheme.
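One way to picture the fusion step is as follows: keep the boxes from the high-precision detector (the MSER + AdaBoost route), and add any high-recall CNN proposal that does not overlap an already-kept box, since it likely covers a missed word. This is a minimal sketch under assumed conventions (corner-format boxes, an illustrative IoU threshold), not the thesis's exact fusion scheme.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def fuse_detections(precise_boxes, proposal_boxes, thr=0.5):
    """Start from the precise detections; append each proposal whose
    IoU with every kept box stays below thr (a likely missed region)."""
    fused = list(precise_boxes)
    for p in proposal_boxes:
        if all(iou(p, f) < thr for f in fused):
            fused.append(p)
    return fused
```

For example, a proposal that largely overlaps an MSER detection is discarded, while one covering a new image region is kept, which is how the proposals' recall complements the first method's precision.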

For the recognition task, we propose a conditional random field (CRF) based framework for recognizing word images. We model the character locations as nodes and the bigram interactions as pairwise potentials. Observing that interaction potentials computed from a large lexicon are less effective than those from a small lexicon, we propose an iterative method that alternates between finding the most likely solution and refining the interaction potentials. We evaluate our method on public datasets and obtain nearly 15% improvement in recognition accuracy over baseline methods on the IIIT-5K word dataset with a large lexicon of 0.5 million words. We also propose a text-query-based retrieval task for word images and evaluate retrieval performance in various settings.
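Since the character locations form a chain, the most likely word under unary (per-character) scores and bigram pairwise potentials can be found exactly by Viterbi-style dynamic programming. The sketch below illustrates that decoding step only; the score values, the scoring function names, and the alphabet are toy assumptions, not the thesis's learned potentials.

```python
import math

def viterbi_decode(unary, bigram, alphabet, missing=-1.0):
    """unary: one dict per character position mapping char -> score
    (higher is better). bigram: dict (prev_char, char) -> pairwise
    score; unseen bigrams fall back to `missing`. Returns the word
    maximizing the total unary + pairwise score."""
    # best[c] = (score of best path ending in char c, that path)
    best = {c: (unary[0].get(c, -math.inf), c) for c in alphabet}
    for pos in range(1, len(unary)):
        new_best = {}
        for c in alphabet:
            new_best[c] = max(
                (s + bigram.get((p, c), missing) + unary[pos].get(c, -math.inf),
                 path + c)
                for p, (s, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]
```

The iterative scheme described above would wrap a loop around this decoder: decode with the current potentials, then re-estimate the bigram potentials from the lexicon entries consistent with the decoded result, and repeat.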

Finally, we present two contrasting end-to-end recognition frameworks for scene text analysis. The first framework consists of text segmentation followed by a standard printed-text OCR: the segmented image is fed to Tesseract to obtain word regions and labels. This case-sensitive, lexicon-free approach performs on par with the other successful pipelines of the decade on the ICDAR 2003 dataset. The second framework combines the CNN-based region proposal method with the CRF-based recognizer under various lexicon sizes. Additionally, we use the latter to retrieve scene images with text queries.


Year of completion: October 2016
Advisor: Prof. C.V. Jawahar

Related Publications