Document Image Retrieval using Bag of Visual Words Model

Ravi Shekhar

As computer technology has developed, storage has become more affordable, and traditional libraries are making their collections available electronically, on the internet or on digital media. Searching these collections requires a good indexing structure, which can be built manually or automatically. Manual indexing is time consuming and expensive, so automation is the only feasible option. The first approach in this direction was to convert document images into text with an optical character recognizer (OCR) and then perform a traditional text search. This works well for European languages, for which OCRs are robust, but not reliably for Indian and other non-European languages such as Hindi, Telugu and Malayalam. Even for European languages, reliable OCRs are not available for highly degraded document images.

To overcome the limitations of OCR, word spotting emerged as an alternative. In word spotting, a given word image is represented by features, and matching is based on feature distance. The main challenges are extracting suitable features and choosing the feature comparison mechanism. Profile features are a popular way to represent a word image, with Euclidean distance used for comparison. Because words differ in length, all images are scaled to the same size, which loses a lot of information. Later, a Dynamic Time Warping (DTW) based approach was proposed for matching; it does not require scaling the images. The problem with the DTW based approach is that it is very slow and cannot be used in practical, large scale applications.
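The trade-off described above can be illustrated with a minimal DTW sketch. It assumes each word image has already been turned into a sequence of per-column feature vectors; the function name `dtw_distance` and the unconstrained warping path are illustrative choices, not the exact formulation used in the thesis.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW cost between two feature sequences.

    a has shape (n, d) and b has shape (m, d): one d-dimensional
    feature vector per image column. The sequences may differ in
    length, so no rescaling of the word images is needed.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # Euclidean distance between every pair of columns.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # acc[i, j] = cheapest alignment of a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # column of a aligned again
                acc[i, j - 1],      # column of b aligned again
                acc[i - 1, j - 1],  # both sequences advance
            )
    return acc[n, m]
```

Filling the table costs on the order of n × m operations per pair of words, and a query has to be compared against every word in the collection, which is why this kind of matching scales poorly.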

In the first part of this thesis, we describe a Bag of Visual Words (BoVW) based approach for retrieving similar word images from a large database efficiently and accurately. We show that a text retrieval system can be adapted to build a word image retrieval solution, which helps in achieving scalability. We demonstrate the method on more than 1 million word images with sub-second retrieval time. We validate the method on four Indian languages and report a mean average precision of more than 0.75. We represent a word image as a histogram of the visual words present in it. Visual words are quantized representations of local regions; in this work, SIFT descriptors at interest points are used as feature vectors. To address the lack of spatial structure in the BoVW representation, we re-rank the retrieved list, which significantly improves performance.
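The idea of reusing text retrieval machinery can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes local descriptors (for example SIFT) have already been extracted and that a codebook of visual words is given, and the names `bovw_histogram` and `InvertedIndex`, the smoothed tf-idf weighting and the cosine scoring are assumptions made for the example. The spatial re-ranking step is not shown.

```python
import numpy as np
from collections import defaultdict

def bovw_histogram(descriptors, codebook):
    """Count how often each visual word occurs in one word image."""
    diff = descriptors[:, None, :] - codebook[None, :, :]
    words = (diff ** 2).sum(axis=2).argmin(axis=1)  # nearest codeword
    return np.bincount(words, minlength=len(codebook)).astype(float)

class InvertedIndex:
    """tf-idf weighted inverted index over BoVW histograms."""

    def __init__(self, histograms):
        H = np.asarray(histograms, dtype=float)
        n_docs = H.shape[0]
        df = (H > 0).sum(axis=0)                    # document frequency
        self.idf = np.log((1 + n_docs) / (1 + df)) + 1.0
        W = H * self.idf
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W = W / np.where(norms == 0, 1.0, norms)
        # visual word -> list of (image id, weight) for images containing it
        self.postings = defaultdict(list)
        for doc, row in enumerate(W):
            for word in np.flatnonzero(row):
                self.postings[int(word)].append((doc, row[word]))

    def search(self, histogram, top_k=10):
        """Return (image id, cosine score) pairs, best first."""
        q = np.asarray(histogram, dtype=float) * self.idf
        norm = np.linalg.norm(q)
        if norm == 0:
            return []
        q = q / norm
        scores = defaultdict(float)
        # Only images sharing a visual word with the query are visited.
        for word in np.flatnonzero(q):
            for doc, weight in self.postings.get(int(word), []):
                scores[doc] += q[word] * weight
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```

Because a word image contains only a small fraction of the vocabulary, each query touches a few posting lists rather than the whole database, which is what makes this style of lookup fast on large collections.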

We then enhance the BoVW approach with query expansion, text query support and a Locality-constrained Linear Coding (LLC) based retrieval system. In query expansion, we use the initial results to modify the initial query and obtain better results. In the BoVW model, the query is given by example, but users generally prefer to give the query as text, such as "Google" or "Bing". Text queries are supported by the same BoVW model. Finally, LLC is used to achieve high recall. The LLC scheme learns a data representation using the nearest codewords and improves retrieval performance.
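Query expansion can be sketched as below. The abstract does not say which variant is used, so this example assumes simple average query expansion over L2-normalized histograms; the function names and the `top_k` parameter are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors (rows) to unit length, leaving zero vectors as is."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)

def rank(query, database):
    """Cosine scores and ranking; rows of database are unit vectors."""
    scores = database @ query
    order = np.argsort(-scores, kind="stable")
    return order, scores

def expand_query(query, database, top_k=3):
    """Merge the query with its top_k initial results (assumed variant)."""
    order, _ = rank(query, database)
    top = database[order[:top_k]]
    return l2_normalize(query + top.sum(axis=0))

# Toy data: 4 word images described by 3 visual words.
database = l2_normalize([[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
query = l2_normalize([1, 0, 0])

_, before = rank(query, database)
_, after = rank(expand_query(query, database, top_k=2), database)
```

In this toy data the third image shares no visual word with the original query, so its initial score is zero. It does share a word with the second result, so the expanded query gives it a positive score, which is the mechanism by which expansion can recover relevant images the first pass missed.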

Most scalable document search systems use features such as SIFT and SURF, which were originally designed for natural images. Natural images have a lot of variation and carry extra information in their gray levels. Document images are binary, so they need features designed specifically for them. We propose patch based features built on profile features. Compared with SIFT, the proposed feature obtains similar performance and is faster to compute, which makes our pre-processing step very fast.
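The abstract does not define the proposed feature in detail, so the following is only a sketch of what a patch based profile descriptor could look like. It assumes four commonly used column profiles (projection, upper, lower and ink transitions) and a sliding window over columns; `column_profiles`, `patch_features`, `patch_width` and `step` are hypothetical names and parameters.

```python
import numpy as np

def column_profiles(binary):
    """Per-column profiles of a binary word image (1 = ink).

    Returns an array of shape (width, 4) holding, for each column:
    projection (fraction of ink pixels), upper profile (first ink row),
    lower profile (last ink row) and number of ink/background
    transitions, all divided by the image height.
    Columns without ink get 1.0 for both upper and lower profile.
    """
    img = (np.asarray(binary) > 0).astype(int)
    h = img.shape[0]
    has_ink = img.any(axis=0)

    projection = img.sum(axis=0) / h
    upper = np.where(has_ink, img.argmax(axis=0), h) / h
    lower = np.where(has_ink, h - 1 - img[::-1].argmax(axis=0), h) / h
    transitions = np.abs(np.diff(img, axis=0)).sum(axis=0) / h
    return np.stack([projection, upper, lower, transitions], axis=1)

def patch_features(binary, patch_width=8, step=4):
    """One descriptor per sliding patch: its column profiles, flattened."""
    profiles = column_profiles(binary)
    width = profiles.shape[0]
    starts = range(0, width - patch_width + 1, step)
    return np.array([profiles[s:s + patch_width].ravel() for s in starts])
```

Each profile needs only a few passes over the pixels, with no scale-space construction or gradient histograms, which is consistent with the claim that such features are cheaper to compute than SIFT. The resulting patch descriptors could be quantized into visual words in the same way as SIFT descriptors.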

We have demonstrated that a recognition-free approach is feasible for large scale word image retrieval. In future work, a similar approach can be used for handwritten documents, camera based retrieval and natural scene text retrieval. A better and more efficient descriptor, designed specifically for document words, is also needed.


Year of completion: June 2013
Advisor: C. V. Jawahar
