Word Hashing for Efficient Search In Document Image Collections

Anand Kumar

A large number of document image collections are now being scanned and made available over the Internet or in digital libraries. Effective access to such information sources is limited by the lack of efficient retrieval schemes. Text search methods require efficient and robust optical character recognizers (OCR), which are presently unavailable for Indian languages. Word spotting (word image matching) may instead be used to retrieve word images in response to a word image query. The approaches used for word spotting so far, dynamic time warping and/or nearest neighbor search, tend to be slow for large collections of books. Direct matching of images is inefficient due to the complexity of matching and thus impractical for large databases. In general, indexing and retrieval methods for document images cluster similar words and build indexes from the representatives of the clusters. Building such a clustering-based index takes a long time, because the clustering step relies on complex image matching procedures. This thesis addresses the problem by directly hashing word image representations.

This thesis presents an efficient mechanism for indexing and retrieval in large document image collections. First, document images are segmented into words. Features are then computed at the word level and indexed. Word retrieval is done very efficiently with content-sensitive hashing (CSH), which uses an approximate nearest neighbor search technique called locality sensitive hashing (LSH). The word images are hashed into hash tables using the word-level features. Content-sensitive hash functions are chosen so that similar words have a high probability of falling into the same bucket of the hash table. The sub-linear time CSH scheme makes the search very fast without degrading accuracy. Experiments on a collection of Telugu books by Kalidasa, the classical Indian poet of antiquity, demonstrate that word images can be searched in a few milliseconds. Hashing the words thus reduces search time significantly and makes searching document image collections practical.
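
The hashing idea described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not specify the word-level features or the content-sensitive hash functions, so this example assumes that each word image is already represented as a fixed-length feature vector and swaps in a standard LSH family (random hyperplanes, which tends to group vectors pointing in similar directions). The class name `LSHIndex` and the parameters `num_tables` and `bits_per_key` are hypothetical.

```python
import random
from collections import defaultdict


class LSHIndex:
    """Toy LSH index over fixed-length feature vectors (illustrative only)."""

    def __init__(self, dim, num_tables=8, bits_per_key=12, seed=0):
        rng = random.Random(seed)
        # Assumption: random-hyperplane hashing stands in for the
        # content-sensitive hash functions, which the abstract does not define.
        # Each table gets its own independent set of hyperplanes.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits_per_key)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.vectors = {}

    def _key(self, planes, vec):
        # Each bit records which side of one hyperplane the vector lies on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in planes
        )

    def add(self, word_id, vec):
        # Store the vector and register its id in one bucket per table.
        self.vectors[word_id] = vec
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, vec)].append(word_id)

    def query(self, vec, top_k=5):
        # Collect candidates from the matching bucket of every table,
        # then rank only those candidates by exact squared distance.
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._key(planes, vec), []))

        def sq_dist(word_id):
            return sum((a - b) ** 2 for a, b in zip(self.vectors[word_id], vec))

        return sorted(candidates, key=sq_dist)[:top_k]
```

A query only compares against the words that share a bucket with it in at least one table, rather than against the whole collection, which is where the speed-up over exhaustive matching comes from. Using several tables raises the chance that a truly similar word lands in a shared bucket at least once.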


Year of completion: 2008
Advisor: C. V. Jawahar & R. Manmatha

Related Publications

  • Anand Kumar, C.V. Jawahar & R. Manmatha - Efficient Search in Document Image Collections, Proceedings of 8th Asian Conference on Computer Vision (ACCV'07), Part I, LNCS 4843, pp. 586-595, Tokyo, Japan, 18-22 November, 2007. [PDF]

  • C.V. Jawahar and Anand Kumar - Content-level Annotation of Large Collection of Printed Document Images, Proc. of 9th International Conference on Document Analysis and Recognition, Brazil, 23-26 September, 2007. [PDF]

  • Anand Kumar, A. Balasubramanian, Anoop M. Namboodiri and C.V. Jawahar - Model-Based Annotation of Online Handwritten Datasets, International Workshop on Frontiers in Handwriting Recognition (IWFHR'06), October 23-26, 2006, La Baule, Centre de Congrès Atlantia, France. [PDF]