Human Pose Retrieval for Image and Video collections

Nataraj Jammalamadaka (homepage)


With the overwhelming amount of visual data on the internet, a search capability for this data is clearly needed. While several commercial systems have been built to retrieve images and videos using metadata, many images and videos do not have detailed descriptions. This problem has been addressed by content-based image retrieval (CBIR) systems, which retrieve images by their visual content. In this thesis, we demonstrate that images and videos can be retrieved using the pose of the humans present in them. Here, pose is the 2D/3D spatial arrangement of anatomical body parts such as arms and legs. Pose is an important piece of information that conveys the action, gesture and mood of the person. Retrieving humans using pose has commercial implications in domains such as dance (the query being a dance pose) and sports (the query being a shot). We propose three pose representations that can be used for retrieval. Using one of these representations, we build a real-time pose retrieval system over a million movie frames.

Our first pose representation is based on the output of human pose estimation (HPE) algorithms [5, 26, 80, 103], which estimate the pose of the person given an image. Unfortunately, these algorithms are not entirely reliable and often make mistakes. We solve this problem by proposing an evaluator that predicts whether an HPE algorithm has succeeded. We describe the stages required to learn and test an evaluator, including the use of an annotated ground truth dataset for training and testing the evaluator, and the development of auxiliary features that have not been used by the HPE algorithm but can be learnt by the evaluator to predict whether the output is correct. To demonstrate our ideas, we build an evaluator for each of four recently developed HPE algorithms using their publicly available implementations: Andriluka et al. [5], Eichner and Ferrari [26], Sapp et al. [80], and Yang and Ramanan [103]. We demonstrate that in each case our evaluator is able to predict whether the algorithm has correctly estimated the pose. In this context we also provide a new dataset of annotated stickmen. Further, we propose innovative ways in which a pose evaluator can be used. Specifically, we show how a pose evaluator can be used to filter incorrect pose estimates, to fuse outputs from different HPE algorithms, and to improve a pose search application.

Our second pose representation is inspired by poselets [13], which are body part detectors. First, we introduce deep poselets for pose-sensitive detection of various body parts, built on convolutional neural network (CNN) features. These deep poselets significantly outperform previous instantiations of Berkeley poselets [13]. Second, using these detector responses, we construct a pose representation that is suitable for pose search, and show that its pose retrieval performance is on par with state-of-the-art pose representations. The compared methods include bag of visual words [85], Berkeley poselets [13] and human pose estimation algorithms [103, 18, 68]. All the methods are quantitatively evaluated on a large dataset of images built from a number of standard benchmarks together with frames from Hollywood movies.

Our third pose representation is based on embedding an image into a lower dimensional pose-sensitive manifold. Here we make the following contributions: (a) we design an optimized neural network which maps the input image to a very low dimensional space where similar poses are close together and dissimilar poses are farther apart, and (b) we show that a pose retrieval system using this low dimensional representation is on par with the deep poselet representation.

Finally, we describe a method for real-time video retrieval where the task is to match the 2D human pose of a query. A user can form a query by (i) interactively controlling a stickman on a web-based GUI, (ii) uploading an image of the desired pose, or (iii) using a Kinect and acting out the query themselves. The method is scalable and is applied to a dataset of 22 movies totaling more than three million frames. The real-time performance is achieved by searching for approximate nearest neighbors to the query using a random forest of K-D trees. Apart from the query modalities, we introduce two other areas of novelty. First, we show that pose retrieval can proceed using a low dimensional representation. Second, we show that the precision of the results can be improved substantially by combining the outputs of independent human pose estimation algorithms. The performance of the system is assessed quantitatively over a range of pose queries.
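
To make the nearest-neighbour step concrete, here is a minimal sketch of K-D tree search over low dimensional pose descriptors. It is an illustration under stated assumptions, not the system described above: it builds a single exact K-D tree rather than a randomized forest of approximate trees, and the names `Node`, `build` and `nearest`, as well as the toy descriptors, are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Node:
    index: int               # position of the stored descriptor in the input list
    axis: int                # dimension used to split at this node
    left: Optional["Node"]
    right: Optional["Node"]


def build(points: List[Sequence[float]], indices=None, depth: int = 0) -> Optional[Node]:
    """Recursively split the descriptors at the median of one axis per level."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return Node(
        index=indices[mid],
        axis=axis,
        left=build(points, indices[:mid], depth + 1),
        right=build(points, indices[mid + 1:], depth + 1),
    )


def nearest(node, points, query, best: Optional[Tuple[float, int]] = None):
    """Return (distance, index) of the stored descriptor closest to the query."""
    if node is None:
        return best
    d = math.dist(points[node.index], query)
    if best is None or d < best[0]:
        best = (d, node.index)
    diff = query[node.axis] - points[node.index][node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, points, query, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, points, query, best)
    return best


# Hypothetical 2-D pose descriptors, one per movie frame.
frames = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
tree = build(frames)
distance, frame_id = nearest(tree, frames, (1.0, 1.1))
```

An approximate variant would cap the number of far-side visits and query several trees built with randomized split axes, trading a little accuracy for speed.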


Year of completion:  July 2017
 Advisor : C V Jawahar and Andrew Zisserman

Related Publications

  • Nataraj Jammalamadaka, Andrew Zisserman and C.V. Jawahar - Human pose search using deep networks Image and Vision Computing 59 (2017): 31-43. [PDF]

  • Nataraj Jammalamadaka, Andrew Zisserman, C.V. Jawahar - Human Pose Search using Deep Poselets Proceedings of the 11th IEEE International Conference on Automatic Face and Gesture Recognition, 04-08 May 2015, Ljubljana, Slovenia. [PDF]

  • Nataraj Jammalamadaka, Andrew Zisserman and C.V. Jawahar - Human pose search using deep poselets Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on. Vol. 1. IEEE, 2015. [PDF]

  • Digvijay Singh, Ayush Minocha, Nataraj Jammalamadaka and C. V. Jawahar - Real-time Face Detection, Pose Estimation and Landmark Localization Proceedings of the IEEE National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, 18-21 Dec. 2013, Jodhpur, India. [PDF]

  • Nataraj Jammalamadaka, Ayush Minocha, Digvijay Singh and C V Jawahar - Parsing Clothes in Unrestricted Images Proceedings of the 24th British Machine Vision Conference, 09-13 Sep. 2013, Bristol, UK. [PDF]

  • Digvijay Singh, Ayush Minocha, Nataraj Jammalamadaka and C.V. Jawahar - Near real-time face parsing Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2013 Fourth National Conference on. IEEE, 2013. [PDF]

  • Nataraj Jammalamadaka, Andrew Zisserman, Marcin Eichner, Vittorio Ferrari and C V Jawahar - Video Retrieval by Mimicking Poses Proceedings of ACM International Conference on Multimedia Retrieval, 5-8 June 2012, Article No.34, ISBN 978-1-4503-1329-2, Hong Kong. [PDF]

  • Nataraj Jammalamadaka, Andrew Zisserman, Marcin Eichner, Vittorio Ferrari and C V Jawahar - Has My Algorithm Succeeded? An Evaluator for Human Pose Estimators Proceedings of 12th European Conference on Computer Vision, 7-13 Oct. 2012, Print ISBN 978-3-642-33711-6, Vol. ECCV 2012, Part-III, LNCS 7574, pp. 114-128, Firenze, Italy. [PDF]

  • Nataraj Jammalamadaka, Vikram Pudi and C. V. Jawahar - Efficient Search with Changing Similarity Measures on Large Multimedia Datasets Proc. of The International Multimedia Modelling Conference (MMM2007), LNCS 4352, Part-II, pp. 206-215, 2007. [PDF]

  • C. V. Jawahar, Balakrishna Chennupati, Balamanohar Paluri and Nataraj Jammalamadaka - Video Retrieval Based on Textual Queries, Proceedings of the Thirteenth International Conference on Advanced Computing and Communications, Coimbatore, December 2005. [PDF]



Understanding Semantic Association Between Images and Text

Yashaswi Verma (homepage)


Over the last two decades, vast amounts of digital data have been created, a large portion of which is visual data comprising images and videos. To deal with this large amount of data, it has become necessary to build systems that can help humans efficiently organize, index and retrieve from such data. While modern search engines are quite efficient in text-based indexing and retrieval, their “visual cortex” is still evolving. One way to address this is to enable technologies similar to those used for textual data for archiving and retrieving visual data. Also, in practice people find it more convenient to interact with visual data using text as an interface rather than a visual interface. However, this in turn would require describing images and videos using natural text. Since it is practically infeasible to annotate these large visual collections manually, we need to develop automatic techniques for the same. In this thesis, we present our attempts towards modelling and learning semantic associations between images and different forms of text such as labels, phrases and captions. Our first problem is that of tagging images with discrete semantic labels, also called image annotation. To address this, we describe two approaches. In the first approach, we propose a novel extension of the conventional weighted k-nearest neighbour algorithm that tries to address the issues of class imbalance and incomplete labelling that are quite common in the image annotation task. In the second approach, we first analyze why the conventional SVM algorithm, despite its strong theoretical properties, does not achieve results as good as nearest neighbour based methods on the image annotation task. Based on this analysis, we propose an extension of SVM that introduces a tolerance parameter into the hinge loss. This additional parameter helps in making binary models tolerant to practical challenges such as incomplete labelling, label ambiguity and structural overlap.
Next, we target the problem of image captioning and caption-based image retrieval. Rather than using either individual words or entire captions, we propose to make use of textual phrases for both these tasks; e.g., “aeroplane at airport”, “person riding”, etc. These phrases are automatically extracted from available data based on linguistic constraints. To generate a caption for a new image, first the phrases present in the neighbouring (annotated) images are ranked based on visual similarity. These are then integrated into a pre-defined template for caption generation. During caption-based image retrieval, a given query is first decomposed into such phrases, and images are then ranked based on their joint relevance with these phrases. Lastly, we address the problem of cross-modal image-text retrieval. For this, we first present a novel Structural SVM based formulation for this task. We show that our formulation is generic, and can be used with a variety of loss functions as well as feature vector based representations. Next, we try to model higher-level semantics in multi/cross-modal data based on shared category information. For this, we first propose the notion of cross-specificity, and then present a generic framework based on cross-specificity that can be used as a wrapper function over several cross-modal matching approaches, and helps in boosting their performance on the cross-modal retrieval task. We evaluate the proposed methods on a number of popular and relevant datasets. On the image annotation task, we achieve near state-of-the-art results under multiple evaluation metrics. On the image captioning task, we achieve superior results compared to conventional methods that are mostly based on visual cues and corpus statistics. On the cross-modal retrieval task, both our approaches provide compelling improvements over baseline cross-modal retrieval techniques.
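
As a point of reference for the annotation approach, the following is a minimal sketch of the conventional weighted k-nearest neighbour baseline that the thesis extends. It does not include the proposed handling of class imbalance or incomplete labelling. The function name `annotate`, the exponential distance weighting and the toy data are assumptions made for illustration.

```python
import math
from collections import defaultdict


def annotate(query, train, k=3, top=2):
    """Rank labels for a query image from its k nearest annotated neighbours.

    train is a list of (feature_vector, set_of_labels) pairs. Each neighbour
    votes for its labels with a weight that decays with its distance.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    scores = defaultdict(float)
    for features, labels in neighbours:
        weight = math.exp(-math.dist(features, query))  # assumed weighting
        for label in labels:
            scores[label] += weight
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda label: (-scores[label], label))
    return ranked[:top]


# Hypothetical 2-D image features with their label sets.
train = [
    ((0.0, 0.0), {"sky", "sea"}),
    ((0.0, 1.0), {"sky"}),
    ((5.0, 5.0), {"car"}),
    ((6.0, 5.0), {"car", "road"}),
]
predicted = annotate((0.0, 0.2), train, k=2, top=2)
```

In this baseline a frequent label can dominate simply because it appears in more neighbours, which is one way to see why class imbalance matters for this task.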


Year of completion:  July 2017
 Advisor : Prof. C.V. Jawahar

Related Publications

  • Yashaswi Verma and C.V. Jawahar - A support vector approach for cross-modal search of images and texts Computer Vision and Image Understanding 154 (2017): 48-63. [PDF]

  • Yashaswi Verma, C.V. Jawahar - A Robust Distance with Correlated Metric Learning for Multi-Instance Multi-Label Data Proceedings of the ACM Multimedia, 2016, Amsterdam, The Netherlands. [PDF]

  • Yashaswi Verma, C.V. Jawahar - Image Annotation by Propagating Labels from Semantic Neighbourhoods International Journal of Computer Vision (IJCV), 2016. [PDF]

  • Yashaswi Verma, C. V. Jawahar - A Probabilistic Approach for Image Retrieval Using Descriptive Textual Queries Proceedings of the ACM Multimedia, 26-30 Oct 2015, Brisbane, Australia. [PDF]

  • Yashaswi Verma, C.V. Jawahar - Exploring Locally Rigid Discriminative Patches for Learning Relative Attributes Proceedings of the 26th British Machine Vision Conference, 07-10 Sep 2015, Swansea, UK. [PDF]

  • Yashaswi Verma and C.V. Jawahar - Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval Proceedings of British Machine Vision Conference, 01-05 Sep 2014, Nottingham, UK. [PDF]

  • Ramachandruni N. Sandeep, Yashaswi Verma and C.V. Jawahar - Relative Parts: Distinctive Parts for Learning Relative Attributes Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 23-28 June 2014, Columbus, Ohio, USA. [PDF]

  • Sandeep, Ramachandruni N, Yashaswi Verma and C.V. Jawahar - Relative parts: Distinctive parts for learning relative attributes Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. [PDF]

  • Yashaswi Verma and C V Jawahar - Exploring SVM for Image Annotation in Presence of Confusing Labels Proceedings of the 24th British Machine Vision Conference, 09-13 Sep. 2013, Bristol, UK. [PDF]

  • Yashaswi Verma, Ankush Gupta, Prashanth Mannem and C.V. Jawahar - Generating image descriptions using semantic similarities in the output space Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013. [PDF]

  • Yashaswi Verma and C V Jawahar - Neti Neti: In Search of Deity Proceedings of the 8th Indian Conference on Vision, Graphics and Image Processing, 16-19 Dec. 2012, Bombay, India. [PDF]

  • Yashaswi Verma and C V Jawahar - Image Annotation using Metric Learning in Semantic Neighbourhoods Proceedings of 12th European Conference on Computer Vision, 7-13 Oct. 2012, Print ISBN 978-3-642-33711-6, Vol. ECCV 2012, Part-III, LNCS 7574, pp. 114-128, Firenze, Italy. [PDF]

  • Ankush Gupta, Yashaswi Verma and C.V. Jawahar - Choosing Linguistics over Vision to Describe Images AAAI. 2012. [PDF]



Aligning Textual & Visual Data : Towards Scalable Multimedia Retrieval

Pramod Sankar Kompalli (homepage)

The search and retrieval of relevant images and videos from large repositories of multimedia is acknowledged as one of the hard challenges of computer science. With existing pattern recognition solutions, one cannot obtain a detailed, semantic description for a given multimedia document. Several limitations exist in feature extraction and classification schemes, along with the incompatibility of representations across domains. The situation will most likely remain so for several years to come.

Towards addressing this challenge, we observe that several multimedia collections come with parallel information that is: i) semantic in nature, ii) weakly aligned with the multimedia and iii) freely available. For example, the content of a news broadcast is also available in the form of newspaper articles. If a correspondence could be obtained between the videos and such parallel information, one could access one medium using the other, which opens up immense possibilities for information extraction and retrieval. However, it is challenging to find the mapping between the two sources of data, due to the unknown semantic hierarchy within each medium and the difficulty of matching information across the different modalities. In this thesis, we propose novel algorithms that address these challenges.

Different <Multimedia, Parallel Information> pairs require different alignment techniques, depending on the granularity at which entities could be matched across them. We choose four pairs of multimedia, along with parallel information obtained in the text domain, such that the data is both challenging and available on a large scale. Specifically, our multimedia consists of movies, broadcast sports videos and document images, with the parallel text coming from scripts, commentaries and language resources. As we proceed from one pair to the next, we discover an increasing complexity of the problem, due to a relaxation of the temporal binding between the parallel information and the multimedia. By addressing this challenge, we build solutions that perform increasingly fine-grained alignment between multimedia and text data.

The framework that we propose begins with an assumption that we could segment the multimedia and the text into meaningful entities that could correspond to each other. The problem then is to identify features and learn to match a text-entity to a multimedia-segment (and vice versa). Such a matching scheme could be refined using additional constraints, such as temporal ordering and occurrence statistics. We build algorithms that could align across i) movies and scripts, where sentences from the script are aligned to their respective video-shots and ii) document images with lexicon, where the words of the dictionary are mapped to clusters of word-images extracted from the scanned books.

Further, we relax the constraint in the above assumption, such that the segmentation of the multimedia is not available a priori. The problem now is to perform a joint inference of segmentation and annotation. We address this problem by building an over-complete representation of the multimedia. A large number of putative segmentations are matched against the information extracted from the parallel text, with the joint inference achieved through dynamic programming. This approach was successfully demonstrated on i) cricket videos, which were segmented and annotated with information from online commentaries and ii) word-images, where sub-words called Character N-Grams are accurately segmented and labelled using the text-equivalent of the word.
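
The joint inference of segmentation and annotation can be illustrated with a small dynamic program. This is a simplified sketch rather than the thesis's algorithm: it assumes the text entities appear in a known order, that each entity covers one contiguous run of frames, and that a matching score between every frame and every entity is already available. The function name `segment` and the scores are hypothetical.

```python
def segment(score):
    """Assign each frame to one text entity, keeping entities in order.

    score[t][k] is the match between frame t and text entity k. The result
    is a list with one entity index per frame that maximises the total
    score, with every entity covering a contiguous, non-empty run.
    """
    T, K = len(score), len(score[0])
    NEG = float("-inf")
    best = [[NEG] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]  # 1 if frame t starts entity k, else 0
    best[0][0] = score[0][0]
    for t in range(1, T):
        for k in range(K):
            stay = best[t - 1][k]
            advance = best[t - 1][k - 1] if k > 0 else NEG
            if advance > stay:
                best[t][k], back[t][k] = advance + score[t][k], 1
            else:
                best[t][k], back[t][k] = stay + score[t][k], 0
    # Trace back from the last frame, which must belong to the last entity.
    labels, k = [], K - 1
    for t in range(T - 1, -1, -1):
        labels.append(k)
        k -= back[t][k]
    return labels[::-1]


# Four frames, two commentary events; rows are frames, columns are events.
scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
labels = segment(scores)
```

The boundary between segments falls where the label changes, so segmentation and annotation are recovered together from a single pass.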

As a consequence of the approaches proposed in this thesis, we were able to demonstrate text-based retrieval systems over large multimedia collections. The semantic level at which we can retrieve information was made possible by the annotation with parallel text information. Our work also results in a large set of labeled multimedia, which could be used by sophisticated machine learning algorithms for learning new concepts.


Year of completion:  May 2015
 Advisor : Prof. C. V. Jawahar

Related Publications

  • Pramod Sankar K, R Manmatha and C V Jawahar - Large Scale Document Image Retrieval by Automatic Word Annotation International Journal on Document Analysis and Recognition (IJDAR):Volume 17, Issue 1(2014), Page 1-17. [PDF]

  • Udit Roy, Naveen Sankaran, Pramod Sankar K. and C V Jawahar - Character N-Gram Spotting on Handwritten Documents using Weakly-Supervised Segmentation Proceedings of the 12th International Conference on Document Analysis and Recognition, 25-28 Aug. 2013, Washington DC, USA. [PDF]

  • Shrey Dutta, Naveen Sankaran, Pramod Sankar K and C V Jawahar - Robust Recognition of Degraded Documents Using Character N-Grams Proceedings of 10th IAPR International Workshop on Document Analysis Systems, 27-29 Mar. 2012, ISBN 978-1-4673-0868-7, pp. 130-134, Queensland, Australia. [PDF]

  • Sudha Praveen M., Pramod Sankar K., and C.V. Jawahar - Character n-Gram Spotting in Document Images Proceedings of 11th International Conference on Document Analysis and Recognition (ICDAR 2011),18-21 September, 2011, Beijing, China. [PDF]

  • Pramod Kompalli, C.V.Jawahar and R. Manmatha - Nearest Neighbor based Collection OCR Proceedings of Ninth IAPR International Workshop on Document Analysis Systems (DAS'10), pp. 207-214, 9-11 June, 2010, Boston, MA, USA. [PDF]

  • Pramod Sankar K, C. V. Jawahar and Andrew Zisserman - Subtitle-free Movie to Script Alignment Proceedings of British Machine Vision Conference (BMVC 09), 7-10 September, 2009, London, UK. [PDF]

  • Pramod Sankar K and C.V. Jawahar - Probabilistic Reverse Annotation for Large Scale Image Retrieval Proc of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, Minnesota, 18-23 June, 2007. [PDF]

  • Pramod Sankar K. and C.V. Jawahar - Enabling Search over Large Collections of Telugu Document Images-An Automatic Annotation Based Approach, 5th Indian Conference on Computer Vision, Graphics and Image Processing, Madurai, India, LNCS 4338 pp.837-848, 2006. [PDF]

  • Pramod Sankar K., Saurabh Pandey and C. V. Jawahar - Text Driven Temporal Segmentation of Cricket Videos, 5th Indian Conference on Computer Vision, Graphics and Image Processing, Madurai, India, LNCS 4338 pp.433-444, 2006. [PDF]

  • Pramod Sankar K, Vamshi Ambati, Lakshmi Hari and C. V. Jawahar - Digitizing A Million Books: Challenges for Document Analysis, Proceedings of Seventh IAPR Workshop on Document Analysis Systems, 2006 (LNCS 3872), pp 425-436. [PDF]






Understanding Text in Scene Images

Anand Mishra (homepage)


With the rapid growth of camera-based mobile devices, applications that answer questions such as, “What does this sign say?” are becoming increasingly popular. This is related to the problem of optical character recognition (OCR) where the task is to recognize text occurring in images. The OCR problem has a long history in the computer vision community. However, the success of OCR systems is largely restricted to text from scanned documents. Scene text, such as text occurring in images captured with a mobile device, exhibits a large variability in appearance. Recognizing scene text has been challenging, even for the state-of-the-art OCR methods. Many scene understanding methods recognize objects and regions like roads, trees, sky in the image successfully, but tend to ignore the text on the sign board. Towards filling this gap, we devise robust techniques for scene text recognition and retrieval in this thesis.

This thesis presents three approaches to address scene text recognition problems. First, we propose a robust text segmentation (binarization) technique, and use it to improve the recognition performance. We pose the binarization problem as a pixel labeling problem and define a corresponding novel energy function which is minimized to obtain a binary segmentation image. This method makes it possible to use standard OCR systems for recognizing scene text. Second, we present an energy minimization framework that exploits both bottom-up and top-down cues for recognizing words extracted from street images. The bottom-up cues are derived from detections of individual text characters in an image. We build a conditional random field model on these detections to jointly model the strength of the detections and the interactions between them. These interactions are top-down cues obtained from a lexicon-based prior, i.e., language statistics. The optimal word represented by the text image is obtained by minimizing the energy function corresponding to the random field model. The proposed method significantly improves the scene text recognition performance. Third, we present a holistic word recognition framework, which leverages the scene text image and synthetic images generated from lexicon words. We then recognize the text in an image by matching the scene and synthetic image features with our novel weighted dynamic time warping approach. This approach does not require any language statistics or language-specific character-level annotations.
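
The holistic matching step can be illustrated with plain dynamic time warping (DTW). This sketch is not the weighted DTW proposed in the thesis: all alignment steps cost the same here, each image column is reduced to a single number, and the names `dtw` and `recognise` and the feature values are assumptions for illustration.

```python
def dtw(a, b):
    """Return the DTW distance between two sequences of column features."""
    INF = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b: a advances, b repeats
                cost[i][j - 1],      # stretch a: b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognise(scene, lexicon):
    """Pick the lexicon word whose synthetic features best match the scene."""
    return min(lexicon, key=lambda word: dtw(scene, lexicon[word]))


# Hypothetical per-column features of synthetic renderings of lexicon words.
lexicon = {"cat": [1.0, 3.0, 2.0], "dog": [5.0, 5.0, 1.0]}
scene = [1.0, 1.0, 3.0, 3.0, 2.0]  # a horizontally stretched "cat"
word = recognise(scene, lexicon)
```

Because the warping absorbs differences in width, the scene image and the synthetic image need not have the same number of columns.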

Finally, we address the problem of image retrieval using textual cues, and demonstrate large-scale text-to-image retrieval. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that this approach, despite being based on state-of-the-art methods, is insufficient, and propose an approach that does not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database.

We evaluate our proposed methods extensively on a number of scene text benchmark datasets, namely street view text, ICDAR 2003, 2011 and 2013, and IIIT 5K-word, a new dataset that we introduce, and show better performance than all the comparable methods. The retrieval performance is evaluated on public scene text datasets as well as three large datasets that we introduce, namely IIIT scene text retrieval, Sports-10K and TV series-1M.


Year of completion:  December 2016
 Advisor : Prof. C.V. Jawahar & Dr. Karteek Alahari

Related Publications

  • Anand Mishra, Karteek Alahari and C. V. Jawahar - Enhancing energy minimization framework for scene text recognition with top-down cues - Computer Vision and Image Understanding (CVIU 2016), volume 145, pages 30–42, 2016. [PDF]

  • Anand Mishra, Karteek Alahari and C V Jawahar - Image Retrieval using Textual Cues Proceedings of International Conference on Computer Vision, 1-8 Dec. 2013, Sydney, Australia. [PDF] [Abstract] [Project page] [bibtex]

  • Vibhor Goel, Anand Mishra, Karteek Alahari, C V Jawahar - Whole is Greater than Sum of Parts: Recognizing Scene Text Words Proceedings of the 12th International Conference on Document Analysis and Recognition, 25-28 Aug. 2013, Washington DC, USA. [PDF] [Abstract] [bibtex]

  • Anand Mishra, Karteek Alahari and C V Jawahar - Scene Text Recognition using Higher Order Language Priors Proceedings of British Machine Vision Conference, 3-7 Sep. 2012, Guildford, UK. [PDF] [Abstract] [Slides] [bibtex]

  • Anand Mishra, Karteek Alahari and C V Jawahar - Top-down and Bottom-up Cues for Scene Text Recognition Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 16-21 June 2012, pp. 2287-2294, Providence RI, USA. [PDF] [Abstract] [Poster] [bibtex]

  • Anand Mishra, Karteek Alahari and C.V. Jawahar - An MRF Model for Binarization of Natural Scene Text Proceedings of 11th International Conference on Document Analysis and Recognition (ICDAR 2011),18-21 September, 2011, Beijing, China. [PDF] [Abstract] [Slides] [bibtex]




Recognition and Retrieval from Document Image Collections

Million Meshesha

The present growth of digitization of books and manuscripts demands an immediate solution to access them electronically. This will enable the archived valuable materials to be searchable and usable by users in order to achieve their objectives. This requires research in the area of document image understanding, specifically in document image recognition as well as document image retrieval. In the last three decades, significant advancement has been made in the recognition of documents written in Latin-based and some Oriental scripts. There are many excellent attempts at building robust document analysis systems in industry, academia and research labs. Intelligent recognition systems are commercially available for certain scripts. However, there is only limited research effort on the recognition of indigenous scripts of African and Indian languages. In addition, the diversity of archived printed documents poses many challenges to document analysis and understanding. Hence in this work, we explore novel approaches for understanding and accessing the content of document image collections that vary in quality and printing.

In Africa around 2,500 languages are spoken. Some of these languages have their own indigenous scripts, in which there is a bulk of printed documents available in various institutions. Digitization of these documents enables us to harness already available language technologies for local information needs and developments. We present an OCR system for converting digitized documents in the Amharic language. Amharic is the official language of Ethiopia. An extensive literature survey reveals that this is the first attempt that reports the challenges in the recognition of indigenous African scripts and a possible solution for Amharic script. Research in the recognition of Amharic script faces major challenges due to (i) the use of a large number of characters in the writing and (ii) the existence of a large set of visually similar characters. Here we extract a set of optimal discriminant features to train the classifier. Recognition results are presented on real-life degraded documents such as books, magazines and newspapers to demonstrate the performance of the recognizer.

Present OCRs are typically designed to work on a single page at a time. We argue that the recognition scheme for a collection (like a book) could be considerably different from that designed for isolated pages. The motivation here is therefore to exploit the entire available information during the recognition process, which has not been used effectively before, to enhance the performance of the recognizer. To this end, we propose an architecture and learning algorithms for a self-adaptable OCR framework for the recognition of document image collections. This approach enables the recognizer to learn incrementally and adapt to document image collections for performance improvement. We employ learning procedures to capture the relevant information available online, and feed it back to update the knowledge of the system. Experimental results show the effectiveness of our design for improving the performance of the recognizer on the fly, thereby adapting to a specific collection.

For indigenous scripts of African and Indian languages there is no robust OCR available. Designing such a system is also a long-term process for accessing the archived document images. Hence we explore the application of a word spotting approach for retrieval of printed document images without explicit recognition. To this end, we propose an effective word image matching scheme that achieves high performance in the presence of script variability, printing variations, degradations and word-form variations. A novel partial matching algorithm is designed for morphological matching of word-form variants in a language. We employ a feature extraction scheme that extracts local features by scanning vertical strips of the word image. These features are then combined based on their discriminatory potential. We present detailed experimental results of the proposed approach on English, Amharic and Hindi documents.

Searching and retrieval from document image collections is challenging because of scalability issues and computational time. We design an efficient indexing scheme to speed up searching for relevant document images. We identify the word set by clustering word images into different groups based on their similarities. Each of these clusters corresponds to variations in printing, morphology, and quality. This is achieved by mapping IR principles (that are widely used in text processing) for relevance ranking. Once we cluster the list of index terms that define the content of the document, they are indexed using an inverted data structure. This data structure also provides scope for incremental clustering in a dynamic environment. The indexing scheme enables effective search and retrieval in the image domain that is comparable with text search engines. We demonstrate the application of the indexing process with the help of experimental results.
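
The inverted data structure can be sketched as a map from each word-image cluster to the documents that contain it. This is a minimal illustration, assuming that word images have already been clustered and that each cluster has an integer id. The names `build_index` and `search`, the ranking by raw occurrence count, and the page data are hypothetical and stand in for the relevance ranking described above.

```python
from collections import defaultdict


def build_index(documents):
    """Map each cluster id to {doc_id: number of occurrences}.

    documents is {doc_id: [cluster id of each word image on the page]}.
    """
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, clusters in documents.items():
        for cluster_id in clusters:
            index[cluster_id][doc_id] += 1
    return index


def search(index, cluster_id):
    """Return documents containing the cluster, most occurrences first."""
    postings = index.get(cluster_id, {})
    return sorted(postings, key=lambda doc_id: (-postings[doc_id], doc_id))


# Hypothetical pages, each listed as the cluster ids of its word images.
pages = {"page1": [3, 3, 7], "page2": [3, 9], "page3": [7]}
index = build_index(pages)
hits = search(index, 3)
```

A query word image would first be matched to its nearest cluster, after which the lookup costs no more than a text search. Adding a new page only appends to the postings of the clusters it contains.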

The proliferation of document images at large scale demands a solution that is effective for accessing document images at the collection level. In this work, we investigate machine learning and information retrieval approaches that suit this demand. Machine learning schemes are effectively used to redesign the existing approach to OCR development. The recognizer is enabled to learn from its experience and improve its performance over time on document image collections. Information retrieval (IR) principles are mapped to construct an indexing scheme for efficient content-based search and retrieval from document image collections. The existing matching scheme is redesigned to undertake morphological matching in the image domain. Performance evaluation using datasets from different languages shows the effectiveness of our approaches. Extensions that need further consideration are recommended to advance the state of the art in document image recognition and document image retrieval.


Year of completion:  2008
 Advisor : Prof. C.V. Jawahar

Related Publications

  • C. V. Jawahar, A. Balasubramanian, Million Meshesha and Anoop M. Namboodiri - Retrieval of Online Handwriting by Synthesis and Matching Proceeding of the International Journal on Pattern Recognition, 42(7), 1445-1457, 2009. [PDF]

  • Million Meshesha and C. V. Jawahar - Matching word image for content-based retrieval from printed document images Proceeding of the International Journal on Document Analysis and Recognition, IJDAR 11(1), 29-38, 2008. [PDF]

  • Million Meshesha and C.V. Jawahar - Self Adaptable Recognizer for Document Image Collections Proceeding of the Second International Conference on Pattern Recognition Machine Learning (PReMI'2007), pp. 560-567, 18-22 Dec, 2007, Kolkata. [PDF]

  • Million Meshesha and C. V. Jawahar - Indigenous Scripts of African Languages Proceeding of the African Journal of Indigenous Knowledge Systems, Vol. 6, Issue 2, pp. 132-142, 2007. [PDF]

  • Million Meshesha and C. V. Jawahar - Optical Character Recognition of Amharic Documents Proceeding of the African Journal of Information and Communication Technology ISSN 1449-2679, Volume 3, Number 3, December 2007. [PDF]

  • Pramod Sankar K., Million Meshesha and C. V. Jawahar - Annotation of Images and videos based on Textual Content without OCR, Workshop on Computation Intensive Methods for Computer Vision (in conjunction with ECCV 2006), 2006. [PDF]

  • A. Balasubramanian, Million Meshesha and C. V. Jawahar - Retrieval from Document Image Collections, Proceedings of Seventh IAPR Workshop on Document Analysis Systems, 2006 (LNCS 3872), pp 1-12. [PDF]

  • Sachin Rawat, K. S. Sesh Kumar, Million Meshesha, Indineel Deb Sikdar, A. Balasubramanian and C. V. Jawahar - A Semi-Automatic Adaptive OCR for Digital Libraries, Proceedings of Seventh IAPR Workshop on Document Analysis Systems, 2006 (LNCS 3872), pp 13-24. [PDF]

  • Million Meshesha and C. V. Jawahar  - Recognition of Printed Amharic Documents, Proceedings of Eighth International Conference on Document Analysis and Recognition(ICDAR), Seoul, Korea 2005, Vol 1, pp 784-788. [PDF]

  • C. V. Jawahar, Million Meshesha and A. Balasubramanian - Searching in Document Images, Proceedings of the Indian Conference on Vision, Graphics and Image Processing (ICVGIP), Dec. 2004, Calcutta, India, pp. 622-627. [PDF]