Aligning Textual & Visual Data : Towards Scalable Multimedia Retrieval

Pramod Sankar Kompalli (homepage)

The search and retrieval of relevant images and videos from large repositories of multimedia, is acknowledged as one of the hard challenges of computer science. With existing pattern recognition solutions, one cannot obtain detailed, semantic description for a given multimedia document. Several limitations exist in feature extraction, classification schemes, along with the incompatibility of representations across domains. The situation will most likely remain so, for several years to come.

Towards addressing this challenge, we observe that several multimedia collections contain similar parallel information that are: i) semantic in nature, ii) weakly aligned with the multimedia and iii) available freely. For example, the content of a news broadcast is also available in the form of newspaper articles. If a correspondence could be obtained between the videos and such parallel information, one could access one medium using the other, which opens up immense possibilities for information extraction and retrieval. However, it is challenging to find the mapping between the two sources of data due to the unknown semantic hierarchy within each medium and the difficulty to match information across the different modalities. In this thesis, we propose novel algorithms that address these challenges.

Different &lt Multimedia, Parallel Information &gt pairs, require different alignment techniques, depending on the granularity at which entities could be matched across them. We choose four pairs of multimedia, along with parallel information obtained in the text domain, such that the data is both challenging and available on a large scale. Specifically, our multimedia consists of movies, broadcast sports videos and document images, with the parallel text coming from scripts, commentaries and language resources. As we proceed from one pair to the next, we discover an increasing complexity of the problem, due to a relaxation of the temporal binding between the parallel information and the multimedia. By addressing this challenge, we build solutions that perform increasingly fine-grained alignment between multimedia and text data.

The framework that we propose begins with an assumption that we could segment the multimedia and the text into meaningful entities that could correspond to each other. The problem then, is to identify features and learn to match a text-entity to a multimedia-segment (and vice versa). Such a matching scheme could be refined using additional constraints, such as temporal ordering and occurrence statistics. We build algorithms that could align across i) movies and scripts, where sentences from the script are aligned to their respective video-shots and ii) document images with lexicon, where the words of the dictionary are mapped to clusters of word-images extracted from the scanned books.

Further, we relax the constraint in the above assumption, such that the segmentation of the multimedia is not available apriori<\i>. The problem now, is to perform a joint inference of segmentation and annotation. We address this problem by building an over-complete representation of the multimedia. A large number of putative segmentations are matched against the information extracted from the parallel text, with the joint inference achieved through dynamic programming. This approach was successfully demonstrated on i) Cricket videos, which were segmented and annotated with information from online commentaries and ii) word-images, where sub-words called Character N-Grams, are accurately segmented and labelled using the text-equivalent of the word.

As a consequence of the approaches proposed in this thesis, we were able to demonstrate text-based retrieval systems over large multimedia collections. The semantic level at which we can retrieve information was made possible by the annotation with parallel text information. Our work also results in a large set of labeled multimedia, which could be used by sophisticated machine learning algorithms for learning new concepts. (more...)

Year of completion:	May 2015
Advisor :	Prof. C. V. Jawahar

Related Publications

Pramod Sankar K, R Manmatha and C V Jawahar - Large Scale Document Image Retrieval by Automatic Word Annotation International Journal on Document Analysis and Recognition (IJDAR):Volume 17, Issue 1(2014), Page 1-17. [PDF]
Udit Roy, Naveen Sankaran, Pramod Sankar K. and C V Jawahar - Character N-Gram Spotting on Handwritten Documents using Weakly-Supervised Segmentation Proceedings of the 12th International Conference on Document Analysis and Recognition, 25-28 Aug. 2013, Washington DC, USA. [PDF]
Shrey Dutta, Naveen Sankaran, Pramod Sankar K and C V Jawahar - Robust Recognition of Degraded Documents Using Character N-Grams Proceedings of 10th IAPR International Workshop on Document Analysis Systems 27-29 Mar. 2012, ISBN 978-1-4673-0868-7, pp. 130-134, Queensland, Australia. [PDF]
Sudha Praveen M., Pramod Sankar K., and C.V. Jawahar - Character n-Gram Spotting in Document Images Proceedings of 11th International Conference on Document Analysis and Recognition (ICDAR 2011),18-21 September, 2011, Beijing, China. [PDF]
Pramod Kompalli, C.V.Jawahar and R. Manmatha - Nearest Neighbor based Collection OCR Proceedings of Ninth IAPR International Workshop on Document Analysis Systems (DAS'10), pp. 207-214, 9-11 June, 2010, Boston, MA, USA. [PDF]
Pramod Sankar K, C. V. Jawahar and Andrew Zisserman - Subtitle-free Movie to Script Alignment Proceedings of British Machine Vision Conference (BMVC 09), 7-10 September, 2009, London, UK. [PDF]
Pramod Sankar K and C.V. Jawahar - Probabilistic Reverse Annotation for Large Scale Image Retrieval Proc of IEEE Computer Society Conference on Computer Vision and Pattern Recognition,Minneapolis, Minnesona, 18-23 June, 2007. [PDF]
Pramod Sankar K. and C.V. Jawahar - Enabling Search over Large Collections of Telugu Document Images-An Automatic Annotation Based Approach , 5th Indian Conference on Computer Vision, Graphics and Image Processing, Madurai, India, LNCS 4338 pp.837-848, 2006. [PDF]
Pramod Sankar K., Saurabh Pandey and C. V. Jawahar - Text Driven Temporal Segmentation of Cricket Videos , 5th Indian Conference on Computer Vision, Graphics and Image Processing, Madurai, India, LNCS 4338 pp.433-444, 2006. [PDF]
Pramod Sankar K, Vamshi Ambati, Lakshmi Hari and C. V. Jawahar, - Digitizing A Million Books: Challenges for Document Analysis, Proceedings of Seventh IAPR Workshop on Document Analysis Systems, 2006 (LNCS 3872), pp 425-436. [PDF]