Understanding Semantic Association Between Images and Text

Yashaswi Verma


Over the last two decades, vast amounts of digital data have been created, a large portion of which is visual data comprising images and videos. To deal with this volume of data, it has become necessary to build systems that help humans efficiently organize, index and retrieve from such collections. While modern search engines are quite efficient at text-based indexing and retrieval, their “visual cortex” is still evolving. One way to address this is to adapt the technologies used for textual data to archiving and retrieving visual data. Moreover, in practice people find it more convenient to interact with visual data using text as an interface rather than a visual one. This, however, requires describing images and videos using natural text. Since it is practically infeasible to annotate these large visual collections manually, we need to develop automatic techniques for this purpose.

In this thesis, we present our attempts towards modelling and learning semantic associations between images and different forms of text such as labels, phrases and captions. Our first problem is that of tagging images with discrete semantic labels, also called image annotation. To address this, we describe two approaches. In the first approach, we propose a novel extension of the conventional weighted k-nearest neighbour algorithm that addresses the issues of class imbalance and incomplete labelling, both of which are quite common in the image annotation task. In the second approach, we first analyze why the conventional SVM algorithm, despite its strong theoretical properties, does not achieve results as good as those of nearest-neighbour-based methods on this task. Based on this analysis, we propose an extension of the SVM by introducing a tolerance parameter into the hinge loss. This additional parameter helps make binary models tolerant to practical challenges such as incomplete labelling, label ambiguity and structural overlap.
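To illustrate the second idea, a standard hinge loss penalizes every sample that falls inside the margin, including images that are wrongly treated as negatives because of incomplete labelling. Below is a minimal sketch (not the thesis formulation) of how a tolerance parameter, here called `tol` (a hypothetical name), could relax that penalty:

```python
def hinge_loss(score, label, tol=0.0):
    """Hinge loss with a tolerance term.

    score : classifier output f(x)
    label : ground-truth label, +1 or -1
    tol   : tolerance that shrinks the required margin, so that
            mild violations (e.g. false negatives caused by
            incomplete labelling) incur a reduced penalty.
    """
    margin = label * score
    return max(0.0, 1.0 - tol - margin)

# A negative sample scored at +0.5 is penalized heavily
# by the standard hinge loss (tol = 0):
print(hinge_loss(0.5, -1))           # 1.5
# With a non-zero tolerance, the same sample's penalty shrinks,
# softening the effect of images that are missing a label:
print(hinge_loss(0.5, -1, tol=1.0))  # 0.5
# Correctly classified samples outside the margin incur no loss:
print(hinge_loss(2.0, 1))            # 0.0
```

The single scalar `tol` acts uniformly here; the thesis formulation is more nuanced, but the sketch shows the basic mechanism by which relaxing the hinge makes binary models more forgiving of noisy supervision.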
Next, we target the problems of image captioning and caption-based image retrieval. Rather than using either individual words or entire captions, we propose to make use of textual phrases for both tasks; e.g., “aeroplane at airport”, “person riding”, etc. These phrases are automatically extracted from available data based on linguistic constraints. To generate a caption for a new image, the phrases present in its neighbouring (annotated) images are first ranked based on visual similarity, and then integrated into a pre-defined template for caption generation. During caption-based image retrieval, a given query is first decomposed into such phrases, and images are then ranked based on their joint relevance to these phrases.

Lastly, we address the problem of cross-modal image-text retrieval. For this, we first present a novel Structural SVM based formulation. We show that our formulation is generic, and can be used with a variety of loss functions as well as feature-vector-based representations. Next, we try to model higher-level semantics in multi-/cross-modal data based on shared category information. For this, we first propose the notion of cross-specificity, and then present a generic framework based on cross-specificity that can be used as a wrapper over several cross-modal matching approaches, boosting their performance on the cross-modal retrieval task.

We evaluate the proposed methods on a number of popular and relevant datasets. On the image annotation task, we achieve near state-of-the-art results under multiple evaluation metrics. On the image captioning task, we achieve superior results compared to conventional methods that are mostly based on visual cues and corpus statistics. On the cross-modal retrieval task, both our approaches provide compelling improvements over baseline cross-modal retrieval techniques.
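The phrase-ranking step described above can be sketched in a few lines: each phrase attached to a neighbouring annotated image is scored by accumulating the visual similarity between that image and the query. The function names, feature vectors and phrases below are all hypothetical placeholders, and cosine similarity stands in for whatever visual similarity the system actually uses:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_phrases(query_feat, neighbours):
    """Rank candidate phrases for a query image.

    neighbours : list of (feature_vector, [phrases]) pairs taken
                 from annotated images. Each phrase is scored by
                 accumulating the visual similarity of the images
                 it occurs in, then phrases are sorted by score.
    """
    scores = {}
    for feat, phrases in neighbours:
        sim = cosine(query_feat, feat)
        for p in phrases:
            scores[p] = scores.get(p, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)

# Toy 2-D "visual features" for three annotated neighbours:
neighbours = [
    ([1.0, 0.0], ["aeroplane at airport", "person standing"]),
    ([0.9, 0.1], ["aeroplane at airport"]),
    ([0.0, 1.0], ["person riding"]),
]
ranked = rank_phrases([1.0, 0.0], neighbours)
print(ranked[0])  # "aeroplane at airport" dominates for this query
```

The top-ranked phrases would then be slotted into a pre-defined caption template; for retrieval, the same scores can be aggregated per image to rank images against the phrases extracted from a textual query.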


Year of completion: July 2017
Advisor: Prof. C.V. Jawahar

Related Publications

  • Ayushi Dutta, Yashaswi Verma and C.V. Jawahar - Automatic Image Annotation: The Quirks and What Works, Multimedia Tools and Applications. [PDF]

  • Yashaswi Verma and C.V. Jawahar - A Support Vector Approach for Cross-Modal Search of Images and Texts, Computer Vision and Image Understanding, 154 (2017): 48-63. [PDF]

  • Yashaswi Verma and C.V. Jawahar - A Robust Distance with Correlated Metric Learning for Multi-Instance Multi-Label Data, Proceedings of ACM Multimedia, 2016, Amsterdam, The Netherlands. [PDF]

  • Yashaswi Verma and C.V. Jawahar - Image Annotation by Propagating Labels from Semantic Neighbourhoods, International Journal of Computer Vision (IJCV), 2016. [PDF]

  • Yashaswi Verma and C.V. Jawahar - A Probabilistic Approach for Image Retrieval Using Descriptive Textual Queries, Proceedings of ACM Multimedia, 26-30 Oct 2015, Brisbane, Australia. [PDF]

  • Yashaswi Verma and C.V. Jawahar - Exploring Locally Rigid Discriminative Patches for Learning Relative Attributes, Proceedings of the 26th British Machine Vision Conference, 07-10 Sep 2015, Swansea, UK. [PDF]

  • Yashaswi Verma and C.V. Jawahar - Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval, Proceedings of the British Machine Vision Conference, 01-05 Sep 2014, Nottingham, UK. [PDF]

  • Ramachandruni N. Sandeep, Yashaswi Verma and C.V. Jawahar - Relative Parts: Distinctive Parts for Learning Relative Attributes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 23-28 June 2014, Columbus, Ohio, USA. [PDF]

  • Yashaswi Verma and C.V. Jawahar - Exploring SVM for Image Annotation in Presence of Confusing Labels, Proceedings of the 24th British Machine Vision Conference, 09-13 Sep 2013, Bristol, UK. [PDF]

  • Yashaswi Verma, Ankush Gupta, Prashanth Mannem and C.V. Jawahar - Generating Image Descriptions Using Semantic Similarities in the Output Space, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013. [PDF]

  • Yashaswi Verma and C.V. Jawahar - Neti Neti: In Search of Deity, Proceedings of the 8th Indian Conference on Vision, Graphics and Image Processing, 16-19 Dec 2012, Bombay, India. [PDF]

  • Yashaswi Verma and C.V. Jawahar - Image Annotation Using Metric Learning in Semantic Neighbourhoods, Proceedings of the 12th European Conference on Computer Vision, 7-13 Oct 2012, Print ISBN 978-3-642-33711-6, Vol. ECCV 2012, Part-III, LNCS 7574, pp. 114-128, Firenze, Italy. [PDF]

  • Ankush Gupta, Yashaswi Verma and C.V. Jawahar - Choosing Linguistics over Vision to Describe Images, Proceedings of the AAAI Conference on Artificial Intelligence, 2012. [PDF]