
Geometry-aware methods for efficient and accurate 3D reconstruction


Rajvi Shah

Abstract

Advances in 3D sensing and reconstruction have made it possible to model large-scale environments from monocular images using structure from motion (SfM) and simultaneous localization and mapping (SLAM) algorithms. SfM- and SLAM-based 3D reconstruction has applications in digital archival and modeling of real-world objects and environments, visual localization for geo-tagging and information retrieval, and mapping and navigation for robotics and autonomous driving. In this thesis, we address problems in large-scale SfM for 3D reconstruction and localization. We introduce new methods for improving the efficiency and accuracy of the state-of-the-art SfM pipeline. A large-scale SfM pipeline deals with large unorganized collections of images pertaining to a particular geographical site. These image collections are either formed by retrieving relevant images from the Internet using textual queries, or captured for the specific purpose of 3D modeling, mapping, and navigation. Internet image collections tend to be noisier and present more challenges for reconstruction than datasets captured with the specific intention of reconstruction. In this thesis, we propose methods that help organize these large, unstructured, and noisy image collections into a structure that is useful for SfM methods: a match-graph (or view-graph). We first propose a geometry-aware two-stage approach for pairwise image matching that is both more efficient and yields higher-quality correspondences. We then extend this idea to the SfM pipeline and present an iterative multistage framework for coarse-to-fine 3D reconstruction. Finally, we suggest that a key to solving many reconstruction problems is to filter and improve the view-graph in a way that is specific to the underlying problem. To this effect, we propose a unified framework for view-graph selection and show its application to achieving multiple reconstruction objectives.

 

Year of completion:  December 2020
 Advisor : P J Narayanan

Related Publications

  • Rajvi Shah, Visesh Chari and P.J. Narayanan - View-graph Selection Framework for SfM European Conference on Computer Vision (ECCV), 2018, Munich, Germany [PDF]

  • Saumya Rawat, Siddhartha Gairola, Rajvi Shah and P.J. Narayanan - Find Me a Sky: A Data-Driven Method for Color-Consistent Sky Search and Replacement International Conference on Multimedia Modeling 2018: 216-228 [PDF]

  • Ishit Mehta, Parikshit Sakurikar, Rajvi Shah, P J Narayanan - SynCam: Capturing sub-frame synchronous media using smartphones IEEE International Conference on Multimedia and Expo (ICME 2017). [PDF]

  • Aditya Singh, Saurabh Saini, Rajvi Shah and P. J. Narayanan - Learning to hash-tag videos with Tag2Vec Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing. ACM, 2016. [PDF]

  • Aditya Singh, Saurabh Saini, Rajvi Shah and P J Narayanan - From Traditional to Modern : Domain Adaptation for Action Classification in Short Social Video Clips 38th German Conference on Pattern Recognition (GCPR 2016) Hannover, Germany, September 12-15 2016. [PDF]

  • Rajvi Shah, Vanshika Srivastava, P.J. Narayanan - Geometry-aware Feature Matching for Structure from Motion Applications Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 06-09 Jan 2015, Waikoloa Beach, USA. [PDF]

  • Rajvi Shah, Aditya Deshpande, P.J. Narayanan - Multistage SFM: Revisiting Incremental Structure from Motion Proceedings of the International Conference on 3D Vision, 08-11 Dec 2014, Tokyo, Japan. [PDF]

  • Rajvi Shah and P. J. Narayanan - Interactive video manipulation using object trajectories and scene backgrounds IEEE Transactions on Circuits and Systems for Video Technology 23.9 (2013): 1565-1576. [PDF]

  • Rajvi Shah and P.J. Narayanan - Trajectory based Video Object Manipulation Proceedings of IEEE International Conference on Multimedia and Expo (ICME 2011), 11-15 July, 2011, Barcelona, Spain. [PDF]

  • Rajvi Shah, P. J. Narayanan and Kishore Kothapalli - GPU-Accelerated Genetic Algorithms Proceedings of The 3rd Workshop on Parallel Architectures for Bio-inspired Algorithms (WPABA) in conjunction with Parallel Architectures for Compilation Techniques (PACT'10), 11-15 Sep. 2010, Vienna, Austria. [PDF]


Downloads

thesis

Anatomical Structure Segmentation in Retinal Images with Some Applications in Disease Detection


Arunava Chakravarty

Abstract

Color Fundus (CF) imaging and Optical Coherence Tomography (OCT) are widely used by ophthalmologists to visualize the retinal surface and the intra-retinal tissue layers, respectively. An accurate segmentation of the anatomical structures in these images is necessary to visualize and quantify the structural deformations that characterize retinal diseases such as Glaucoma, Diabetic Macular Edema (DME) and Age-related Macular Degeneration (AMD). In this thesis, we propose different frameworks for the automatic extraction of the boundaries of relevant anatomical structures in CF and OCT images. First, we address the problem of segmenting the Optic Disc (OD) and Optic Cup (OC) in CF images to aid in the detection of Glaucoma. We propose a novel boundary-based Conditional Random Field (CRF) framework to jointly extract both the OD and OC boundaries in a single optimization step. Although the OC is characterized by the relative drop in depth from the OD boundary, 2D CF images lack explicit depth information. The proposed method estimates depth from CF images in a supervised manner using a coupled, sparse dictionary trained on a set of pairs of CF images and depth maps (derived from OCT). Since our method requires a single CF image per eye during testing, it can be employed in the large-scale screening of glaucoma where expensive 3D imaging is unavailable. Next, we consider the task of intra-retinal tissue layer segmentation in cross-sectional OCT images, which is essential to quantify the morphological changes in specific tissue layers caused by AMD and DME. We propose a supervised CRF framework to jointly extract the eight layer boundaries in a single optimization step. 
In contrast to existing energy minimization based segmentation methods that employ handcrafted energy cost terms, we linearly parameterize the total CRF energy to allow the appearance features for each layer and the relative weights of the shape priors to be learned in a joint, end-to-end manner by employing the Structural Support Vector Machine formulation. The proposed method can aid ophthalmologists in the quantitative analysis of structural changes in the retinal tissue layers for clinical practice and large-scale clinical studies. Next, we explore Level Set based Deformable Models (LDM), a popular energy minimization framework for medical image segmentation. We model the LDM as a novel Recurrent Neural Network (RNN) architecture called the Recurrent Active Contour Evolution Network (RACE-net). In contrast to existing LDMs, RACE-net allows the curve evolution velocities to be learned in an end-to-end manner while minimizing the number of network parameters, computation time and memory requirements. The consistent performance of RACE-net on a diverse set of segmentation tasks, such as the extraction of the OD and OC in CF images, cell nuclei in histopathological images and the left atrium in cardiac MRI volumes, demonstrates its utility as a generic, off-the-shelf architecture for biomedical segmentation. Segmentation has many clinical applications, especially in the area of computer-aided diagnostics. We close this dissertation with some illustrative applications of the segmentation information. We consider the case of disease detection in CF and OCT images. We explore and benchmark two classification strategies for the detection of glaucoma from CF images, based on deep learning and handcrafted features respectively. Both methods use a combination of appearance features directly derived from the CF image and structural features derived from the OD and OC segmentation. 
We also construct a Normative Atlas for macular OCT volumes to aid in the detection of AMD. The irregularities in the Bruch’s membrane caused by the deposit of drusen are modeled as deviations from the normal anatomy represented by the Atlas Mean Template.

 

Year of completion:  November 2019
 Advisor : Jayanthi Sivaswamy

Related Publications

  • Chakravarty A, Gaddipati DJ and Jayanthi Sivaswamy - Construction of a Retinal Atlas for Macular OCT Volumes ICIAR 2018, Portugal [PDF]

  • Chakravarty A and Jayanthi Sivaswamy - RACE-net: A Recurrent Neural Network for Biomedical Image Segmentation IEEE Journal of Biomedical and Health Informatics [PDF]

  • Arunava Chakravarty and Jayanthi Sivaswamy - Joint optic disc and cup boundary extraction from monocular fundus images Computer Methods and Programs in Biomedicine 147 (2017): 51-61. [PDF]

  • Arunava Chakravarty and Jayanthi Sivaswamy - End-to-End Learning of a Conditional Random Field for Intra-retinal Layer Segmentation in Optical Coherence Tomography Annual Conference on Medical Image Understanding and Analysis. Springer, Cham, 2017. [PDF]

  • Chakrabarty L, Joshi G.D, Chakravarty A, Raman G.V., Krishnadas S.R. and Sivaswamy J. (2016) - Automated Detection of Glaucoma From Topographic Features of the Optic Nerve Head in Color Fundus Photographs, Journal of Glaucoma, 25(7), pp. 590-597. [PDF]

  • Arunava Chakravarty and Jayanthi Sivaswamy - Glaucoma Classification with a Fusion of Segmentation and Image-based Features Proc. of IEEE International Symposium on Bio-Medical Imaging (ISBI), 2016, 13 - 16 April, 2016, Prague. [PDF]

  • Ujjwal, Arunava Chakravarty, Jayanthi Sivaswamy - An Assistive Annotation System for Retinal Images Proceedings of the IEEE International Symposium on Biomedical Imaging : From Nano to Macro, 16-19 April 2015. [PDF]

  • Jayanthi Sivaswamy, S.R. Krishnadas, Arunava Chakravarty, Gopal Datt Joshi, Ujjwal, Tabish Abbas Syed - A Comprehensive Retinal Image Dataset for the Assessment of Glaucoma from the Optic Nerve Head Analysis JSM Biomed Imaging Data Papers 2(1) : 1004 (BIDP: 2015). [PDF]

  • Arunava Chakravarty, Jayanthi Sivaswamy - Coupled Sparse Dictionary for Depth-based Cup Segmentation from Single Color Fundus Image Proceedings of the MICCAI 2014, 14-18 Sep 2014, Boston, USA. [PDF]

  • M J J P Van Grinsven, Arunava Chakravarty, Jayanthi Sivaswamy, T. Theelen, B. Van Ginneken, C I Sanchez - A Bag of Words Approach for Discriminating between Retinal Images Containing Exudates or Drusen Proceedings of the IEEE 10th International Symposium on Biomedical Imaging : From Nano to Macro, 07-11 April 2013, San Francisco, CA, USA. [PDF]

  • Ujjwal, K. Sai Deepak, Arunava Chakravarty, Jayanthi Sivaswamy - Visual Saliency based Bright Lesion Detection and Discrimination in Retinal Images Proceedings of the IEEE 10th International Symposium on Biomedical Imaging : From Nano to Macro, 07-11 April 2013, San Francisco, CA, USA. [PDF]

  • Arunava Chakravarty, Jayanthi Sivaswamy - A Novel Approach for Quantification of Retinal Vessel Tortuosity using Quadratic Polynomial Decomposition Proceedings of the Indian Conference on Medical Informatics and Telemedicine, 28-30 Mar. 2013, Kharagpur, India. [PDF]


Downloads

thesis

Recognizing People in Image and Videos


Vijay Kumar

Abstract

Cameras and mobile phones have become an integral part of our everyday lives as they have become more portable, powerful, and affordable. We capture and share hundreds of pictures and videos with our friends, family, and social connections. Similarly, a large volume of such visual content is generated in surveillance, entertainment, and biometrics applications. People are without doubt the most important objects in this visual content. For instance, photos taken at a family event or movie videos are centered on humans. It is therefore of utmost importance to automatically detect, identify, and analyze the people appearing in images in order to better understand this content and make decisions based on it. In this thesis, we consider the problem of person detection and recognition in images. This is a well-explored topic in the vision community, with a vast literature. Current state-of-the-art recognition systems are able to identify people with a high degree of accuracy when images have high resolution and contain visible, near-frontal faces, and when the recognition system has access to a sufficiently large training gallery. However, these systems need significant improvement in challenging real-world applications such as surveillance or entertainment videos, where one must handle practical issues such as non-visibility of faces, limited training samples, and domain mismatch, in addition to instance variations such as pose, illumination, and resolution. While there are plenty of challenges pertaining to person recognition, we are interested in some of the open challenges that are relevant from the deployment perspective in diverse recognition scenarios. We first consider people detection, which is a prerequisite for a recognition system. We detect people in images by detecting their faces with an exemplar-based detector. The exemplar approach detects faces through Hough voting using an exemplar training database indexed with the bag-of-words method. We introduce two key ideas, referred to as “visual phrases” and “contextual weighting”, into the exemplar approach, which improve its performance significantly. We show that visual phrases, which encode dependencies between visual features, are discriminative, and we propose a strategy to incorporate them into exemplar voting. We also introduce the notion of spatial consistency for a visual feature, which weights each occurrence of a feature based on its global context. Our evaluation on popular in-the-wild face detection benchmarks demonstrates the significant improvement obtained using these ideas. We then focus on person recognition and consider several issues encountered in practical recognition systems. We first address the common and important issue of “unavailability of sufficient training samples” during recognition. We propose a solution based on semi-supervised learning that can efficiently learn from a small amount of labeled data and a large amount of unlabeled data. We demonstrate how the similarities between labeled and unlabeled samples can be effectively exploited to improve performance. We then consider the problem of “domain mismatch” between the training gallery (source) and the probe instances (target). We consider a recognition setup in which the objective is to identify people in a collection of probe images using a training gallery collected from a different domain. We propose a novel two-stage solution that generates labels for a few confident seed images from the target domain and propagates these labels to the remaining images using a graph-based framework. We evaluate our approach in several practical recognition scenarios such as movie videos and photo albums. We then consider a different recognition scenario in which faces are not completely visible due to occlusion or people facing away from the camera. To deal with such occluded and partially or completely non-visible faces, we exploit information from other body regions, such as the head, upper body, and full body, to improve recognition. When considering different body regions, variation in their pose presents a serious challenge. To handle the issues of an unreliable facial region and pose variation, we propose a technique that learns multiple pose-specific representations from different body regions. Our approach involves training a separate deep convolutional network for each pose and then combining their predictions using adaptive weights determined by the pose of the person. Person recognition approaches based on multiple body regions, however, require training multiple deep convolutional networks for different body regions, resulting in a large number of parameters and slower training and testing. To overcome this, we develop an end-to-end person recognition approach based on pooling and aggregation of discriminative features from multiple body regions. Our end-to-end convolutional network pools features from several pre-determined regions of interest and adaptively aggregates them using an attention mechanism to produce a compact representation. We evaluate our single end-to-end trained model on multiple person recognition benchmarks and show its effectiveness over multiple models trained on different body regions. Finally, we note that all of our work was developed with a keen focus on its applicability in real-world applications. We have created and publicly released datasets and source code in the process.
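
The graph-based label propagation step mentioned in the abstract above can be illustrated with a minimal Python sketch. This is a generic, textbook-style propagation scheme and not the formulation used in the thesis: the function name `propagate_labels`, the similarity matrix, and the clamping of seed nodes are all assumptions made for illustration.

```python
def propagate_labels(similarity, seed_labels, num_classes, iterations=50):
    """Spread seed labels over a similarity graph by iterative averaging.

    similarity: n x n list of non-negative edge weights (hypothetical input).
    seed_labels: dict mapping node index -> class index for confident seeds.
    Returns a list holding one predicted class index per node.
    """
    n = len(similarity)
    # Class-score rows: one-hot for seeds, all zeros for unlabeled nodes.
    scores = [[0.0] * num_classes for _ in range(n)]
    for node, label in seed_labels.items():
        scores[node][label] = 1.0

    for _ in range(iterations):
        updated = []
        for i in range(n):
            total = sum(similarity[i][j] for j in range(n) if j != i)
            if i in seed_labels or total == 0:
                # Seeds stay clamped; isolated nodes keep their scores.
                updated.append(scores[i][:])
                continue
            # Each unlabeled node takes the weighted mean of its neighbours.
            updated.append([
                sum(similarity[i][j] * scores[j][c] for j in range(n) if j != i) / total
                for c in range(num_classes)
            ])
        scores = updated

    return [max(range(num_classes), key=lambda c: scores[i][c]) for i in range(n)]


# Toy graph: nodes 0-1 and 2-3 are strongly linked, 1-2 only weakly.
toy_similarity = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
toy_prediction = propagate_labels(toy_similarity, {0: 0, 3: 1}, num_classes=2)
```

In this toy graph each unlabeled node ends up with the label of the seed it is most strongly connected to; a practical system would build the similarity graph from face or body descriptors.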

 

Year of completion:  July 2019
 Advisor : Anoop M Namboodiri and C.V. Jawahar

Related Publications

  • Vijay Kumar, Anoop Namboodiri and C.V. Jawahar - Semi-supervised annotation of faces in image collection. Signal, Image and Video Processing (2017). [PDF]

  • Vijay Kumar, Anoop Namboodiri, Manohar Paluri and C. V. Jawahar - Pose-Aware Person Recognition Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [PDF]

  • Vijay Kumar, R. Raghavendra, Anoop Namboodiri and Christoph Busch - Robust transgender face recognition: Approach based on appearance and therapy factors Identity, Security and Behavior Analysis (ISBA), 2016 IEEE International Conference on. IEEE, 2016. [PDF]

  • Vijay Kumar R., Anoop M. Namboodiri, C.V. Jawahar - Visual Phrases for Exemplar Face Detection Proceedings of the International Conference on Computer Vision (ICCV 2015), 13-16 Dec 2015, Santiago, Chile. [PDF]

  • Hiba Ahsan, Vijay Kumar, C. V. Jawahar - Multi-Label Annotation of Music Proceedings of the Eighth International Conference on Advances in Pattern Recognition, 04-07 Jan 2015, Kolkata, India. [PDF]

  • Vijay Kumar, Anoop M. Namboodiri and C.V. Jawahar - Face Recognition in Videos by Label Propagation Proceedings of the 22nd International Conference on Pattern Recognition, 24-28 Aug 2014, Stockholm, Sweden. [PDF]

  • Vijay Kumar, Harit Pandya and C.V. Jawahar - Identifying Ragas in Indian Music Proceedings of the 22nd International Conference on Pattern Recognition, 24-28 Aug 2014, Stockholm, Sweden. [PDF]

  • Shankar Setty, Moula Husain, Parisa Beham, Jyothi Gudavalli, Menaka Kandasamy, Radehsyam Vaddi, Vidyagouri Hemadri, JC Karure, Raja Raju, B.Rajan, Vijay Kumar and C V Jawahar - Indian Movie Face Database: A Benchmark for Face Recognition Under Wide Variations Proceedings of the IEEE National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, 18-21 Dec. 2013, Jodhpur, India. [PDF]

  • Vijay Kumar, Anoop M Namboodiri and C V Jawahar - Sparse Representation based Face Recognition with Limited Labeled Samples Proceedings of the 2nd Asian Conference Pattern Recognition, 05-08 Nov. 2013, Okinawa, Japan. [PDF]

  • Vijay Kumar, Amit Bansal, Goutam Hari Tulsiyan, Anand Mishra, Anoop M. Namboodiri, C V Jawahar - Sparse Document Image Coding for Restoration Proceedings of the 12th International Conference on Document Analysis and Recognition, 25-28 Aug. 2013, Washington DC, USA. [PDF]


Downloads

thesis

Human Pose Retrieval for Image and Video collections


Nataraj Jammalamadaka (homepage)

Abstract

With the overwhelming amount of visual data on the Internet, a search capability for this data is clearly needed. While several commercial systems have been built to retrieve images and videos using metadata, many images and videos do not have detailed descriptions. This problem has been addressed by content-based image retrieval (CBIR) systems, which retrieve images by their visual content. In this thesis, we demonstrate that images and videos can be retrieved using the pose of the humans present in them. Here, pose is the 2D/3D spatial arrangement of anatomical body parts such as arms and legs. Pose is an important cue that conveys the action, gesture, and mood of a person. Retrieving humans using pose has commercial implications in domains such as dance (the query being a dance pose) and sports (the query being a shot). In this thesis, we propose three pose representations that can be used for retrieval. Using one of these representations, we build a real-time pose retrieval system over a million movie frames. Our first pose representation is based on the output of human pose estimation (HPE) algorithms [5, 26, 80, 103], which estimate the pose of a person given an image. Unfortunately, these algorithms are not entirely reliable and often make mistakes. We address this problem by proposing an evaluator that predicts whether an HPE algorithm has succeeded. We describe the stages required to learn and test an evaluator, including the use of an annotated ground-truth dataset for training and testing, and the development of auxiliary features that are not used by the HPE algorithm but can be learnt by the evaluator to predict whether the output is correct. To demonstrate our ideas, we build an evaluator for each of four recently developed HPE algorithms using their publicly available implementations: Andriluka et al. [5], Eichner and Ferrari [26], Sapp et al. [80], and Yang and Ramanan [103]. We demonstrate that in each case our evaluator is able to predict whether the algorithm has correctly estimated the pose. In this context, we also provide a new dataset of annotated stickmen. Further, we propose innovative ways in which a pose evaluator can be used. Specifically, we show how a pose evaluator can be used to filter incorrect pose estimates, to fuse outputs from different HPE algorithms, and to improve a pose search application. Our second pose representation is inspired by poselets [13], which are body part detectors. First, we introduce deep poselets for pose-sensitive detection of various body parts, built on convolutional neural network (CNN) features. These deep poselets significantly outperform previous instantiations of Berkeley poselets [13]. Second, using these detector responses, we construct a pose representation that is suitable for pose search, and show that its pose retrieval performance is on par with state-of-the-art pose representations. The compared methods include bag of visual words [85], Berkeley poselets [13], and human pose estimation algorithms [103, 18, 68]. All the methods are quantitatively evaluated on a large dataset of images built from a number of standard benchmarks together with frames from Hollywood movies. Our third pose representation is based on embedding an image into a lower-dimensional pose-sensitive manifold. Here we make the following contributions: (a) we design an optimized neural network that maps the input image to a very low-dimensional space where similar poses are close together and dissimilar poses are farther apart, and (b) we show that a pose retrieval system using this low-dimensional representation is on par with the deep poselet representation. Finally, we describe a method for real-time video retrieval where the task is to match the 2D human pose of a query. A user can form a query by (i) interactively controlling a stickman on a web-based GUI, (ii) uploading an image of the desired pose, or (iii) using the Kinect and acting out the query themselves. The method is scalable and is applied to a dataset of 22 movies totaling more than three million frames. Real-time performance is achieved by searching for approximate nearest neighbors to the query using a random forest of K-D trees. Apart from the query modalities, we introduce two other areas of novelty. First, we show that pose retrieval can proceed using a low-dimensional representation. Second, we show that the precision of the results can be improved substantially by combining the outputs of independent human pose estimation algorithms. The performance of the system is assessed quantitatively over a range of pose queries.

 

Year of completion:  July 2017
 Advisor : C V Jawahar and Andrew Zisserman

Related Publications

  • Nataraj Jammalamadaka, Andrew Zisserman and C.V. Jawahar - Human pose search using deep networks Image and Vision Computing 59 (2017): 31-43. [PDF]

  • Nataraj Jammalamadaka, Andrew Zisserman, C.V. Jawahar - Human Pose Search using Deep Poselets Proceedings of the 11th IEEE International Conference on Automatic Face and Gesture Recognition, 04-08 May 2015, Ljubljana, Slovenia. [PDF]

  • Nataraj Jammalamadaka, Andrew Zisserman and C.V. Jawahar - Human pose search using deep poselets Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on. Vol. 1. IEEE, 2015. [PDF]

  • Digvijay Singh, Ayush Minocha, Nataraj Jammalamadaka and C. V. Jawahar - Real-time Face Detection, Pose Estimation and Landmark Localization Proceedings of the IEEE National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, 18-21 Dec. 2013, Jodhpur, India. [PDF]

  • Nataraj Jammalamadaka, Ayush Minocha, Digvijay Singh and C V Jawahar - Parsing Clothes in Unrestricted Images Proceedings of the 24th British Machine Vision Conference, 09-13 Sep. 2013, Bristol, UK. [PDF]

  • Digvijay Singh, Ayush Minocha, Nataraj Jammalamadaka and C.V. Jawahar - Near real-time face parsing Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2013 Fourth National Conference on. IEEE, 2013. [PDF]

  • Nataraj Jammalamadaka, Andrew Zisserman, Marcin Eichner, Vittorio Ferrari and C V Jawahar - Video Retrieval by Mimicking Poses Proceedings of ACM International Conference on Multimedia Retrieval, 5-8 June 2012, Article No.34, ISBN 978-1-4503-1329-2, Hong Kong. [PDF]

  • Nataraj Jammalamadaka, Andrew Zisserman, Marcin Eichner, Vittorio Ferrari and C V Jawahar - Has My Algorithm Succeeded? An Evaluator for Human Pose Estimators Proceedings of 12th European Conference on Computer Vision, 7-13 Oct. 2012, Print ISBN 978-3-642-33711-6, Vol. ECCV 2012, Part-III, LNCS 7574, pp. 114-128, Firenze, Italy. [PDF]

  • Nataraj Jammalamadaka, Vikram Pudi and C. V. Jawahar - Efficient Search with Changing Similarity Measures on Large Multimedia Datasets Proc. of The International Multimedia Modelling Conference (MMM2007), LNCS 4352, Part-II, pp. 206-215, 2007. [PDF]

  • C. V. Jawahar, Balakrishna Chennupati, Balamanohar Paluri and Nataraj Jammalamadaka, Video Retrieval Based on Textual Queries , Proceedings of the Thirteenth International Conference on Advanced Computing and Communications, Coimbatore, December 2005. [PDF]


Downloads

thesis

Understanding Semantic Association Between Images and Text


Yashaswi Verma (homepage)

Abstract

Over the last two decades, vast amounts of digital data have been created, a large portion of which is visual data comprising images and videos. To deal with this large amount of data, it has become necessary to build systems that can help humans efficiently organize, index, and retrieve from such data. While modern search engines are quite efficient at text-based indexing and retrieval, their “visual cortex” is still evolving. One way to address this is to apply technologies similar to those used for textual data to archiving and retrieving visual data. Also, in practice, people find it more convenient to interact with visual data using text as an interface rather than a visual interface. However, this in turn requires describing images and videos using natural text. Since it is practically infeasible to annotate these large visual collections manually, we need to develop automatic techniques for doing so. In this thesis, we present our attempts towards modelling and learning semantic associations between images and different forms of text such as labels, phrases, and captions. Our first problem is that of tagging images with discrete semantic labels, also called image annotation. To address this, we describe two approaches. In the first approach, we propose a novel extension of the conventional weighted k-nearest neighbour algorithm that tries to address the issues of class imbalance and incomplete labelling that are quite common in the image annotation task. In the second approach, we first analyze why the conventional SVM algorithm, despite its strong theoretical properties, does not achieve results as good as those of nearest neighbour based methods on the image annotation task. Based on this analysis, we propose an extension of SVM that introduces a tolerance parameter into the hinge loss. This additional parameter helps make binary models tolerant to practical challenges such as incomplete labelling, label ambiguity, and structural overlap. Next, we target the problems of image captioning and caption-based image retrieval. Rather than using either individual words or entire captions, we propose to make use of textual phrases for both these tasks, e.g., “aeroplane at airport”, “person riding”, etc. These phrases are automatically extracted from available data based on linguistic constraints. To generate a caption for a new image, the phrases present in the neighbouring (annotated) images are first ranked based on visual similarity. These are then integrated into a pre-defined template for caption generation. During caption-based image retrieval, a given query is first decomposed into such phrases, and images are then ranked based on their joint relevance to these phrases. Lastly, we address the problem of cross-modal image-text retrieval. We first present a novel Structural SVM based formulation for this task. We show that our formulation is generic and can be used with a variety of loss functions as well as feature vector based representations. Next, we try to model higher-level semantics in multi/cross-modal data based on shared category information. For this, we first propose the notion of cross-specificity, and then present a generic framework based on cross-specificity that can be used as a wrapper function over several cross-modal matching approaches and helps boost their performance on the cross-modal retrieval task. We evaluate the proposed methods on a number of popular and relevant datasets. On the image annotation task, we achieve near state-of-the-art results under multiple evaluation metrics. On the image captioning task, we achieve superior results compared to conventional methods that are mostly based on visual cues and corpus statistics. On the cross-modal retrieval task, both our approaches provide compelling improvements over baseline cross-modal retrieval techniques.

 

Year of completion:  July 2017
 Advisor : Prof. C.V. Jawahar

Related Publications

  • Ayushi Dutta, Yashaswi Verma, and C.V. Jawahar - Automatic image annotation: the quirks and what works Multimedia Tools and Applications An International Journal [PDF]

  • Yashaswi Verma and C.V. Jawahar - A support vector approach for cross-modal search of images and texts Computer Vision and Image Understanding 154 (2017): 48-63. [PDF]

  • Yashaswi Verma, C.V. Jawahar - A Robust Distance with Correlated Metric Learning for Multi-Instance Multi-Label Data Proceedings of the ACM Multimedia, 2016, Amsterdam, The Netherlands. [PDF]

  • Yashaswi Verma, C.V. Jawahar - Image Annotation by Propagating Labels from Semantic Neighbourhoods International Journal of Computer Vision (IJCV), 2016. [PDF]

  • Yashaswi Verma, C. V. Jawahar - A Probabilistic Approach for Image Retrieval Using Descriptive Textual Queries Proceedings of the ACM Multimedia, 26-30 Oct 2015, Brisbane, Australia. [PDF]

  • Yashaswi Verma, C.V. Jawahar - Exploring Locally Rigid Discriminative Patches for Learning Relative Attributes Proceedings of the 26th British Machine Vision Conference, 07-10 Sep 2015, Swansea, UK. [PDF]

  • Yashaswi Verma and C.V. Jawahar - Im2Text and Text2Im: Associating Images and Texts for Cross-Modal Retrieval Proceedings of British Machine Vision Conference, 01-05 Sep 2014, Nottingham, UK. [PDF]

  • Ramachandruni N. Sandeep, Yashaswi Verma and C.V. Jawahar - Relative Parts : Distinctive Parts for Learning Relative Attributes Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 23-28 June 2014, Columbus, Ohio, USA. [PDF]

  • Sandeep, Ramachandruni N, Yashaswi Verma and C.V. Jawahar - Relative parts: Distinctive parts for learning relative attributes Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. [PDF]

  • Yashaswi Verma and C V Jawahar - Exploring SVM for Image Annotation in Presence of Confusing Labels Proceedings of the 24th British Machine Vision Conference, 09-13 Sep. 2013, Bristol, UK. [PDF]

  • Yashaswi Verma, Ankush Gupta, Prashanth Mannem and C.V. Jawahar - Generating image descriptions using semantic similarities in the output space Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013. [PDF]

  • Yashaswi Verma and C V Jawahar - Neti Neti: In Search of Deity Proceedings of the 8th Indian Conference on Vision, Graphics and Image Processing, 16-19 Dec. 2012, Bombay, India. [PDF]

  • Yashaswi Verma and C V Jawahar - Image Annotation using Metric Learning in Semantic Neighbourhoods Proceedings of 12th European Conference on Computer Vision, 7-13 Oct. 2012, Print ISBN 978-3-642-33711-6, Vol. ECCV 2012, Part-III, LNCS 7574, pp. 114-128, Firenze, Italy. [PDF]

  • Ankush Gupta, Yashaswi Verma and C.V. Jawahar - Choosing Linguistics over Vision to Describe Images AAAI. 2012. [PDF]


Downloads

thesis
