
Visual Grounding for Multi-modal Applications


Kanishk Jain

Abstract

The task of Visual Grounding lies at the intersection of computer vision and natural language processing. The Visual Grounding (VG) task requires spatially localizing an entity in a visual scene based on its linguistic description. The capability to ground language in the visual domain is of significant importance for many real-world applications, especially for human-machine interaction. One such application is language-guided navigation, where the navigation of autonomous vehicles is modulated using a linguistic command. The VG task is intimately linked with the task of vision-language navigation (VLN), as both tasks require reasoning about the linguistic command and the visual scene simultaneously. Existing approaches to VG can be divided into two categories based on the type of localization performed: (1) bounding-box/proposal-based localization and (2) pixel-level localization. This work focuses on pixel-level localization, where the segmentation mask corresponding to the entity/region referred to by the linguistic expression is predicted. The research in this thesis focuses on a novel modeling strategy for the visual and linguistic modalities for the VG task, followed by the first visual-grounding-based approach to the VLN task. We first present a novel architecture for the task of pixel-level localization, also known as Referring Image Segmentation (RIS). The architecture is based on the hypothesis that both intra-modal (word-word and pixel-pixel) and inter-modal (word-pixel) interactions are required to identify the referred entity successfully. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions synchronously in a single step. We validate our hypothesis empirically against existing methods and achieve state-of-the-art results on RIS benchmarks. Finally, we propose the novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on a linguistic command. RNR differs from RIS: instead of grounding an object referred to by the natural language expression, it grounds a navigable region. We additionally introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2Car dataset with segmentation masks for the regions described by the linguistic commands. We present extensive ablations and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework.
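
To make the idea of synchronous intra- and inter-modal interactions concrete, below is a minimal PyTorch sketch (an illustration under assumed shapes, not the thesis implementation): word and pixel tokens are concatenated and passed through a single self-attention layer, so word-word, pixel-pixel, and word-pixel interactions are all computed in one step. The class name JointAttentionBlock and the toy dimensions are hypothetical.

import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    # Illustrative sketch: one self-attention pass over the concatenation of word
    # and pixel tokens, so intra-modal (word-word, pixel-pixel) and inter-modal
    # (word-pixel) interactions happen synchronously in a single step.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, word_feats, pixel_feats):
        # word_feats: (B, L, C) language tokens; pixel_feats: (B, H*W, C) visual tokens
        joint = torch.cat([word_feats, pixel_feats], dim=1)   # (B, L + H*W, C)
        fused, _ = self.attn(joint, joint, joint)              # all pairwise interactions at once
        fused = self.norm(joint + fused)
        # split back; the pixel part can be decoded into a segmentation mask
        return fused[:, :word_feats.size(1)], fused[:, word_feats.size(1):]

# toy usage
block = JointAttentionBlock()
words = torch.randn(2, 12, 256)         # 12 word tokens
pixels = torch.randn(2, 32 * 32, 256)   # 32x32 feature map, flattened
_, pixel_out = block(words, pixels)
print(pixel_out.shape)                  # torch.Size([2, 1024, 256])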

Year of completion: December 2022
Advisor: Vineet Gandhi

Related Publications


Downloads

thesis

I-Do, You-Learn: Techniques for Unsupervised Procedure Learning using Egocentric Videos


Siddhant Bansal

Abstract

Consider an autonomous agent capable of observing multiple humans make a pizza and then making one itself the next time! Motivated to contribute towards creating systems capable of understanding and reasoning about instructions at the human level, in this thesis we tackle procedure learning. Procedure learning involves identifying the key-steps of a task and determining their logical order. The first portion of this thesis focuses on the datasets curated for procedure learning. Existing datasets commonly consist of third-person videos, in which the manipulated object appears small and is often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action. To this end, for studying procedure learning from egocentric videos, we propose the EgoProceL dataset. However, procedure learning from egocentric videos is challenging because the camera view undergoes extreme changes due to the wearer's head motion, which introduces unrelated frames. As a result, the assumption made by current state-of-the-art methods that actions occur at approximately the same time and are of the same duration does not hold. Instead, we propose to use the signal provided by the temporal correspondences between key-steps across videos. To this end, we present a novel self-supervised Correspond and Cut (CnC) framework that identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. We perform experiments on the benchmark ProceL and CrossTask datasets and achieve state-of-the-art results. In the second portion of the thesis, we look at various approaches to generate the signal for learning the embedding space. Existing approaches use only one or a couple of videos for this purpose. However, we argue that this makes key-step discovery challenging, as the algorithms lack an inter-video perspective. To this end, we propose an unsupervised Graph-based Procedure Learning (GPL) framework. GPL consists of the novel UnityGraph, which represents all the videos of a task as a single graph to obtain both intra-video and inter-video context. Further, to obtain similar embeddings for the same key-steps, the embeddings of UnityGraph are updated in an unsupervised manner using the Node2Vec algorithm. Finally, to identify the key-steps, we cluster the embeddings using k-means. We test GPL on the benchmark ProceL, CrossTask, and EgoProceL datasets and achieve an average improvement of 2% on third-person datasets and 3.6% on EgoProceL over the state-of-the-art. We hope this work motivates future research on procedure learning from egocentric videos. Furthermore, the unsupervised approaches proposed in the thesis will help create scalable systems and drive future research toward creative solutions.
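
The GPL pipeline described above (a graph over video segments, Node2Vec embeddings, k-means to discover key-steps) can be illustrated with a toy sketch. The graph construction below is a deliberately simplified stand-in for UnityGraph, and the node names, edge rules, and hyperparameters are assumptions, not the thesis code.

import networkx as nx
import numpy as np
from node2vec import Node2Vec          # pip install node2vec
from sklearn.cluster import KMeans

# Toy stand-in for UnityGraph: nodes are video segments from two videos of the
# same task; edges connect temporally adjacent segments within a video and
# (assumed) corresponding segments across videos.
G = nx.Graph()
segments = [f"vid{v}_seg{s}" for v in (1, 2) for s in range(5)]
G.add_nodes_from(segments)
for v in (1, 2):                                    # intra-video temporal edges
    for s in range(4):
        G.add_edge(f"vid{v}_seg{s}", f"vid{v}_seg{s + 1}")
for s in range(5):                                  # inter-video correspondence edges
    G.add_edge(f"vid1_seg{s}", f"vid2_seg{s}")

# Unsupervised node embeddings via Node2Vec, then k-means to group segments
# belonging to the same key-step.
n2v = Node2Vec(G, dimensions=16, walk_length=8, num_walks=50, workers=1)
model = n2v.fit(window=4, min_count=1)
emb = np.stack([model.wv[n] for n in segments])
key_steps = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
print(dict(zip(segments, key_steps.tolist())))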

Year of completion: February 2023
Advisors: C V Jawahar, Chetan Arora

Related Publications


Downloads

thesis

Continual and Incremental Learning in Computer-aided Diagnosis Systems


Prathyusha Akundi

Abstract

Deep Neural Networks (DNNs) have shown remarkable performance in a broad range of computer vision tasks, including in the medical domain. With the advent of DNNs, the medical community has witnessed significant developments in segmentation, classification, and detection. But this success comes at the cost of heavy reliance on abundant data. Medical data, however, is often highly limited in volume and quality due to sparsity of patient contact, variability in medical care, and privacy concerns. Hence, to train large networks, we seek data from different sources. In such a scenario, it is of interest to design a model that learns continuously and adapts to datasets or tasks as and when they are available. However, one of the important steps to achieve such a never-ending learning process is to overcome Catastrophic Forgetting (CF) of previously seen data or tasks. CF refers to the significant degradation in performance on an old task/dataset when the model is trained on new ones. To avoid confusion, we call a training regime Continual Learning (CL) when CAD systems have to handle a sequence of datasets collected over time from different sites with different imaging parameters/populations. Similarly, Incremental Learning (IL) is when CAD systems have to learn new classes as and when new annotations are made available. The work described in this thesis addresses core aspects of both CL and IL and has been compared against state-of-the-art methods. In this thesis, we assume that access to the data of previously trained datasets or tasks is not available, which makes both CL and IL even more challenging. We start by developing a CL system that learns sequentially on different datasets and handles CF using an uncertainty mechanism. The system consists of an ensemble of models, each trained or finetuned on one dataset, and it considers the prediction from the model with the least uncertainty. We then investigate a new way to tackle CF in CL through manifold learning, inspired by defense mechanisms against adversarial attacks. Our method uses a 'Reformer', essentially a denoising autoencoder, which 'reforms' or brings the data from all the datasets together towards a common manifold. These reformed samples are then passed to the network to learn the desired task. Towards IL, we propose a novel approach that ensures a model remembers the causal factor behind its decisions on the old classes while incrementally learning new classes. We introduce a common auxiliary task during the course of incremental training, whose hidden representations are shared across all the classification heads. All the experiments for both CL and IL are conducted on multiple datasets and show significant improvement over state-of-the-art methods.
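
As a rough illustration of the uncertainty-based ensemble described above, the sketch below keeps, for each test sample, the prediction of the ensemble member with the lowest uncertainty. Softmax entropy is used here only as a stand-in uncertainty measure, and the function names and toy models are hypothetical, not the thesis implementation.

import torch
import torch.nn.functional as F

def predictive_entropy(logits):
    # Entropy of the softmax distribution; lower values mean higher confidence.
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

@torch.no_grad()
def ensemble_predict(models, x):
    # Each model is assumed to have been trained/finetuned on one dataset of the
    # sequence; the prediction of the least-uncertain model is kept per sample.
    logits = torch.stack([m(x) for m in models])                  # (M, B, C)
    ent = torch.stack([predictive_entropy(l) for l in logits])    # (M, B)
    best = ent.argmin(dim=0)                                      # least-uncertain model per sample
    return logits[best, torch.arange(x.size(0))]                  # (B, C)

# toy usage with two placeholder classifiers
models = [torch.nn.Linear(10, 3), torch.nn.Linear(10, 3)]
x = torch.randn(4, 10)
print(ensemble_predict(models, x).shape)    # torch.Size([4, 3])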

Year of completion: April 2023
Advisor: Jayanthi Sivaswamy

Related Publications


Downloads

thesis

3D Interactive Solution for Neuroanatomy Education


Mythri V

Abstract

Typically, anatomy is taught through dissection, 2D images, presentations, and cross-sections. While these methods are convenient, they are non-interactive and fail to capture the spatial relationships and functional aspects of anatomy. New visualization technologies such as virtual reality and stereoscopic 3D can compensate for these impediments and provide better understanding while engaging the students. With recent advances in the industry, the methods to provide a 3D experience have become economical. In this thesis, we introduce a low-cost 3D-interactive anatomy system designed for an audience of typical medical college students in an anatomy class. The setup used to achieve 3D visualization is dual-projector polarization. While there are other ways to achieve 3D visualization, such as alternate-frame sequencing and virtual reality, this technique can target a large audience and requires minimal accessories, making it a low-cost solution for an immersive 3D experience. The 3D interactive neuroanatomy solution is an end-to-end framework capable of designing anatomy lessons and visualizing their 3D stereoscopic projection. To ensure superior comprehension by students, we accommodate each teacher's unique teaching approach by providing the ability to create their own lessons. We have created anatomy lessons based on the human brain, a vital organ with complex anatomy. Our aim is to help medical students understand the complexity of organ systems not just from an anatomical perspective but also from a radiological one. We use annotations on clinical case data such as MRI, MRA, etc., to create 3D models for anatomy visualization, incorporating clinical information and illustrating real cases. Annotations for structures of interest are done using manual, automatic, and semi-automatic segmentation methods. Manual delineation of structure boundaries is tedious and time-consuming, whereas automatic segmentation is quick and convenient. However, manual annotations were used in the 3D anatomy viewer for small and complex structures because the automatic segmentations were substandard, so there is a need to improve automatic segmentation performance for those structures. While segmentation is an essential step in 3D modeling, it also plays a critical role in diagnosing many neurological diseases associated with degradation in the sub-cortical region. Therefore, accurate algorithms are needed for sub-cortical structure segmentation. Variance in the size of structures is significant, which introduces a performance bias towards larger structures in many deep learning approaches. In this part of the thesis, we aim to remove size bias in sub-cortical structure segmentation. The proposed method addresses this problem with a pre-training step that learns tissue characteristics and an ROI extraction step that aids in focusing on the local context; operating on structure ROIs elevates the influence of smaller structures in the network.
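
As a small illustration of the ROI-based idea in the final part (focusing the network on a fixed-size patch around each structure so small sub-cortical structures are not dominated by large ones), here is a hypothetical NumPy helper; the volume size, centre coordinates, and function name are assumptions, not the thesis code.

import numpy as np

def extract_roi(volume, centre, size=(32, 32, 32)):
    # Crop a fixed-size patch around an (approximate) structure centre so that a
    # small sub-cortical structure occupies a comparable fraction of the network
    # input, reducing the size bias towards larger structures.
    slices = []
    for c, s, dim in zip(centre, size, volume.shape):
        start = int(np.clip(c - s // 2, 0, dim - s))
        slices.append(slice(start, start + s))
    return volume[tuple(slices)]

# toy usage: a fake brain volume and a rough centroid for a small structure
vol = np.random.rand(182, 218, 182).astype(np.float32)
roi = extract_roi(vol, centre=(90, 110, 80))
print(roi.shape)    # (32, 32, 32)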

Year of completion: April 2023
Advisor: Jayanthi Sivaswamy

Related Publications


Downloads

thesis

Towards Generalization in Multi-View Pedestrian Detection


Jeet Vora

Abstract

Detecting humans in images and videos has emerged as an essential aspect of intelligent video systems that address pedestrian detection, tracking, crowd counting, etc. It has many real-life applications, ranging from visual surveillance and sports to autonomous driving. Despite achieving high performance, single-camera detection methods are susceptible to occlusions caused by humans, which drastically degrade performance when crowd density is high. A multi-camera setup therefore becomes necessary: it incorporates multiple camera views and computes precise 3D locations that can be visualized in a top-view, or Bird's Eye View (BEV), representation, permitting better occlusion reasoning in crowded scenes. This thesis accordingly presents a multi-camera approach that globally aggregates the multi-view cues for detection and alleviates the impact of occlusions in a crowded environment. However, it was still largely unknown how well multi-view detectors generalize to unseen data. This becomes critical across different camera setups, because a practical multi-view detector should remain usable in scenarios such as: i) when a model trained with a few camera views is deployed and one of the cameras fails during testing/inference, or when more camera views are added to the existing setup; ii) when the camera positions change in the same environment; and finally iii) when the system is deployed in an unseen environment. An ideal multi-camera system should be adaptable to such changing conditions. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. We formalized three critical forms of generalization and outlined the experiments to evaluate them: generalization with i) a varying number of cameras, ii) varying camera positions, and finally, iii) to new scenes. We discover that existing state-of-the-art models show poor generalization by overfitting to a single scene and camera configuration. To address these concerns: (a) we generated a novel Generalized MVD (GMVD) dataset, assimilating diverse scenes with changing daytime, camera configurations, and varying numbers of cameras, and (b) we discuss the properties essential to bring generalization to MVD and developed a barebones model to incorporate them. We performed a series of experiments on the WildTrack, MultiViewX, and GMVD datasets to motivate the necessity of evaluating the generalization abilities of MVD methods and to demonstrate the efficacy of the developed approach.
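
To illustrate the multi-view aggregation idea (projecting each camera's features onto a common ground-plane grid before fusing them), here is a hedged PyTorch sketch; the homographies, grid size, and averaging fusion are placeholder assumptions, not the exact model from the thesis.

import torch
import torch.nn.functional as F

def warp_to_bev(feat, H, bev_hw=(120, 360)):
    # Warp one camera's feature map onto the ground plane using a (precomputed)
    # BEV-to-image homography H, so features from all views can be aggregated in
    # a shared Bird's Eye View grid.
    B, C, Hf, Wf = feat.shape
    gh, gw = bev_hw
    ys, xs = torch.meshgrid(torch.arange(gh, dtype=torch.float32),
                            torch.arange(gw, dtype=torch.float32), indexing="ij")
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
    img = (H @ grid.T).T                                   # BEV cells -> image coordinates
    img = img[:, :2] / img[:, 2:3].clamp(min=1e-6)
    img[:, 0] = img[:, 0] / (Wf - 1) * 2 - 1               # normalise for grid_sample
    img[:, 1] = img[:, 1] / (Hf - 1) * 2 - 1
    sample_grid = img.reshape(1, gh, gw, 2).expand(B, -1, -1, -1)
    return F.grid_sample(feat, sample_grid, align_corners=True)

# toy usage: two cameras' feature maps fused by averaging in the BEV grid
H1, H2 = torch.eye(3), torch.eye(3)                        # placeholder homographies
f1, f2 = torch.randn(1, 64, 90, 160), torch.randn(1, 64, 90, 160)
bev = (warp_to_bev(f1, H1) + warp_to_bev(f2, H2)) / 2
print(bev.shape)    # torch.Size([1, 64, 120, 360])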

Year of completion: April 2023
Advisor: Vineet Gandhi

Related Publications


Downloads

thesis
