
Does Audio help in deep Audio-Visual Saliency prediction models?


Ritvik Agrawal

Abstract

The task of saliency prediction focuses on understanding and modeling human visual attention (HVA), i.e., where and what people pay attention to given visual stimuli. Audio has ideal properties to aid gaze fixation while viewing a scene: there is substantial evidence of audio-visual interplay in human perception, and it is agreed that the two modalities jointly guide our visual attention. Learning computational models for saliency estimation is an effort to inch machines and robots closer to human cognitive abilities. Saliency prediction is helpful in many digital content-based applications such as automated editing, perceptual video coding, and human-robot interaction. The field has progressed from hand-crafted features to deep learning-based solutions. Static image saliency prediction is led by convolutional architectures, and these ideas were extended to videos by integrating temporal information using 3D convolutions or LSTMs. Many sophisticated multimodal, multi-stream architectures have been proposed to process multimodal information for saliency prediction. Although existing Audio-Visual Saliency Prediction (AVSP) models claim promising results from fusing the audio modality on top of visual-only models, most models in practice only consider visual cues and fail to leverage the auditory information that is ubiquitous in dynamic scenes. In this thesis, we investigate the relevance of audio cues in conjunction with visual ones and conduct an extensive analysis of why AVSP models appear superior, employing well-established audio modules and fusion techniques from diverse, correlated audio-visual tasks. Our analysis on ten diverse saliency datasets suggests that none of these methods effectively incorporates audio: augmenting audio features ends up learning a predictive model that is agnostic to audio. Furthermore, we bring to light why AVSP models show a performance gain over visual-only models even though the audio branch is agnostic at inference. Our experiments clearly indicate that the visual modality dominates learning and that current models largely ignore the audio information. The observation is consistent across three different audio backbones and four different fusion techniques, and it contrasts with previous methods, which claim audio to be a significant contributing factor. The performance gains are a byproduct of improved training, and the additional audio branch appears to have a regularizing effect; we show that similar gains are achieved when random audio is fed during training. Overall, our work questions the role of audio in current deep AVSP models and motivates the community to reconsider these complex architectures by demonstrating that simpler alternatives work equally well.
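The random-audio control described above can be reproduced in spirit with a very small experiment: if a fused audio-visual model matches a model trained with noise in place of the audio stream, the audio branch is acting only as a regularizer. Below is a minimal PyTorch-style sketch of such a late-fusion saliency head; the module and tensor names (LateFusionSaliencyHead, visual_feats, audio_feats) are hypothetical stand-ins for whatever backbones a particular AVSP model uses, not the architecture studied in the thesis.

    import torch
    import torch.nn as nn

    class LateFusionSaliencyHead(nn.Module):
        """Hypothetical late-fusion head: concatenates pooled audio features
        with per-pixel visual features and predicts a saliency map."""
        def __init__(self, vis_channels=256, aud_dim=128):
            super().__init__()
            self.fuse = nn.Conv2d(vis_channels + aud_dim, 256, kernel_size=1)
            self.predict = nn.Conv2d(256, 1, kernel_size=1)

        def forward(self, visual_feats, audio_feats):
            # visual_feats: (B, C, H, W), audio_feats: (B, D)
            b, _, h, w = visual_feats.shape
            aud = audio_feats[:, :, None, None].expand(-1, -1, h, w)
            x = torch.relu(self.fuse(torch.cat([visual_feats, aud], dim=1)))
            return torch.sigmoid(self.predict(x))

    # Random-audio control: replace real audio features with noise of the same
    # shape during training; comparable saliency scores suggest the model is
    # effectively audio-agnostic.
    head = LateFusionSaliencyHead()
    visual_feats = torch.randn(2, 256, 28, 28)   # dummy visual features
    real_audio = torch.randn(2, 128)             # stand-in for an audio backbone
    random_audio = torch.randn_like(real_audio)  # control condition
    out_real = head(visual_feats, real_audio)
    out_rand = head(visual_feats, random_audio)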

Year of completion: April 2023
Advisor: Vineet Gandhi

Related Publications

Downloads

thesis

Development of Annotation Guidelines, Datasets and Deep Networks for Palm Leaf Manuscript Layout Understanding


Sowmya Aitha

Abstract

Ancient paper documents and palm leaf manuscripts from the Indian subcontinent have made a significant contribution to the world's literary and cultural heritage. These documents often have complex, uneven, and irregular layouts. Digitizing and deciphering the content of these documents without human intervention poses difficulties across a broad range of factors, including language, script, layout, layout elements, their position, and the number of manuscripts per image. Large-scale annotated Indic manuscript image datasets are needed for this kind of research. To meet this objective, we present Indiscapes, the first dataset containing multi-regional layout annotations for ancient Indian manuscripts. We also adapt a fully convolutional deep neural network architecture for fully automatic, instance-level spatial layout parsing of manuscript images, in order to deal with challenges such as the presence of dense, irregular layout elements, pictures, multiple documents per image, and a wide variety of scripts. We demonstrate the effectiveness of the proposed architecture on images from the Indiscapes dataset. Despite these advances, semantic layout segmentation with typical deep network methods is not robust to the complex deformations observed across semantic regions. This problem is particularly evident in the low-resource domain of Indian palm-leaf manuscripts. Therefore, we present Indiscapes2, a new, expansive dataset of various Indic manuscripts with semantic layout annotations, to help address the issue. Indiscapes2 is 150% larger than Indiscapes and contains material from four different historical collections. In addition, we propose a novel deep network called Palmira for reliable, deformation-aware region segmentation in handwritten manuscripts. As an additional performance metric, we report the boundary-centric Hausdorff distance and its variants. Our experiments show that Palmira produces reliable layouts and outperforms both strong baselines and ablative variants. We also present results on Arabic, South-East Asian, and Hebrew historical manuscripts to showcase the generalization capability of Palmira. Even with reliable deep-network approaches for understanding manuscript layout, these models implicitly assume one or two manuscripts per image, whereas in real-world scenarios multiple manuscripts are often scanned together to maximise scanner surface area and reduce manual labour. Isolating (segmenting) each individual manuscript within a scanned image on a per-instance basis therefore becomes the first essential step in understanding manuscript content, creating the need for a precursor system that extracts individual manuscripts before downstream processing. The highly curved and deformed boundaries of manuscripts, which frequently cause them to overlap with each other, add further complexity to this task. To address these issues, we introduce a new document image dataset named IMMI (Indic Multi Manuscript Images). We also present a method that generates synthetic images to augment the sourced non-synthetic images, boosting the utility of the dataset and facilitating deep network training. Our experiments use adapted versions of current document instance segmentation frameworks, and the results demonstrate their efficacy for the task. Overall, our contributions enable robust extraction of individual historical manuscript pages. This, in turn, could enable better performance on downstream tasks such as region-level instance segmentation, optical character recognition, and word-spotting in historical Indic manuscripts at scale.
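The boundary-centric Hausdorff measure mentioned above can be computed directly from predicted and ground-truth region boundaries. A minimal sketch using SciPy is shown below; the point-set shapes and the toy boundaries are assumptions for illustration, not the exact evaluation code used for Palmira.

    import numpy as np
    from scipy.spatial.distance import directed_hausdorff

    def symmetric_hausdorff(pred_boundary, gt_boundary):
        """Symmetric Hausdorff distance between two (N, 2) arrays of
        boundary points given as (row, col) coordinates."""
        d_pg = directed_hausdorff(pred_boundary, gt_boundary)[0]
        d_gp = directed_hausdorff(gt_boundary, pred_boundary)[0]
        return max(d_pg, d_gp)

    # Toy example with hypothetical boundary point sets.
    pred = np.array([[0, 0], [0, 10], [10, 10], [10, 0]])
    gt   = np.array([[1, 1], [1, 11], [11, 11], [11, 1]])
    print(symmetric_hausdorff(pred, gt))  # small value -> boundaries agree closely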

Year of completion: May 2023
Advisor: Ravi Kiran Sarvadevabhatla

Related Publications

Downloads

thesis

Situation Recognition for Holistic Video Understanding


Zeeshan Khan

Abstract

Video is a complex modality consisting of multiple events, complex actions, humans, objects, and their interactions densely entangled over time. Understanding videos has been a core and one of the most challenging problems in computer vision and machine learning. What makes it even harder is the lack of a structured formulation of the task, especially for long videos consisting of multiple events and diverse scenes. Prior works in video understanding have addressed the problem only in a sparse and uni-dimensional way, for example action recognition, spatio-temporal grounding, question answering, and free-form captioning. However, holistic understanding is needed to fully capture all the events, actions, and relations between entities, and to represent any natural scene in the highest detail and the most faithful way. It requires answering several questions, such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) through semantic role labeling has been framed as a task for structured prediction of multiple events, their relationships, and actions, with various verb-role pairs attached to descriptive entities. This is one of the densest video understanding tasks, posing several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs; it also faces evaluation challenges because roles are represented as free-form captions. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, without requiring ground-truth bounding boxes. Since evaluating free-form captions can be difficult and imprecise, this not only improves the current formulation and the evaluation setup, but also improves the interpretability of the model's decisions, because grounding allows us to visualise where the model is looking while generating a caption. To this end, we present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. In the second stage, verb-role queries attend to and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions describing each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localise verb-roles without grounding annotations at training time.
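The second stage described above, where verb-role queries pool information from object embeddings, is essentially a cross-attention operation. The sketch below illustrates the idea with PyTorch's nn.MultiheadAttention; the dimensions and the names role_queries and object_embeddings are hypothetical and do not reproduce the actual VideoWhisperer implementation.

    import torch
    import torch.nn as nn

    embed_dim, num_heads = 256, 4
    cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    # Hypothetical inputs: 5 verb-role queries attending over 20 object embeddings.
    role_queries = torch.randn(1, 5, embed_dim)        # (batch, num_roles, dim)
    object_embeddings = torch.randn(1, 20, embed_dim)  # (batch, num_objects, dim)

    # Each role query pools from the objects; the attention weights over objects
    # act as a weak, on-the-fly grounding signal for that role.
    pooled, attn_weights = cross_attn(role_queries, object_embeddings, object_embeddings)
    print(pooled.shape)        # torch.Size([1, 5, 256])
    print(attn_weights.shape)  # torch.Size([1, 5, 20])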

Year of completion: May 2023
Advisors: C V Jawahar, Makarand Tapaswi

Related Publications

Downloads

thesis

Computer Vision based Large Scale Urban Mobility Audit and Parametric Road Scene Parsing


Durga Nagendra Raghava Kumar Modhugu

Abstract

The footprint of partially or fully autonomous vehicles is increasing gradually with time, and the existence and availability of the necessary modern infrastructure are crucial for the widespread use of autonomous navigation. One of the most critical efforts in this direction is to build and maintain HD maps efficiently and accurately. The information in HD maps is organized in various levels: 1) the geometric layer, 2) the semantic layer, and 3) the map priors layer. Conventional approaches to capturing and extracting information at the different HD map levels rely heavily on huge sensor networks and manual annotation, which does not scale to creating HD maps for massive road networks. In this work, we propose two novel solutions to address these problems. The first deals with the generation of the geometric layer with parametric information of the road scene, and the second updates information on road infrastructure and traffic violations in the semantic layer. Firstly, creating the geometric layer of an HD map requires understanding the road layout in terms of structure, number of lanes, lane width, curvature, etc. Predicting these attributes as part of a generalizable parametric model, from which the road layout can be rendered, suits the creation of a geometric layer. Many previous works on this problem rely only on ground imagery and are limited by the narrow field of view of the camera, occlusions, and perspective shortening. This work demonstrates the effectiveness of using aerial imagery as an additional modality to overcome these challenges. We propose a novel architecture, Unified, that combines aerial and ground imagery features to infer scene attributes. We evaluate quantitatively on the KITTI dataset and show that our Unified model outperforms prior works. Since this dataset is limited to road scenes close to the vehicle, we supplement the publicly available Argoverse dataset with scene attribute annotations and evaluate far-away scenes. We quantitatively and qualitatively show the importance of aerial imagery in understanding road scenes, especially in regions farther away from the ego-vehicle. Finally, we also propose a simple mobile imaging setup to address and audit several common problems in urban mobility and road safety, which can enrich the information in the semantic layer of HD maps. Recent computer vision techniques are used to identify street irregularities (including missing lane markings and potholes), the absence of street lights, and defective traffic signs using videos obtained from a moving camera-mounted vehicle. Beyond the inspection of static road infrastructure, we also demonstrate the applicability of mobile imaging solutions to spot traffic violations. We validate our proposal on long stretches of unconstrained road scenes covering over 2000 km and discuss practical challenges in applying computer vision techniques at such a scale. An exhaustive evaluation is carried out on 257 long stretches with unconstrained settings and 20 condition-based hierarchical frame-level labels covering different timings, weather conditions, road types, traffic density, and states of road damage. For the first time, we demonstrate that large-scale analytics of irregular road infrastructure is feasible with existing computer vision techniques.
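As an illustration of what a parametric road-layout representation might look like, the sketch below renders lane-boundary polylines from a handful of scene attributes (number of lanes, lane width, curvature). The attribute set and the circular-arc rendering are simplified assumptions for illustration and are not the parameterization used by the Unified model.

    import numpy as np

    def render_lane_boundaries(num_lanes=2, lane_width=3.5, curvature=0.01,
                               length=50.0, step=1.0):
        """Return lane-boundary polylines for a simple circular-arc road model.
        curvature is 1/radius in 1/m; 0 gives a straight road."""
        s = np.arange(0.0, length + step, step)          # arc-length samples
        if abs(curvature) < 1e-9:
            centerline = np.stack([s, np.zeros_like(s)], axis=1)
            normals = np.tile([0.0, 1.0], (len(s), 1))
        else:
            theta = curvature * s
            r = 1.0 / curvature
            centerline = np.stack([r * np.sin(theta), r * (1 - np.cos(theta))], axis=1)
            normals = np.stack([-np.sin(theta), np.cos(theta)], axis=1)
        # Offset the centerline to get each lane boundary.
        offsets = (np.arange(num_lanes + 1) - num_lanes / 2.0) * lane_width
        return [centerline + off * normals for off in offsets]

    boundaries = render_lane_boundaries(num_lanes=3, lane_width=3.5, curvature=0.02)
    print(len(boundaries), boundaries[0].shape)  # 4 boundaries, each (51, 2)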

Year of completion: December 2022
Advisor: C V Jawahar

Related Publications

Downloads

thesis

Weakly supervised explanation generation for computer aided diagnostic systems


Aniket Joshi

Abstract

Computer Aided Diagnosis (CAD) systems are developed to aid doctors and clinicians in diagnosis after interpreting and examining a medical image, and they help perform the task more consistently. With the arrival of the data-driven deep learning paradigm and the availability of large amounts of data in the medical domain, CAD systems are being developed to diagnose a large variety of diseases, ranging from different types of cancer to heart and brain diseases, Alzheimer's disease, and diabetic retinopathy. These systems are highly competent at the task on which they are trained. Although they perform on par with trained clinicians, they suffer from the limitation that they are completely black-box in nature and are trained only on image-level class labels. This poses a problem in deploying CAD systems as standalone solutions for disease diagnosis, because decisions in the medical domain concern the health of a patient and must be well reasoned and backed by evidence, sometimes from multiple modalities. Hence, there is a critical need for CAD systems' decisions to be explainable. Restricting our focus to the image modality alone, one way to design explainable CAD systems is to train the system using both class labels and local annotations and to derive the explanation in a fully supervised manner. However, obtaining these local annotations is very expensive, time-consuming, and infeasible in most circumstances. In this thesis, we address this explainability and data-scarcity problem and propose two different approaches towards the development of weakly supervised explainable CAD systems. First, we aim to explain the classification decision by providing heatmaps denoting the regions of interest in the image that helped the model make its prediction. In order to generate anatomically accurate heatmaps, we provide a mixed set of annotations to our model: class labels for the entire training set of images and rough localization of suspect regions for a smaller subset of the training images. The proposed approach is illustrated on two different disease classification tasks based on disparate image modalities: diabetic macular edema (DME) classification from OCT slices and breast cancer detection from mammographic images. Good classification results are shown on public datasets, supplemented by explanations in the form of suspect regions; these are derived using local annotations for just a third of the images, emphasizing the potential generalisability of the proposed solution.
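The mixed-supervision idea above, class labels for every image but rough region annotations for only a subset, can be expressed as a composite loss in which the localization term is applied only when an annotation is available. The sketch below, with the hypothetical names heatmap_logits and region_mask, illustrates this weighting scheme in PyTorch and is not the exact objective used in the thesis.

    import torch
    import torch.nn.functional as F

    def mixed_supervision_loss(class_logits, labels, heatmap_logits,
                               region_mask, has_mask, alpha=0.5):
        """Classification loss on all images plus a heatmap loss on the
        subset that carries rough region annotations.

        class_logits:   (B, num_classes)   heatmap_logits: (B, 1, H, W)
        labels:         (B,)               region_mask:    (B, 1, H, W) rough suspect regions
        has_mask:       (B,) bool, True where a local annotation exists
        """
        cls_loss = F.cross_entropy(class_logits, labels)
        if has_mask.any():
            # Supervise the heatmap only on annotated images.
            loc_loss = F.binary_cross_entropy_with_logits(
                heatmap_logits[has_mask], region_mask[has_mask])
        else:
            loc_loss = class_logits.new_zeros(())
        return cls_loss + alpha * loc_loss

    # Toy batch: 4 images, only the first 2 have rough region annotations.
    class_logits = torch.randn(4, 3)
    labels = torch.tensor([0, 2, 1, 1])
    heatmap_logits = torch.randn(4, 1, 32, 32)
    region_mask = (torch.rand(4, 1, 32, 32) > 0.8).float()
    has_mask = torch.tensor([True, True, False, False])
    print(mixed_supervision_loss(class_logits, labels, heatmap_logits, region_mask, has_mask))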

Year of completion: November 2021
Advisor: Jayanthi Sivaswamy

Related Publications

Downloads

thesis
