Towards Understanding Deep Saliency Prediction


Navyasri M

Abstract

Learning computational models for visual attention (saliency estimation) is an effort to inch machines/robots closer to human visual cognitive abilities. Data-driven efforts have dominated the landscape since the introduction of deep neural network architectures. In deep learning research, the choices in architecture design are often empirical and frequently lead to models that are more complex than necessary, which in turn hinders application requirements. In this work, we identify four key components of saliency models: input features, multi-level integration, readout architecture, and loss functions. We review existing state-of-the-art models along these four components and propose novel and simpler alternatives. As a result, we propose two novel end-to-end architectures, SimpleNet and MDNSal, which are neater, more minimal, and more interpretable, and which achieve state-of-the-art performance on public saliency benchmarks. SimpleNet is an optimized encoder-decoder architecture and brings notable performance gains on the SALICON dataset (the largest saliency benchmark). MDNSal is a parametric model that directly predicts the parameters of a Gaussian mixture model (GMM) and aims to bring more interpretability to the prediction maps. The proposed saliency models run inference at 25 fps, making them suitable for real-time applications. We also explore improving saliency prediction in videos by combining the image saliency models with existing work.
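To make the parametric readout concrete, the following sketch shows one way a mixture-density head could predict the weights, means, and diagonal variances of a 2D Gaussian mixture from pooled encoder features and render them on a pixel grid as a saliency map. This is a minimal PyTorch illustration under assumed layer sizes, feature dimensions, and number of components; it is not the MDNSal code itself.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GMMSaliencyHead(nn.Module):
    """Minimal sketch of a parametric saliency readout: predicts K 2D Gaussians
    (mixture weights, means, diagonal variances) from a pooled feature vector
    and renders the mixture on a pixel grid as a saliency map."""

    def __init__(self, feat_dim=2048, num_components=16):
        super().__init__()
        self.K = num_components
        # 5 parameters per component: weight logit, mu_x, mu_y, raw sigma_x, raw sigma_y
        self.fc = nn.Linear(feat_dim, num_components * 5)

    def forward(self, feats, out_h=60, out_w=80):
        B = feats.size(0)
        params = self.fc(feats).view(B, self.K, 5)
        weights = F.softmax(params[..., 0], dim=-1)              # (B, K)
        mu = torch.sigmoid(params[..., 1:3])                     # (B, K, 2), normalized coords
        sigma = F.softplus(params[..., 3:5]) + 1e-3              # (B, K, 2), kept positive

        # Normalized pixel grid in [0, 1] x [0, 1]
        ys = torch.linspace(0, 1, out_h, device=feats.device)
        xs = torch.linspace(0, 1, out_w, device=feats.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1)                     # (H, W, 2)

        # Evaluate every diagonal Gaussian on the grid and mix
        diff = grid[None, None] - mu[:, :, None, None, :]        # (B, K, H, W, 2)
        exponent = -0.5 * (diff / sigma[:, :, None, None, :]).pow(2).sum(-1)
        norm = (2 * math.pi * sigma.prod(-1))[:, :, None, None]  # (B, K, 1, 1)
        components = torch.exp(exponent) / norm
        saliency = (weights[:, :, None, None] * components).sum(dim=1)  # (B, H, W)
        return saliency / saliency.amax(dim=(-2, -1), keepdim=True).clamp_min(1e-8)


# Hypothetical usage with pooled backbone features (e.g. from a frozen encoder):
# feats = torch.randn(2, 2048)
# smap = GMMSaliencyHead()(feats)   # (2, 60, 80) saliency maps
```

In contrast, an encoder-decoder model in the spirit of SimpleNet would keep a convolutional decoder and predict a dense map directly; the parametric form trades some spatial flexibility for interpretability, since each mixture component is an explicit locus of attention.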

Year of completion: May 2021
Advisor: Vineet Gandhi


Retinal Image Synthesis


Anurag Anil Deshmukh

Abstract

Medical imaging has been aiding the diagnosis and treatment of diseases by creating visual representations of the interior of the human body. Experts hand-mark these images for abnormalities and diagnosis. Supplementing experts with these rich visualizations has enabled detailed clinical analysis and rapid medical intervention. However, deep learning-based methods rely on abundantly large volumes of data for training. Procuring data for medical imaging applications is especially difficult because abnormal cases are, by definition, rare and the data in general requires experts for labelling. With deep learning algorithms, data with high class imbalance or insufficient variability leads to poor classification performance. Thus, alternate approaches such as generative modelling to artificially generate more data have been of interest. Most of these methods are GAN [11] based approaches. While they can help with data imbalance, they still require a lot of data to be able to generate realistic images. Additionally, many of these methods have been shown to work on natural images, where the images are relatively noise-free and smaller artifacts are not as damaging. Thus, this thesis aims at providing synthesis methods which overcome the limitations of small datasets and noisy image profiles. We do this for two different modalities, fundus imaging and Optical Coherence Tomography (OCT). First, we present a fundus image synthesis method aimed at providing paired optic cup and image data for Optic Cup (OC) segmentation. The synthesis method works well on small datasets by minimising the information to be learnt, leveraging domain-specific knowledge and providing most of the structural information to the network. We demonstrate this method's advantages over a more direct synthesis method and show how leveraging domain-specific knowledge can provide higher quality images and annotations. Including these generated images and their annotations in the training of an OC segmentation model showed a significant improvement in performance, demonstrating their reliability. Second, we present a novel unpaired image-to-image translation method which can introduce abnormality (drusen) into OCT images while avoiding artifacts and preserving the noise profile. Comparison with other state-of-the-art image-to-image translation methods shows that our method is significantly better at preserving the noise profile and at generating morphologically accurate structures.
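As a rough illustration of handing the structural information to the network, the sketch below conditions a small encoder-decoder generator on a structural map (e.g. a vessel map plus optic cup/disc masks) so that the network mainly has to learn appearance; the conditioning mask then doubles as a free segmentation annotation for the synthetic image. This is a hypothetical PyTorch sketch with assumed channel counts and inputs, not the thesis code; in a GAN-based setup such a generator would additionally be trained against a discriminator.

```python
import torch
import torch.nn as nn


class StructureConditionedGenerator(nn.Module):
    """Toy generator sketch: maps a structural map (vessel map + OC/OD masks)
    to a synthetic fundus image, so the anatomy itself never has to be learned.
    The (structure, image) pair can then supervise an OC segmentation model."""

    def __init__(self, in_ch=3, out_ch=3, base=32):
        super().__init__()

        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

        self.enc1 = block(in_ch, base)
        self.enc2 = block(base, base * 2)
        self.down = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec1 = block(base * 2 + base, base)
        self.out = nn.Conv2d(base, out_ch, 1)

    def forward(self, structure):
        e1 = self.enc1(structure)                             # full-resolution structural features
        e2 = self.enc2(self.down(e1))                         # coarser appearance features
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))   # skip connection preserves structure
        return torch.tanh(self.out(d1))                       # synthetic fundus image in [-1, 1]


# Hypothetical usage: channels hold [vessel map, OC mask, OD mask].
# structure = torch.rand(1, 3, 256, 256)
# fake_fundus = StructureConditionedGenerator()(structure)
```

The design choice mirrors the point made above: because the structure is supplied as input rather than learned, the generator has far less to learn, which is what makes small datasets workable.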

Year of completion: April 2021
Advisor: Jayanthi Sivaswamy


Human head pose and emotion analysis


Aryaman Gupta

Abstract

Scene analysis has been a topic of great interest in computer vision. Humans are the most important and most complex subjects involved in scene analysis. Humans exhibit different forms of expression and behaviour with their environment. These interactions have been studied for a long time, and various challenges and tasks have been identified to interpret them. We focus on two tasks in particular: head pose estimation and emotion recognition. Head poses are an important means of non-verbal human communication and thus a crucial element in understanding human interaction with the environment. Head pose estimation allows a robot to estimate the region of focus of attention for an individual. It requires learning a model that computes the intrinsic Euler angles for pose (yaw, pitch, roll) from an input image of the human face. Annotating ground-truth head pose angles for images in the wild is difficult and requires ad-hoc fitting procedures (which provide only coarse and approximate annotations). This highlights the need for approaches which can train on data captured in a controlled environment and generalize to images in the wild (with varying appearance and illumination of the face). Most present-day deep learning approaches which learn a regression function directly on the input images fail to do so. To this end, we propose to use a higher-level representation to regress the head pose while using deep learning architectures. More specifically, we use uncertainty maps in the form of 2D soft localization heatmap images over five facial keypoints, namely left ear, right ear, left eye, right eye and nose, and pass them through a convolutional neural network to regress the head pose. We show head pose estimation results on two challenging benchmarks, BIWI and AFLW, and our approach surpasses the state of the art on both datasets. We also propose a synthetically generated dataset for head pose estimation. Emotions are fundamental to human lives and decision-making. Human emotion detection can be helpful in understanding human mood, intent or choice of action. Recognizing emotions accurately from images or video is not easy even for humans, and for machines it is even more challenging, as humans express their emotions in different forms and there is a lack of temporal boundaries among emotions. Facial expression recognition has remained a challenging and interesting problem in computer vision. Despite efforts made in developing various methods for facial expression recognition, existing approaches lack generalizability when applied to unseen images or those captured in the wild (i.e., the results are not significant). We propose the use of facial action units' soft localization heatmap images for facial expression recognition. To account for the lack of a large, well-labelled dataset, we propose a method for automated spectrogram annotation, in which two modalities used by humans to express emotion (visual and textual) are used to label another modality (speech) for emotion recognition.
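To make the intermediate representation concrete, the sketch below renders 2D soft localization heatmaps (one Gaussian blob per keypoint: left ear, right ear, left eye, right eye, nose) and feeds the 5-channel stack to a small CNN that regresses yaw, pitch and roll. It is a minimal PyTorch illustration with assumed heatmap sizes and network widths, not the architecture used in the thesis.

```python
import torch
import torch.nn as nn


def keypoint_heatmaps(keypoints, size=64, sigma=2.0):
    """Render a (5, size, size) stack of Gaussian heatmaps from normalized
    keypoint coordinates in [0, 1]: left ear, right ear, left eye, right eye, nose."""
    ys = torch.arange(size).float()
    xs = torch.arange(size).float()
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    maps = []
    for (x, y) in keypoints:
        d2 = (gx - x * (size - 1)) ** 2 + (gy - y * (size - 1)) ** 2
        maps.append(torch.exp(-d2 / (2 * sigma ** 2)))
    return torch.stack(maps)                          # (5, size, size)


class HeatmapPoseNet(nn.Module):
    """Small CNN regressing (yaw, pitch, roll) from keypoint heatmaps."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(5, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 3)                  # Euler angles

    def forward(self, heatmaps):
        return self.head(self.features(heatmaps).flatten(1))


# Hypothetical usage: five normalized keypoints from any face/keypoint detector.
# hm = keypoint_heatmaps([(0.2, 0.5), (0.8, 0.5), (0.35, 0.4), (0.65, 0.4), (0.5, 0.6)])
# angles = HeatmapPoseNet()(hm.unsqueeze(0))          # (1, 3): yaw, pitch, roll
```

The appeal of this representation, as argued above, is that the heatmaps abstract away appearance and illumination, so a model trained on controlled-environment data has a better chance of generalizing to images in the wild.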

Year of completion: March 2021
Advisor: Vineet Gandhi


Leveraging Structural Cues for Better Training and Deployment in Computer Vision


Shyamgopal Karthik

Abstract

With the growing use of computer vision tools in wide-ranging applications, it becomes imperative to understand and resolve issues in computer vision models when they are used in production settings. In particular, it is essential to understand that the model can be wrong quite frequently during deployment. Developing a better understanding of the mistakes made by a model can help mitigate and handle them without catastrophic consequences. To investigate the severity of mistakes, we first explore a simple classification setting. Even in this setting, the severity of mistakes is difficult to quantify, especially since manually defining pairwise costs does not scale well for large-scale classification datasets. Therefore, most works have used class taxonomies/hierarchies, which allow pairwise costs to be defined using graph distances. There has been increasing interest in building deep hierarchy-aware classifiers, aiming to quantify and reduce the severity of mistakes and not just count the number of errors. However, most of these works require the hierarchy to be available during training and cannot adapt to new hierarchies, or even small modifications to the existing hierarchy, without re-training the model. We explore a different direction for hierarchy-aware classification: amending mistakes by making post-hoc corrections through classical Conditional Risk Minimization (CRM). Surprisingly, we find that this method is a far more suitable alternative than the works on deep hierarchy-aware classification; CRM preserves the base model's top-1 accuracy, brings the most likely predictions of the model closer to the ground truth, and is able to provide reliable probability estimates, unlike hierarchy-aware classifiers. We firmly believe that this serves as a very strong and useful baseline for future exploration in this direction. We then turn our attention to a crucial problem in many video processing pipelines: visual (single) object tracking. In particular, we explore the long-term tracking scenario, where, given a target in the first frame of the video, the goal is to track the object throughout a (long) video during which the object may undergo occlusion, vary in appearance, or go out of view. The temporal aspect of videos also makes this an ideal scenario to understand the accumulation of errors that would not otherwise be seen if every image were independent. We hypothesize that there are three crucial abilities that a tracker must possess to be effective in the long-term tracking scenario, namely Re-Detection, Recovery, and Reliability. The tracker must be able to re-detect the target when it goes out of the scene and returns, must recover from failure, and must track an object contiguously to be of practical utility. We propose a set of novel and comprehensive experiments to understand each of these aspects, giving a thorough understanding of the strengths and limitations of various state-of-the-art tracking algorithms. We finally visit the problem of multi-object tracking. Unlike single-object tracking, where the target is initialized in the first frame, the goal here is to track all objects of a particular category (such as pedestrians, vehicles, animals, etc.). Since this problem does not require user initialization, it has found use in wide-ranging real-time applications such as autonomous driving. The typical multi-object tracking pipeline follows the tracking-by-detection paradigm, i.e., an object detector is first used to detect all the objects in the scene, and these detections are then linked together to form the final trajectories using a combination of spatio-temporal features and appearance/Re-Identification (ReID) features. The appearance features are extracted using a Convolutional Neural Network (CNN) trained on a corpus of labelled videos. Our central insight is that only the appearance model requires labelled videos in the entire pipeline, while the rest of the pipeline can be trained with just image-level supervision. Inspired by the recent successes in unsupervised contrastive learning, which enforces similarity in feature space between an image and its augmented version, we resort to a simple method that leverages the spatio-temporal consistency in videos to generate "natural" augmentations, which are then used as pseudo-labels to train the appearance model. When integrated into the overall tracking pipeline, we find that this unsupervised appearance model can match the performance of its supervised counterparts in reducing the identity switches present in the trajectories, thereby saving costly video annotations, which are impractical to scale up, without sacrificing performance.
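The post-hoc correction via CRM can be stated in a few lines: keep the trained classifier untouched, and at test time pick the class that minimizes the expected cost under the model's posterior, where the pairwise costs come from graph distances in the hierarchy. The sketch below illustrates that decision rule with a made-up three-class hierarchy and made-up probabilities; it is only a toy illustration of the mechanism, not the evaluation protocol used in the thesis.

```python
import numpy as np


def crm_predict(probs, cost):
    """Conditional Risk Minimization: choose the label with minimum expected cost.

    probs: (N, C) softmax probabilities from the unchanged base classifier.
    cost:  (C, C) matrix where cost[i, j] is the penalty for predicting j when
           the ground truth is i, e.g. a graph distance in the class hierarchy
           (so cost[i, i] = 0).
    """
    risk = probs @ cost          # (N, C): expected cost of every possible prediction
    return risk.argmin(axis=1), risk


# Toy hierarchy of 3 classes: 0 and 1 are siblings, 2 is far from both.
cost = np.array([[0., 1., 4.],
                 [1., 0., 4.],
                 [4., 4., 0.]])
probs = np.array([[0.40, 0.15, 0.45]])   # the raw argmax would pick class 2
pred, risk = crm_predict(probs, cost)
print(risk)   # expected costs: 1.95, 2.20, 2.20
print(pred)   # [0]: the label with the lowest expected severity is chosen
```

Because the correction operates purely on the output probabilities, swapping in a new hierarchy only means recomputing the cost matrix; no re-training is needed, which is the flexibility highlighted above.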

Year of completion: April 2021
Advisor: Vineet Gandhi


Super-resolution of Digital Elevation Models With Deep Learning Solutions


Kubade Ashish Ashokrao

Abstract

Terrain, representing the features of the earth's surface, plays a crucial role in many applications such as simulations, hazard prevention and mitigation planning, route planning, analysis of surface dynamics, computer graphics-based games, entertainment and films, to name a few. With recent advancements in digital technology, these applications demand the presence of high-resolution detail in the terrain. However, currently available public datasets, which provide terrain scans in the form of Digital Elevation Models (DEMs), have low resolution compared with the terrain information available in other modalities such as aerial images. Publicly available DEM datasets for most parts of the world have a resolution of 30 m, whereas aerial or satellite images are available at a resolution of 50 cm. The cost involved in capturing such high-resolution DEMs (HRDEMs) turns out to be a major hurdle in making such high-resolution data available in the public domain. This motivates us to provide a software solution for generating high-resolution DEMs from existing low-resolution DEMs (LRDEMs). In the natural image domain, super-resolution has set higher benchmarks by incorporating deep learning-based solutions. Despite such tremendous success in the image super-resolution task, very few works have used these powerful systems on DEMs to generate HRDEMs. A few of them used additional modalities, such as aerial or satellite images or temporal sequences of DEMs, to generate high-resolution terrains. However, the applicability of these methods is highly subject to the available input formats. In this research effort, we explore a new direction in DEM super-resolution by using feedback neural networks. Leveraging the capability of feedback neural networks to refine the features learned by shallow layers of the network, we design DSRFB, a DEM super-resolution architecture that generates high-resolution DEMs with a super-resolution factor of 8X from minimal input. Our experiments on the Pyrenees and Tyrol mountain range datasets show that DSRFB can perform close to the state of the art without using information from any additional modalities like aerial images. Further, we study the limitations of DSRFB, which primarily occur for highly degraded low-resolution inputs, where the major structures are entirely lost and reconstruction becomes challenging. In such cases, availing elevation cues from alternate sources of information becomes necessary. To utilize such information from other modalities, we adopt the attention mechanism from the natural language processing (NLP) domain. We integrate the attention mechanism into the feedback network to present the Attentional Feedback Module (AFM). Our proposed network, the Attentional Feedback Network (AFN) with AFM as a backbone, outperforms state-of-the-art methods by a margin of up to 7.2%. We also emphasize the reconstruction of structures across patch boundaries: while generating an HRDEM by splitting large DEM tiles into patches, we propose to use overlapped tiles and generate an aggregated response to dilute the artefacts due to structural discontinuities. To summarize, in this research we propose two methods, DSRFB and AFN, to generate a high-resolution DEM from an existing low-resolution DEM. While DSRFB achieves near state-of-the-art performance, coupling DSRFB with the attention mechanism (i.e., AFN) outperforms state-of-the-art methods.
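The overlapped-tile aggregation mentioned above can be sketched as a sliding-window inference loop: split the large low-resolution DEM into overlapping patches, super-resolve each patch independently, and average the overlapping high-resolution predictions back into the full output so that patch-boundary seams are diluted. The snippet below is a generic PyTorch sketch with assumed patch, stride and scale values; it is not the DSRFB/AFN inference code.

```python
import torch


def tiled_super_resolution(dem, model, patch=64, stride=48, scale=8):
    """Super-resolve a large low-resolution DEM by running `model` on
    overlapping patches and averaging the overlapping outputs.

    dem:   (1, 1, H, W) low-resolution DEM tensor, with H, W >= patch.
    model: callable mapping (1, 1, patch, patch) -> (1, 1, patch*scale, patch*scale).
    """
    _, _, H, W = dem.shape
    out = torch.zeros(1, 1, H * scale, W * scale)
    weight = torch.zeros_like(out)                   # how many predictions cover each pixel
    ys = list(range(0, H - patch + 1, stride))
    xs = list(range(0, W - patch + 1, stride))
    if ys[-1] != H - patch:                          # make sure the last patches reach the border
        ys.append(H - patch)
    if xs[-1] != W - patch:
        xs.append(W - patch)
    with torch.no_grad():
        for y in ys:
            for x in xs:
                sr = model(dem[:, :, y:y + patch, x:x + patch])
                yh, xh = y * scale, x * scale
                out[:, :, yh:yh + patch * scale, xh:xh + patch * scale] += sr
                weight[:, :, yh:yh + patch * scale, xh:xh + patch * scale] += 1
    return out / weight.clamp_min(1)                 # averaging dilutes seam artefacts
```

For example, with patch=64 and stride=48 the patches overlap by 16 low-resolution pixels, so at an 8X scale every pixel in the overlapping high-resolution region is predicted more than once and averaged.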

Year of completion: March 2021
Advisor: Avinash Sharma, K S Rajan
