Thesis Students

A study of Automatic Segmentation of 3-D Brain MRI and its application to Deformable Registration

Aabhas Majumdar

Abstract

Automated segmentation of cortical and non-cortical human brain structures has been hitherto approached using nonrigid registration followed by label fusion. We propose an alternative approach for this using a convolutional neural network (CNN) which classifies a voxel into one of many structures. Four different kinds of two-dimensional and three-dimensional intensity patches are extracted for each voxel, providing local and global (context) information to the CNN. The proposed approach is evaluated on five different publicly available datasets which differ in the number of labels per volume. The obtained mean Dice coefficient varied according to the number of labels, for example, it is 0.844 0.031 and 0.743 0.019 for datasets with the least (32) and the most (134) number of labels, respectively. These figures are marginally better or on par with those obtained with the current state-of-the-art methods on nearly all datasets, at a reduced computational time. The consistently good performance of the proposed method across datasets and no requirement for registration make it attractive for many applications where reduced computational time is necessary. Feature-based registration has been popular with a variety of features ranging from voxel intensity to Self-Similarity Context (SSC). In the second part of the thesis, we examine the question of how features learnt using various Deep Learning (DL) frameworks for segmentation can be used for deformable registration and whether this feature learning is necessary or not. We investigate the use of features learned by different DL methods in the current state-of-the-art discrete registration framework and analyze its performance on 2 publicly available datasets. We draw insights about the type of DL framework useful for feature learning. We consider the impact, if any, of the complexity of different DL models and brain parcellation methods on the performance of discrete registration. Our results indicate that the registration performance with DL features and SSC are comparable and stable across datasets whereas this does not hold for low level features.

Year of completion:	April 2022
Advisor :	Jayanthi Sivaswamy

Related Publications

Downloads

Saliency Estimation in Videos and Images

Samyak Jain

Abstract

With the growing data in terms of images and videos, it becomes imperative to derive a solution to filter out important/salient information in these data. Learning computational models for visual attention (saliency estimation) is an effort to inch machines/robots closer to human visual cognitive abilities. Developing such a solution can help reduce human effort and can be used in various applications like automatic cropping, segmentation etc. We approached this problem by investigating saliency estimation in images which is predicting the stimuli of the human visual system when exposed to an image. We start by identifying four key components of saliency models, i.e. , input features, multi-level integration, readout architecture, and loss functions. Data-driven efforts have dominated the landscape since the introduction of deep neural network architectures. In deep learning research, the choices in architecture design are often empirical and frequently lead to more complex models than necessary. The complexity, in turn, hinders the application requirements. We review other state-of-the-art models on these four components and propose two novel and simpler end-to-end architectures - SimpleNet and MDNSal. They are straightforward, neater, minimal, and more interpretable than other architectures achieving state-of-the-art performance on public saliency benchmarks. SimpleNet is an optimized encoder-decoder architecture. MDNSal is a parametric model that directly predicts parameters of a GMM distribution and aims to bring more interpretability to the prediction maps. Conclusively, we suggest that the way to move forward is not necessarily to design complex architectures but a modular analysis to optimize each component and possibly explore novel (and simpler) alternatives. After exploring these components, we shifted our focus to saliency estimation in videos where the stimuli of users are captured when exposed to a dynamic scenario. We propose ViNet architecture for the task of video saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time. ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). We also explore a variation of ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we also observe similar behavior in the previous state-of-the-art models for audio-visual saliency prediction. Our findings contrast with earlier works on deep learningbased audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio more effectively

Year of completion:	February 2022
Advisor :	Vineet Gandhi

Related Publications

Downloads

Fast and Accurate Image Recognition

Sri Aurobindo Munagala

Abstract

DNNs (Deep Neural Networks) have found use in a variety of applications recently, and have become much larger and resource-hungry over the years. CNNs (Convolutional Neural Networks) are widely used in Computer Vision tasks, which make use of the convolution operation in successive layers – making them computationally expensive to run. Due to the large size and computation required to run modern models, it is difficult to use them in resource-constrained scenarios, such as mobiles and embedded devices. Also, the amount of data created and collected is growing at a rapid rate, and annotating large amounts of data to use it for training DNNs is expensive. There are approaches which seek to intelligently query samples to train upon in an iterative fashion, such as Active Learning – but AL setups are themselves very costly to run due to full training of large models many times in the process. In this thesis, we explore methods to achieve extremely high speedups in both CNN model inference, and train-times for AL setups. Various paradigms to achieve fast CNN inference have been explored in the past, two major ones being binarization and pruning. Binarization involves the quantization of weights and/or inputs of the network from 32-bit full precision floats into a {-1,+1} space, with the aim of both achieving compression (as singular bits occupy 32 times less space than 32-bit floats) and speedups (as bit-bit operations can be done faster). Network pruning, on the other hand, tries to identify and remove redundant parts of the network in an unstructured (individual weights) or structured (channels/layers) manner to create sparse and efficient networks. While both these paradigms – binarization and pruning – have demonstrated great efficacy in achieving speedups and compression in CNNs, little work has been done in attempting to combine both approaches together. We argue that these paradigms are complementary, and can be combined to offer high levels of compression and speedup without any significant accuracy loss. Intuitively, weights/activations closer to zero have higher binarization error making them good candidates for pruning. We propose a novel Structured Ternary-Quantized Network that incorporates speedups from binary convolution algorithms through structured pruning, enabling the removal of pruned parts of the network entirely post-training. Our approach beats previous works attempting the same by a significant margin. Overall, our method brings up to 89x layer-wise compression over the corresponding full-precision networks – achieving only 0.33% loss on CIFAR-10 with ResNet-18 with a 40% PFR (Prune Factor Ratio for filters), and 0.3% on ImageNet with ResNet-18 with a 19% PFR. We also explore the field of AL (Active Learning), which is used in scenarios where unlabelled data is abundant, but annotation costs for that data make it infeasible to utilize all data for supervised training. AL methods initially train a model with a small pool of annotated data, and use the model’s (referred to as the selection model) predictions on the rest of the unlabelled data to form a query for annotating more samples. The training-query process iteratively happens till a sufficient amount of data is labelled. However, in practice, this sample selection practice in AL setups takes a lot of time as the selection model is fully re-trained on the labelled pool every iteration. We offer two improvements to the standard AL setup to bring down the overall train time required for sample selection significantly: the first, the introduction of a “memory” that enables us to train the selection model for fewer samples every round as opposed to the entire labelled dataset so far, and the second, the use of fast convergence techniques to reduce the number of epochs the selection model trains for. Our proposed improvements can work in tandem with previous techniques such as the use of proxy models for selection, and the combined improvements can bring more than 75x speedups in overall sample selection time to standard AL setups, making them feasible and easy to run in real-life scenarios where procuring large amounts of data is easy, but labelling them is difficult.

Year of completion:	January 2022
Advisor :	Anoop M Namboodiri

Related Publications

Downloads

Towards Understanding and Improving the Generalization Performance of Neural Networks

sarath sivaprasad

Abstract

The widespread popularity of over-parameterized deep neural networks (NNs) is backed by its ‘unreasonable’ performance on unseen data that is Independent, Identically Distributed with respect to the train data. The generalization of NNs cannot be reasoned with the traditional machine learning wisdom that the increase in the number of parameters leads to overfitting on train samples and subsequently reduced generalization. Various generalization measures have been proposed in recent times to explain the generalization property of deep learning networks. Despite some promising investigations, there is little consensus on how we can explain the generalization of NNs. Furthermore, the ability of neural networks to fit any train data for any random configuration of labels makes it more challenging to explain their generalization performance. Despite this ability to completely fit any given data, the neural network seems to be able to ‘cleverly’ learn a generalizing solution. We hypothesize that the ‘simple’ solution lies in a constrained subspace of the hypothesis space. We propose a constraint formulation of neural networks to close the generalization gap. We show that, through a principled constraint, we can achieve comparable train test performance for neural networks. We propose a way to constrain each output of the neural network as the convex combination of its inputs to ensure certain desirable geometry of the decision boundaries. This document covers two major aspects. Firstly, we show the improved generalization of neural networks using convex constraints. The second section goes beyond the IID setting and investigates the generalization of neural networks on the Out Of Distribution test sets.In the first section of the document, we investigate the constrained formulation of neural networks where the output is a convex function of the input. We show that the convexity constraints can be enforced on both fully connected and convolutional layers, making them applicable to most architectures. The convexity constraints include restricting the weights (for all but the first layer) to be non-negative and using a non-decreasing convex activation function. Albeit simple, these constraints have profound implications on the generalization abilities of the network. We draw three valuable insights: (a) Input Output Convex Neural Networks (IOC-NNs) self regularize and significantly reduce the problem of overfitting; (b) Although heavily constrained, they outperform the base multi-layer perceptrons and achieve similar performance as compared to base convolutional architectures and (c) IOCNNs show robustness to noise in train labels. We demonstrate the efficacy of the proposed idea using thorough experiments and ablation studies on six commonly used image classification datasets with three different neural network architectures.In the second section, we revisit the ability of networks to completely fit any given data and yet ‘cleverly’ learn the generalizing hypothesis from the large number of variances that can explain the train data. In accordance with concurrent findings, our explorations show that neural networks learn the most ‘low-lying’ variance in the data. They learn the features that easily correlate with the label and do no further exploration after finding such a solution. With this insight, we reinvent the need to understand the generalization of neural networks and improve them. We go beyond traditional IID and OOD evaluation benchmarks to further our understanding of learning in deep networks. Through our explorations, we give a possible explanation as to why neural networks can do well in certain benchmarks and why other inventive methods fail to give any consistent improvement over a simple neural network. Domain Generalization (DG) requires a model to learn a hypothesis from multiple distributions that generalizes to an unseen distribution. DG has been perceived as a front face of OOD generalization. We present empirical evidence to show that the primary reason for generalization in DG is the presence of multiple domains while training. Furthermore, we show that methods for generalization in IID are equally important for generalization in DG. Tailored methods fail to add performance gains in the Traditional DG (TDG) evaluation. Our experiments prompt if TDG has outlived its usefulness in evaluating OOD generalization? To further strengthen our investigation, we propose a novel evaluation strategy, ClassWise DG (CWDG), where for each class, we randomly select one of the domains and keep it aside for testing. We argue that this benchmarking is closer to human learning and relevant in real-world scenarios. Counter-intuitively, despite being exposed to all domains during training, CWDG is more challenging than TDG evaluation. While explaining the observations, our work makes a case for more fundamental analysis around the DG problem before exploring new ideas to tackle it Keywords – generalization of deep networks, constrained formulation, input-output-convex neural network, robust generalization bounds, explainable decision boundaries, mixture of experts

Year of completion:	May 2022
Advisor :	Vineet Gandhi

Related Publications

Downloads

Exploiting Cross-Modal Redundancy for Audio-Visual Generation

Sindhu B Hegde

Abstract

We interact with the world around us through multiple sensory streams of information such as audio, vision, and text (language). Each of these streams complement each other, but also contain redundant information, albeit in different forms. For example, the content of a person speaking can be captured by listening to the sounds in the speech, or partially understood by looking at the speaker’s lip movements, or by reading out the text transcribed from the vocal speech. This redundancy across modalities is utilized in human perceptual understanding that helps us to solve various practical problems. However, in the real-world, more often than not, information in individual streams is corrupted by various types of degradation like electronic transmission, background noise, and blurring which lead to deterioration in the content quality. In this work, we aim to recover the distorted signal in a given stream by exploiting the redundant information in another stream. Specifically, we deal with talking-face videos involving vision and speech signals. We propose two core ideas to explore cross-modal redundancy: (i) denoising speech using visual assistance, and (ii) upsampling very low-resolution talking-face videos using audio assistance. The first part focuses on the task of speech denoising. We show that the visual stream helps in distilling the clean speech from the corrupted signal by suppressing the background noise. We identify the key issues prevailing in the existing state-of-the-art speech enhancement works: (i) most of the current works use only the audio stream and are limited in their performance in a wide range of realworld noises, and (ii) few recent works use the lip-movements as additional cues with an aim to improve the quality of the generated speech over “audio-only” methods. However, they cannot be applied for several applications where the visual stream is unreliable or completely absent. Thus, in this work, we propose a new paradigm for speech enhancement: “pseudo-visual” approach, where the visual stream is synthetically generated from the noisy speech input. We demonstrate that the robustness and the accuracy boost obtained from our model lead to various real-world applications which were previously not possible. In the second part, we explore an interesting question of what can be obtained from an 8 × 8 pixel video sequence by utilizing the corresponding speech of the person talking. Surprisingly, it turns out to be quite a lot. We show that when processed with the right set of audio and image priors, we can obtain a full-length talking video sequence with a 32× scale-factor. When the semantic information about the identity, including basic attributes like age and gender, are almost entirely lost in the input low-resolution video, we show that utilizing the speech that accompanies the low-resolution video aids in recovering the key face attributes. Our proposed audio-visual upsampling network generates realistic, accurate, and high-resolution (256 × 256 pixels) talking-face videos from an 8 × 8 input video. Finally, we demonstrate that our model can be utilized in video conferencing applications where the network bandwidth consumption can be drastically reduced. We hope that our work in cross-modal content recovery enables exciting applications such as smoother video calling, accessibility of video content in low-bandwidth situations, and restoring old historical videos. Our work can also pave the way for future research avenues for cross-modal enhancement of talking-face videos

Year of completion:	June 2022
Advisor :	C V Jawahar,Vinay P Namboodiri

A study of Automatic Segmentation of 3-D Brain MRI and its application to Deformable Registration

Aabhas Majumdar

Abstract

Related Publications

Downloads

Saliency Estimation in Videos and Images

Samyak Jain

Abstract

Related Publications

Downloads

Fast and Accurate Image Recognition

Sri Aurobindo Munagala

Abstract

Related Publications

Downloads

Towards Understanding and Improving the Generalization Performance of Neural Networks

sarath sivaprasad

Abstract

Related Publications

Downloads

Exploiting Cross-Modal Redundancy for Audio-Visual Generation

Sindhu B Hegde

Abstract

Related Publications

Downloads

More Articles …