
Fast and Accurate Image Recognition


Sri Aurobindo Munagala

Abstract

Deep Neural Networks (DNNs) have found use in a wide variety of applications in recent years and have grown much larger and more resource-hungry over time. Convolutional Neural Networks (CNNs), which apply the convolution operation in successive layers, are widely used for computer vision tasks and are computationally expensive to run. Because of the size and computation required by modern models, it is difficult to use them in resource-constrained scenarios such as mobile and embedded devices. At the same time, the amount of data created and collected is growing rapidly, and annotating large amounts of data for training DNNs is expensive. Approaches such as Active Learning (AL) seek to intelligently query samples to train on in an iterative fashion, but AL setups are themselves very costly to run because large models are fully trained many times in the process. In this thesis, we explore methods to achieve extremely high speedups both in CNN inference and in the training time of AL setups.

Several paradigms for fast CNN inference have been explored in the past, two major ones being binarization and pruning. Binarization quantizes the weights and/or inputs of the network from 32-bit full-precision floats into a {-1,+1} space, aiming both at compression (single bits occupy 32 times less space than 32-bit floats) and at speedups (bitwise operations can be carried out faster). Network pruning, on the other hand, identifies and removes redundant parts of the network in an unstructured (individual weights) or structured (channels/layers) manner to create sparse, efficient networks. While both paradigms have demonstrated great efficacy in achieving speedups and compression for CNNs, little work has attempted to combine them. We argue that they are complementary and can be combined to offer high levels of compression and speedup without significant accuracy loss. Intuitively, weights and activations close to zero have higher binarization error, making them good candidates for pruning. We propose a novel Structured Ternary-Quantized Network that incorporates the speedups of binary convolution algorithms through structured pruning, enabling pruned parts of the network to be removed entirely after training. Our approach beats previous work attempting the same by a significant margin. Overall, our method brings up to 89x layer-wise compression over the corresponding full-precision networks, with only 0.33% accuracy loss on CIFAR-10 with ResNet-18 at a 40% PFR (Prune Factor Ratio for filters), and 0.3% on ImageNet with ResNet-18 at a 19% PFR.

We also explore Active Learning, which is used in scenarios where unlabelled data is abundant but annotation costs make it infeasible to use all of the data for supervised training. AL methods first train a model (referred to as the selection model) on a small pool of annotated data, and use its predictions on the remaining unlabelled data to form a query for annotating more samples. This train-and-query process repeats until a sufficient amount of data is labelled. In practice, however, sample selection in AL setups takes a lot of time because the selection model is fully re-trained on the labelled pool in every iteration.
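For context, a minimal sketch of the standard pool-based AL loop, showing where this per-round retraining cost appears. This is illustrative code, not the thesis implementation; it assumes a scikit-learn-style classifier with fit/predict_proba, a NumPy data array, and predictive entropy as the acquisition score.

    import numpy as np

    def entropy_score(probs):
        """Uncertainty of a predicted class distribution (a common acquisition score)."""
        p = np.clip(probs, 1e-12, 1.0)
        return float(-(p * np.log(p)).sum())

    def active_learning_loop(model_fn, X, oracle, init_size=1000, query_size=1000, rounds=10):
        """Standard pool-based AL: the selection model is re-trained from scratch
        on the entire labelled pool every round; that step dominates the cost."""
        labelled = {int(i): oracle(X[i])
                    for i in np.random.choice(len(X), init_size, replace=False)}
        for _ in range(rounds):
            model = model_fn()                                           # fresh selection model
            train_idx = list(labelled)
            model.fit(X[train_idx], [labelled[i] for i in train_idx])    # full re-training
            pool = [i for i in range(len(X)) if i not in labelled]
            scores = [entropy_score(model.predict_proba(X[[i]])[0]) for i in pool]
            query = [pool[j] for j in np.argsort(scores)[-query_size:]]  # most uncertain samples
            labelled.update({i: oracle(X[i]) for i in query})            # annotate and grow pool
        return labelled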
We offer two improvements to the standard AL setup that significantly reduce the overall training time required for sample selection: first, the introduction of a “memory” that lets us train the selection model on fewer samples each round rather than on the entire labelled dataset accumulated so far; and second, the use of fast-convergence techniques to reduce the number of epochs the selection model is trained for. Our proposed improvements can work in tandem with earlier techniques such as the use of proxy models for selection, and the combined improvements bring more than 75x speedups in overall sample-selection time to standard AL setups, making them feasible and easy to run in real-life scenarios where procuring large amounts of data is easy but labelling it is difficult.
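Returning to the quantization side of the thesis: the abstract describes mapping 32-bit weights into a {-1,+1} space, with values near zero being the natural pruning candidates. Below is a minimal, generic sketch of threshold-based ternary weight quantization with a per-filter scale; it illustrates the general idea only and is not the thesis's Structured Ternary-Quantized Network (the threshold rule and scaling factor are assumptions).

    import torch

    def ternarize_filters(weight: torch.Tensor, delta_ratio: float = 0.05):
        """Quantize a conv weight tensor (out_ch, in_ch, kH, kW) to {-1, 0, +1}
        with a per-filter scale alpha. Values inside the threshold band become 0;
        filters that end up mostly zero are natural candidates for structured pruning."""
        out_ch = weight.shape[0]
        w = weight.view(out_ch, -1)
        delta = delta_ratio * w.abs().max(dim=1, keepdim=True).values   # per-filter threshold
        q = torch.zeros_like(w)
        q[w > delta] = 1.0
        q[w < -delta] = -1.0
        nonzero = q.abs().sum(dim=1, keepdim=True).clamp(min=1.0)
        alpha = (w.abs() * q.abs()).sum(dim=1, keepdim=True) / nonzero  # per-filter scale
        return (alpha * q).view_as(weight), q.view_as(weight), alpha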

Year of completion: January 2022
Advisor: Anoop M Namboodiri

Related Publications

Downloads

thesis

Towards Understanding and Improving the Generalization Performance of Neural Networks


Sarath Sivaprasad

Abstract

The widespread popularity of over-parameterized deep neural networks (NNs) is backed by their ‘unreasonable’ performance on unseen data that is independent and identically distributed (IID) with respect to the training data. This generalization cannot be explained by traditional machine-learning wisdom, which holds that increasing the number of parameters leads to overfitting on the training samples and, consequently, reduced generalization. Various generalization measures have been proposed in recent times to explain the generalization of deep networks, but despite some promising investigations there is little consensus on how to explain it. Furthermore, the ability of neural networks to fit any training data under any random configuration of labels makes their generalization even harder to explain. Despite this ability to completely fit any given data, a neural network seems able to ‘cleverly’ learn a generalizing solution. We hypothesize that this ‘simple’ solution lies in a constrained subspace of the hypothesis space, and propose a constrained formulation of neural networks to close the generalization gap. We show that, through a principled constraint, we can achieve comparable train and test performance. Specifically, we constrain each output of the neural network to be a convex combination of its inputs, ensuring desirable geometry of the decision boundaries. This document covers two major aspects: the first shows the improved generalization of neural networks under convex constraints; the second goes beyond the IID setting and investigates the generalization of neural networks on out-of-distribution (OOD) test sets.

In the first section of the document, we investigate a constrained formulation of neural networks in which the output is a convex function of the input. We show that the convexity constraints can be enforced on both fully connected and convolutional layers, making them applicable to most architectures. The constraints consist of restricting the weights of all but the first layer to be non-negative and using a non-decreasing convex activation function. Albeit simple, these constraints have profound implications for the generalization abilities of the network. We draw three valuable insights: (a) Input-Output Convex Neural Networks (IOC-NNs) self-regularize and significantly reduce the problem of overfitting; (b) although heavily constrained, they outperform the base multi-layer perceptrons and achieve performance similar to base convolutional architectures; and (c) IOC-NNs are robust to noise in the training labels. We demonstrate the efficacy of the proposed idea through thorough experiments and ablation studies on six commonly used image-classification datasets with three different neural network architectures.

In the second section, we revisit the ability of networks to completely fit any given data and yet ‘cleverly’ learn a generalizing hypothesis from the many variances that can explain the training data. In accordance with concurrent findings, our explorations show that neural networks learn the most ‘low-lying’ variance in the data: they learn the features that correlate easily with the label and explore no further after finding such a solution. With this insight, we revisit the need to understand and improve the generalization of neural networks.
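A minimal sketch of the convexity constraint described in the first part, assuming a PyTorch-style multi-layer perceptron; this is an illustrative formulation based only on the constraints stated in the abstract (non-negative weights for all but the first layer, convex non-decreasing activation), not the authors' released code.

    import torch
    import torch.nn as nn

    class IOCMLP(nn.Module):
        """Input-Output Convex MLP: the first layer is unconstrained; every later layer
        keeps non-negative weights and uses a convex, non-decreasing activation (ReLU),
        so each output is a convex function of the input."""
        def __init__(self, in_dim, hidden_dim, out_dim, depth=3):
            super().__init__()
            dims = [in_dim] + [hidden_dim] * (depth - 1) + [out_dim]
            self.layers = nn.ModuleList([nn.Linear(dims[i], dims[i + 1]) for i in range(depth)])
            self.act = nn.ReLU()                       # convex and non-decreasing

        def project(self):
            """Clamp weights of all but the first layer to be non-negative;
            call after every optimizer step to enforce the constraint."""
            with torch.no_grad():
                for layer in self.layers[1:]:
                    layer.weight.clamp_(min=0.0)

        def forward(self, x):
            for layer in self.layers[:-1]:
                x = self.act(layer(x))
            return self.layers[-1](x)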
We go beyond traditional IID and OOD evaluation benchmarks to further our understanding of learning in deep networks. Through our explorations, we give a possible explanation for why neural networks do well on certain benchmarks and why other inventive methods fail to give any consistent improvement over a simple neural network. Domain Generalization (DG) requires a model to learn, from multiple distributions, a hypothesis that generalizes to an unseen distribution, and has been perceived as the front face of OOD generalization. We present empirical evidence that the primary reason for generalization in DG is the presence of multiple domains during training. Furthermore, we show that methods for generalization in the IID setting are equally important for generalization in DG, while tailored methods fail to add performance gains under the Traditional DG (TDG) evaluation. Our experiments prompt the question of whether TDG has outlived its usefulness in evaluating OOD generalization. To strengthen our investigation, we propose a novel evaluation strategy, ClassWise DG (CWDG), where for each class we randomly select one of the domains and keep it aside for testing. We argue that this benchmarking is closer to human learning and more relevant in real-world scenarios. Counter-intuitively, despite the model being exposed to all domains during training, CWDG is more challenging than the TDG evaluation. While explaining these observations, our work makes a case for more fundamental analysis of the DG problem before exploring new ideas to tackle it.

Keywords: generalization of deep networks, constrained formulation, input-output-convex neural network, robust generalization bounds, explainable decision boundaries, mixture of experts
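A minimal sketch of the ClassWise DG split described above; the record structure (dicts with 'label' and 'domain' keys) is an assumption made for illustration.

    import random
    from collections import defaultdict

    def classwise_dg_split(samples, seed=0):
        """ClassWise DG (CWDG): for every class, hold out one randomly chosen domain
        for testing, so samples of that (class, domain) pair never appear in training."""
        rng = random.Random(seed)
        domains_per_class = defaultdict(set)
        for s in samples:
            domains_per_class[s["label"]].add(s["domain"])
        held_out = {c: rng.choice(sorted(d)) for c, d in domains_per_class.items()}

        train = [s for s in samples if s["domain"] != held_out[s["label"]]]
        test = [s for s in samples if s["domain"] == held_out[s["label"]]]
        return train, test, held_out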

Year of completion: May 2022
Advisor: Vineet Gandhi

Related Publications

Downloads

thesis

Exploiting Cross-Modal Redundancy for Audio-Visual Generation


Sindhu B Hegde

Abstract

We interact with the world around us through multiple sensory streams of information such as audio, vision, and text (language). These streams complement each other, but also carry redundant information, albeit in different forms. For example, the content of a person speaking can be captured by listening to the speech sounds, partially understood by watching the speaker's lip movements, or read from the text transcribed from the speech. Human perceptual understanding makes use of this cross-modal redundancy to solve various practical problems. In the real world, however, the information in individual streams is often corrupted by various kinds of degradation, such as electronic transmission, background noise, and blurring, which deteriorate the content quality. In this work, we aim to recover the distorted signal in a given stream by exploiting the redundant information in another stream. Specifically, we deal with talking-face videos involving vision and speech signals, and propose two core ideas for exploiting cross-modal redundancy: (i) denoising speech using visual assistance, and (ii) upsampling very low-resolution talking-face videos using audio assistance.

The first part focuses on speech denoising. We show that the visual stream helps in distilling the clean speech from the corrupted signal by suppressing the background noise. We identify the key issues in existing state-of-the-art speech enhancement works: (i) most current works use only the audio stream and are limited in their performance across the wide range of real-world noises, and (ii) a few recent works use lip movements as additional cues to improve the quality of the generated speech over “audio-only” methods, but they cannot be applied in the many settings where the visual stream is unreliable or completely absent. We therefore propose a new paradigm for speech enhancement, the “pseudo-visual” approach, in which the visual stream is synthetically generated from the noisy speech input. We demonstrate that the robustness and accuracy boost obtained from our model enable various real-world applications which were previously not possible.

In the second part, we explore what can be obtained from an 8 × 8 pixel video sequence by utilizing the corresponding speech of the person talking. Surprisingly, it turns out to be quite a lot. We show that, when processed with the right set of audio and image priors, we can obtain a full-length talking video sequence at a 32× scale factor. Even when the semantic information about identity, including basic attributes like age and gender, is almost entirely lost in the low-resolution input, utilizing the accompanying speech helps recover the key face attributes. Our proposed audio-visual upsampling network generates realistic, accurate, and high-resolution (256 × 256 pixel) talking-face videos from an 8 × 8 input video. Finally, we demonstrate that our model can be used in video-conferencing applications where network bandwidth consumption can be drastically reduced. We hope that our work on cross-modal content recovery enables exciting applications such as smoother video calling, access to video content in low-bandwidth situations, and restoration of old historical videos, and paves the way for future research on cross-modal enhancement of talking-face videos.
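For concreteness on the shapes involved: a 32× scale factor takes an 8 × 8 face crop (64 pixels) to 256 × 256 (65,536 pixels), so the network must fill in roughly a thousand times more pixels per frame, guided by the accompanying speech. A hypothetical interface sketch follows; the function name, tensor layout, and the model stand-in are assumptions, not the thesis's code.

    import torch

    def upsample_talking_face(video_lr: torch.Tensor, audio: torch.Tensor, model) -> torch.Tensor:
        """Audio-visual 32x upsampling: (T, 3, 8, 8) low-resolution frames plus the
        accompanying speech waveform -> (T, 3, 256, 256) high-resolution frames.
        `model` stands in for the audio-visual upsampling network."""
        assert video_lr.shape[-2:] == (8, 8), "expects 8x8 input frames"
        video_hr = model(video_lr, audio)
        assert video_hr.shape[-2:] == (256, 256)      # 32x per spatial dimension
        return video_hr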

Year of completion: June 2022
Advisors: C V Jawahar, Vinay P Namboodiri

Related Publications

Downloads

thesis

Computer-Aided Diagnosis of Closely Related Diseases


Abhinav Dhere

Abstract

It is often observed that certain human diseases exhibit similarities in some form while having different prognoses and requiring different treatment strategies. These similarities may lie in the risk factors for the diseases, the symptoms observed, visual similarity in imaging studies, or, in some cases, similarity in molecular associations. Computer-Aided Diagnosis (CAD) of such closely related diseases is challenging and requires tailored approaches to discriminate between them accurately. This thesis looks at two sets of closely related diseases, of two different organs, identified from two different imaging modalities, and develops novel approaches for explainable and accurate CAD in each case. The two problems are discriminating healthy, mild cognitive impairment (MCI), and Alzheimer's Disease (AD) cases from brain MRI-derived surface meshes, and classifying healthy, non-COVID pneumonia, and COVID pneumonia cases from chest X-ray images.

In the first part of this thesis, we present a novel 2D image representation of the brain mesh surface, called a height map, and explore its use for hierarchical classification of healthy, MCI, and AD cases. We also compare different strategies for extracting features and regions of interest from height maps and their performance on healthy vs. MCI vs. AD classification. We demonstrate that the proposed method achieves fast classification of AD and MCI with a minor loss of accuracy compared to the state of the art.

In the second half of this thesis, we present a novel deep learning architecture called Multi-scale Attention Residual Learning (MARL), together with a new conicity loss for training it. We use MARL and the conicity loss to achieve hierarchical classification of normal, non-COVID pneumonia, and COVID pneumonia cases from chest X-ray images. We present classification results on three public datasets and demonstrate that the proposed method achieves comparable or marginally better performance than the state of the art in all cases. Through extensive experimentation, we further show that the proposed framework produces clinically consistent explanations. Qualitatively, this is shown by comparing GradCAM heatmaps of the proposed method with those of the state-of-the-art method: our heatmaps overlap better with the expert-marked bounding boxes for pneumonia. Quantitatively, we show that the GradCAM heatmaps for the proposed method generally lie within the inner regions of the lung for non-COVID pneumonia, whereas they lie in the outer regions for COVID pneumonia. We thus establish the clinical consistency of the explanations provided by the proposed framework.
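A minimal sketch of the hierarchical (two-stage) decision for the chest X-ray setting described above, assuming a first stage separating normal from pneumonia and a second stage separating COVID from non-COVID pneumonia; the staging order and the two single-logit model heads are assumptions, not the thesis's MARL architecture.

    import torch

    @torch.no_grad()
    def hierarchical_predict(x, normal_vs_pneumonia, covid_vs_noncovid):
        """Two-stage hierarchical decision: stage 1 decides normal vs pneumonia;
        only pneumonia cases are passed to stage 2 (COVID vs non-COVID)."""
        p_pneu = torch.sigmoid(normal_vs_pneumonia(x)).squeeze(-1)       # P(pneumonia)
        labels = []
        for i, p in enumerate(p_pneu):
            if p < 0.5:
                labels.append("normal")
            else:
                p_covid = torch.sigmoid(covid_vs_noncovid(x[i:i + 1])).item()
                labels.append("COVID pneumonia" if p_covid >= 0.5 else "non-COVID pneumonia")
        return labels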

Year of completion: June 2022
Advisor: Jayanthi Sivaswamy

Related Publications

Downloads

thesis

Deep Learning for Assisting Annotation and Studying Relationship between Cancers using Histopathology Whole Slide Images


Ashish Menon

Abstract

Cancer is one of the leading causes of death across the globe, and there has been a constant push from the scientific community towards assisting its diagnosis. The last decade in particular has seen widespread use of computer vision and AI for cancer diagnosis from both radiological (non-invasive, e.g., X-ray, CT) and pathological (invasive, e.g., histopathology) modalities. A histopathology Whole Slide Image (WSI) is a digitized image of a tissue sample, characterized by a very large size of up to 10⁹ pixels at maximum resolution, and is considered the gold standard for cancer diagnosis. In routine diagnosis, experts called pathologists analyse a slide containing the tissue sample under a microscope. The process imposes a heavy cognitive load, and the diagnosis is prone to inter- and intra-pathologist errors. With tissue samples digitized as WSIs, computer-assisted diagnosis can help address these issues, especially with the advent of deep learning. This, however, requires models trained on large amounts of annotated data as well as an understanding of how cancer manifests across organs. In this thesis, we address two major issues with the help of deep learning techniques: (1) assisting whole slide image annotation with an expert in the loop, and (2) understanding the relationship between cancers and bringing to light commonalities in cancer patterns between certain pairs of organs.

A typical slide diagnosis under a microscope involves exhaustive scanning across the slide in search of anomalous or tumorous regions. Owing to the large dimensions of a histopathology WSI, visually searching for clinically significant regions (patches) is a tedious task for a medical expert, and sequential analysis of several such images further increases the workload, resulting in poor diagnosis. A major impediment to automating this task with a deep learning model is the requirement of large amounts of annotated WSI patches, which is a laborious process involving exhaustive search for anomalous regions. To tackle this, the first part of the thesis proposes a novel CNN-based, expert-feedback-driven interactive learning technique. The proposed method acquires labels for the most informative patches in small increments over multiple feedback rounds to maximize throughput. The expert queries a patch of interest from a slide and provides feedback on a set of unlabelled patches chosen from a ranked list using the proposed sampling strategy. The technique is applied in a setting that assumes a large cohort of unannotated slides, almost eliminating the need for annotated data upfront; instead, it learns with expert involvement. We discuss several strategies for sampling the right set of patches to be labelled by the expert, so as to minimize expert feedback and maximize throughput. The proposed technique can also annotate multiple slides in parallel using a single slide under review (used to query anomalous patches), which further reduces the annotation effort.

The Cancer Genome Atlas (TCGA) contains large repositories of histopathology whole slide images spanning several organs and subtypes. However, not much work has gone into analysing all these organs and subtypes and their similarities.
Our work attempts to bridge this gap by training deep learning models to classify cancer vs. normal patches for 11 subtypes spanning 7 organs (9,792 tissue slides), achieving near-perfect classification performance. We then used these models to investigate their performance on the test sets of other organs (cross-organ inference). We found that every model showed good cross-organ inference accuracy when tested on breast, colorectal, and liver cancers. Further, high accuracy is observed between models trained on cancer subtypes originating from the same organ (kidney and lung). We also validated these results by showing the separability of cancer and normal samples in a high-dimensional feature space. We further hypothesized that the high cross-organ inference accuracies are due to tumor morphologies shared among organs, and validated this hypothesis by showing overlap in the Gradient-weighted Class Activation Mapping (GradCAM) visualizations and similarities in the distributions of geometrical nuclei features within the high-attention regions.
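A minimal sketch of the cross-organ inference evaluation described above (illustrative; the dictionary interface and plain accuracy metric are assumptions):

    def cross_organ_accuracy(models, test_sets):
        """Evaluate every subtype-specific model on every organ's test set,
        producing a cross-organ accuracy matrix.
        models    : dict mapping subtype name -> fitted classifier with .predict
        test_sets : dict mapping organ name -> (patches, labels)."""
        matrix = {}
        for subtype, model in models.items():
            matrix[subtype] = {}
            for organ, (X, y) in test_sets.items():
                preds = model.predict(X)
                acc = sum(int(p == t) for p, t in zip(preds, y)) / len(y)
                matrix[subtype][organ] = acc      # high off-diagonal accuracy suggests
        return matrix                             # shared tumor morphology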

Year of completion: June 2022
Advisors: C V Jawahar, Vinod P K

Related Publications

Downloads

thesis
