
Computational Video Editing and Re-editing


Moneish Kumar

Abstract

Advances in video capture technology have made recording video very easy. Small, affordable cameras that boast high-end specifications and capture video at very high resolutions (4K, 8K and even 16K) have put high-quality recording within everyone's reach. Yet while capturing video is now straightforward and effortless, a major part of the video production process, editing, remains labor intensive and demands skill and expertise. This thesis takes a step towards automating the editing process and making it less time consuming. Specifically, (1) we explore a novel approach for automatically editing stage performances so that both the context of the scene and close-up details of the actors are shown, and (2) we propose a new method to optimally retarget videos to any desired aspect ratio while retaining salient regions derived using gaze tracking.

Recordings of stage performances are easy to capture with a high-resolution camera, but are difficult to watch because the actors' faces are too small. We present an approach that automatically creates a split-screen video transforming these recordings to show both the context of the scene and close-up details of the actors. Given a static recording of a stage performance and tracking information about the actors' positions, our system generates videos showing a focus+context view based on computed close-up camera motions using crop-and-zoom. The key to our approach is to compute these camera motions such that they are cinematically valid close-ups, and to ensure that the set of views of the different actors is properly coordinated and presented. We pose the computation of camera motions as a convex optimization that creates detailed views and smooth movements, subject to cinematic constraints such as not cutting faces with the edge of the frame. Additional constraints link the close-up views of each actor, causing them to merge seamlessly when actors are close. The generated views are placed in a resulting layout that preserves the spatial relationships between actors. This eliminates the manual labour and expertise required both for capturing the performance and for editing it later; instead, the split screen of focus+context views lets the viewer actively decide what to attend to. We demonstrate our results on a variety of staged theater and dance performances.

Videos are captured at a specific aspect ratio chosen for the target screen on which they are meant to be viewed, which results in an inferior viewing experience when they are watched on screens with a different aspect ratio. We present an approach to automatically retarget any given video to any desired aspect ratio while preserving its most salient regions, obtained using gaze tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions and (ii) adhere to the principles of cinematography. The algorithm has two steps. The first uses dynamic programming to find a cropping window path that maximizes gaze inclusion within the window and to locate plausible new cuts (if required). The second performs regularized convex optimization on the path obtained via dynamic programming to produce a smooth cropping window path comprised of piecewise linear, constant and parabolic segments. We test our re-editing algorithm on a diverse collection of movie and theater sequences. A study conducted with 16 users confirms that our retargeting algorithm results in a superior viewing experience compared to gaze-driven re-editing [30] and letterboxing, especially for wide-angle static camera recordings.
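The first, dynamic-programming step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the window is restricted to horizontal motion, candidate positions are discretized, and the `move_cost` penalty is an assumed stand-in for the full cinematic constraints (the thesis additionally detects cuts and then smooths the path via convex optimization).

```python
import numpy as np

def dp_crop_path(gaze_x, frame_w, crop_w, move_cost=0.1, step=8):
    """Dynamic-programming sketch: pick a horizontal crop-window position
    per frame that maximizes gaze inclusion while penalizing large jumps.
    gaze_x: one array of horizontal gaze coordinates per frame."""
    positions = np.arange(0, frame_w - crop_w + 1, step)  # candidate left edges
    n, m = len(gaze_x), len(positions)
    # reward[t, j] = fraction of gaze samples inside window j at frame t
    reward = np.array([[np.mean((g >= p) & (g < p + crop_w)) for p in positions]
                       for g in gaze_x])
    score = np.full((n, m), -np.inf)
    back = np.zeros((n, m), dtype=int)
    score[0] = reward[0]
    for t in range(1, n):
        for j in range(m):
            # best predecessor, penalizing window movement
            trans = score[t - 1] - move_cost * np.abs(positions - positions[j]) / crop_w
            back[t, j] = int(np.argmax(trans))
            score[t, j] = reward[t, j] + trans[back[t, j]]
    # backtrack the highest-scoring path
    path = np.zeros(n, dtype=int)
    path[-1] = int(np.argmax(score[-1]))
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return positions[path]
```

In the thesis this raw path is then refined into piecewise constant, linear and parabolic segments by the convex-optimization step.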

 

Year of completion: November 2018
Advisor: Vineet Gandhi


Geometric + Kinematic Priors and Part-based Graph Convolutional Network for Skeleton-based Human Action Recognition


Kalpit Thakkar

Abstract

Videos dominate the media archive, providing a rich and vast source of information from which machines can learn to understand such data and predict useful attributes that help technology improve human lives. One of the most significant and instrumental parts of a computer vision system is the comprehension of human actions from visual sequences, a paragon of machine intelligence achievable through computer vision. Human action recognition is highly important because it facilitates several applications built around recognizing human actions. Understanding actions from monocular sequences has been studied immensely, while comprehending human actions from skeleton videos has developed more recently. We propose action recognition frameworks that use the skeletal data of the human body (viz. 3D locations of some joints) to learn the spatio-temporal representations necessary for recognizing actions. Information about human actions is composed of two integral dimensions, space and time, along which the variations occur. The spatial representation of a human action aims to encapsulate the configuration of the human body essential to the action, while the temporal representation aims to capture the evolution of such configurations across time. To this end, we propose to use geometric relations between human skeleton joints to discern the body pose relative to the action, and physically inspired kinematic quantities to understand the temporal evolution of body pose. Spatio-temporal understanding of human actions is thus conceived as a comprehension of geometric and kinematic information with the help of machine learning frameworks. Using a representation that amalgamates geometric and kinematic features, we recognize human actions from skeleton videos (S-videos) with such frameworks.

We first present a non-parametric approach for temporal sub-segmentation of trimmed action videos using the angular momentum trajectory of the skeletal pose sequence. A meaningful summarization of the pose sequence is obtained by systematically sampling the resulting segments. Descriptors capturing geometric and kinematic statistics, encoded as histograms spread across a periodic range of orientations, are computed to represent the summarized pose sequences and are fed to a kernelized classifier for recognizing actions. This framework demonstrates the value of geometric and kinematic properties of human pose sequences for spatio-temporal modeling of actions. However, a downside of this framework is its inability to scale with the availability of large amounts of visual data. To mitigate this drawback, we next present geometric deep learning frameworks, specifically graph convolutional networks, for the same task. Representing the human skeleton as a sparse spatial graph is intuitive and yields a structured form that lies on graph manifolds. A human action video hence forms a spatio-temporal graph, and graph convolutions facilitate learning a spatio-temporal descriptor for the action. Inspired by the success of Deformable Part-based Models (DPMs) for object understanding from images, we propose a part-based graph convolutional network (PB-GCN) that operates on a human skeletal graph divided into parts. Building on the success of geometric and kinematic features, we propose to use relative coordinates and temporal displacements of the 3D joint coordinates as features at the vertices of the skeletal graph. Owing to these signals, the prowess of graph convolutional networks is further boosted to attain state-of-the-art performance among action recognition systems using skeleton videos.

In this thesis, we examine the growth of the idea of using geometry and kinematics, transition to geometric deep learning frameworks, and design a PB-GCN with geometric + kinematic signals at the vertices for the task of human action recognition using skeleton videos.
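The vertex signals and the part-based convolution described above might be sketched as follows. This is a minimal numpy illustration, not the exact PB-GCN formulation: the anchor joint for relative coordinates, the simple mean-normalized adjacency, and the function names are all illustrative assumptions.

```python
import numpy as np

def vertex_features(joints, anchor=0):
    """Geometric + kinematic vertex signals: per-joint relative coordinates
    (geometry) concatenated with temporal displacements (kinematics).
    joints: (T, V, 3) array of 3D joint locations for T frames, V joints."""
    rel = joints - joints[:, anchor:anchor + 1, :]   # coordinates relative to an anchor joint
    disp = np.zeros_like(joints)
    disp[1:] = joints[1:] - joints[:-1]              # frame-to-frame displacement
    return np.concatenate([rel, disp], axis=-1)      # (T, V, 6)

def part_graph_conv(X, part_adjs, weights):
    """One part-based graph-convolution layer (illustrative): each body part
    has its own adjacency and weight matrix; part outputs are summed where
    parts share joints. X: (V, C_in) vertex features for one frame."""
    out = np.zeros((X.shape[0], weights[0].shape[1]))
    for A, W in zip(part_adjs, weights):
        D = A.sum(1, keepdims=True)
        D[D == 0] = 1                                # avoid division by zero
        out += (A / D) @ X @ W                       # normalized aggregation, then projection
    return out
```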

     

Year of completion: March 2019
Advisor: P J Narayanan


Towards Data-Driven Cinematography and Video Retargeting using Gaze


Kranthi Kumar Rachavarapu

Abstract

In recent years, with the proliferation of devices capable of capturing and consuming multimedia content, there has been a phenomenal increase in multimedia consumption, most of it dominated by video. This creates a need for efficient tools and techniques to create videos and better ways to render the content. Addressing these problems, in this thesis we focus on (a) algorithms for efficient video content adaptation and (b) automating the process of video content creation.

To address efficient video content adaptation, we present a novel approach to optimally retarget videos for displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions and (ii) adhere to the principles of cinematography. Our approach is (a) content agnostic, as the same methodology is employed to re-edit a wide-angle video recording or a close-up movie sequence captured with a static or moving camera, and (b) independent of video length, so it can in principle re-edit an entire movie in one shot. The proposed retargeting algorithm consists of two steps. The first employs gaze transition cues to detect, via dynamic programming, the time stamps where new cuts are to be introduced in the original video. A subsequent step optimizes the cropping window path (to create pan and zoom effects) while accounting for the original and new cuts. The cropping window path is designed to include maximum gaze information and is composed of piecewise constant, linear and parabolic segments; it is obtained via L1-regularized convex optimization, which ensures a smooth viewing experience. We test our approach on a wide variety of videos and demonstrate significant improvement over the state of the art, both in computational complexity and in qualitative aspects. A study performed with 16 users confirms that our approach results in a superior viewing experience compared to the state of the art and letterboxing, especially for wide-angle static camera recordings.

Since the retargeting algorithm takes a video and adapts it to a new aspect ratio, it can only use the information already present in the video, which limits its applicability. In the second part of the thesis, we address automatic video content creation by exploring deep learning techniques for automating cinematography. This formulation gives users more freedom to create content according to their preferences. Specifically, we investigate predicting shot specifications from the script by learning this association from real movies. The problem is posed as a sequence classification task using a Long Short-Term Memory (LSTM) network, which takes as input the sentence embedding and a few other high-level structural features (such as sentiment, dialogue acts and genre) corresponding to a line of dialogue, and predicts the shot specification for that line in terms of Shot-Size, Act-React and Shot-Type categories. We conducted a systematic study of the effect of feature combinations and of input sequence length on classification accuracy. We propose two different formulations of the problem using the LSTM architecture and extensively study the suitability of each for the task. We also created a new dataset for this task consisting of 16,000 shots and 10,000 dialogue lines. The experimental results are promising in terms of quantitative measures such as classification accuracy and F1-score.
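The sequence-classification formulation can be illustrated with a toy numpy LSTM. This is a hedged sketch, not the thesis model: real sentence embeddings, the two formulations, and training are all omitted, and `predict_shot_sizes` with a three-class output is an illustrative stand-in for predicting one category (Shot-Size) per dialogue line.

```python
import numpy as np

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM step in numpy. Gate pre-activations are stacked as [i, f, o, g]."""
    z = Wx @ x + Wh @ h + b
    H = h.size
    i, f, o = (1.0 / (1.0 + np.exp(-z[k * H:(k + 1) * H])) for k in range(3))
    g = np.tanh(z[3 * H:])
    c = f * c + i * g            # update cell state
    h = o * np.tanh(c)           # emit hidden state
    return h, c

def predict_shot_sizes(dialogue_feats, params, W_out):
    """Run an LSTM over per-line feature vectors (sentence embedding plus
    sentiment/genre flags, concatenated upstream) and emit one shot-size
    class per dialogue line. All names here are illustrative."""
    Wx, Wh, b = params
    h = np.zeros(Wh.shape[1])
    c = np.zeros(Wh.shape[1])
    preds = []
    for x in dialogue_feats:
        h, c = lstm_step(x, h, c, Wx, Wh, b)
        preds.append(int(np.argmax(W_out @ h)))  # e.g. 0=long, 1=medium, 2=close-up
    return preds
```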

       

Year of completion: April 2019
Advisor: Vineet Gandhi


Development and Tracking of Consensus Mesh for Monocular Depth Sequences


Gaurav Mishra

Abstract

Human body tracking typically requires specialized capture set-ups. Although pose tracking is available in consumer devices like the Microsoft Kinect, it is restricted to stick figures visualizing body-part detection. In this thesis, we propose a method for full 3D human body shape and motion capture of arbitrary movements from the depth channel of a single Kinect, while the subject wears casual clothes. We do not use the RGB channel or an initialization procedure that requires the subject to move around in front of the camera. This makes our method applicable to arbitrary clothing textures and lighting environments, with minimal subject intervention. Our method consists of 3D surface feature detection and articulated motion tracking, regularized by a statistical human body model [40]. We also propose the Consensus Mesh (CMesh), a 3D template of a person created from a single viewpoint. We demonstrate tracking results on challenging poses and argue that using the CMesh along with statistical body models can improve tracking accuracy. Quantitative evaluation of our dense body tracking shows that our method has very little drift, which is further reduced by the use of the CMesh.

We then explore improving the quality of the CMesh using RGB images in a post-processing step. For this we propose a pipeline involving Generative Adversarial Networks. We show that the CMesh can be improved from RGB images of the original person by learning corresponding relative normal maps (N_R maps). These N_R maps have the potential to encode the nuances of the CMesh with respect to the ground-truth object. We explore this method in a synthetic setting for static human-like objects and demonstrate quantitatively that the details learned by such a pipeline are invariant to lighting and texture changes. In the future, the generated N_R maps could be used to improve the quality of the CMesh.
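The role of the N_R map can be pictured with a toy sketch. The thesis does not spell out the exact encoding here, so this assumes the simplest choice, a per-pixel residual between the smooth template normals and detailed ground-truth normals; `relative_normal_map` and `refine_normals` are illustrative names, not the thesis pipeline (which learns the map with a GAN).

```python
import numpy as np

def relative_normal_map(coarse_n, detailed_n):
    """Illustrative N_R map: per-pixel residual between smooth CMesh normals
    and detailed ground-truth normals. Inputs: (H, W, 3) unit-normal images."""
    return detailed_n - coarse_n

def refine_normals(coarse_n, nr):
    """Apply a (predicted) N_R map to recover detailed normals, renormalizing
    so the result is again a unit-normal image."""
    n = coarse_n + nr
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```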

         

Year of completion: June 2019
Advisors: P J Narayanan and Kiran Varanasi


Adversarial Training for Unsupervised Monocular Depth Estimation


Ishit Mehta

Abstract

The problem of estimating scene depth from a single image has seen great progress recently. It is one of the foundational problems in computer vision and has hence been studied from various angles. Since the advent of deep learning, most approaches are data driven: they train high-capacity models with large amounts of data in an end-to-end fashion, relying on ground-truth depth, which is hard to capture and process. Recently, self-supervised methods have been proposed that rely on view supervision as an alternative; these methods minimize a photometric reconstruction error in order to learn depth. In this work, we propose a geometry-aware generative adversarial network that generates multiple novel views from a single image, learning depth as an intermediate step. The synthesized views are discerned from real images using discriminative learning, and we show the gains of this adversarial framework over previous methods. Furthermore, we present a structured adversarial training routine that trains the network on easy examples first and difficult ones later. The combination of adversarial framework, multi-view learning, and structured training produces state-of-the-art performance on unsupervised depth estimation for monocular images. We also compare our method with human depth perception through a series of experiments, investigating whether monocular depth cues such as relative size, occlusion and height in the visual field exist in artificial vision systems. With quantitative and qualitative experiments, we highlight the shortcomings of artificial depth perception and propose future avenues for research.
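The photometric reconstruction error that drives view supervision can be sketched in a deliberately simplified stereo setting. This assumes grayscale images, purely horizontal disparity, and linear sampling along x only, which is far simpler than the full geometry-aware warping used in the thesis; the function name is illustrative.

```python
import numpy as np

def photometric_loss(left, right, disparity):
    """Reconstruct the left image by sampling the right image at horizontally
    shifted coordinates given a predicted per-pixel disparity (in pixels),
    then return the mean L1 photometric error. Images are (H, W) grayscale."""
    H, W = left.shape
    xs = np.arange(W)[None, :] - disparity               # source columns in the right image
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)     # left neighbor for interpolation
    w = np.clip(xs - x0, 0.0, 1.0)                       # interpolation weight
    rows = np.arange(H)[:, None]
    recon = (1 - w) * right[rows, x0] + w * right[rows, x0 + 1]  # linear sampling in x
    return float(np.mean(np.abs(recon - left)))
```

A correct disparity map makes the reconstruction match the input (up to border pixels), so minimizing this loss over a network's predicted disparity is what lets depth emerge without ground truth.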

           

Year of completion: July 2019
Advisor: P J Narayanan

