
Development and Tracking of Consensus Mesh for Monocular Depth Sequences


Gaurav Mishra

Abstract

Human body tracking typically requires specialized capture set-ups. Although pose tracking is available in consumer devices like the Microsoft Kinect, it is restricted to stick figures visualizing body-part detection. In this thesis, we propose a method for full 3D human body shape and motion capture of arbitrary movements from the depth channel of a single Kinect, when the subject wears casual clothes. We do not use the RGB channel or an initialization procedure that requires the subject to move around in front of the camera. This makes our method applicable to arbitrary clothing textures and lighting environments, with minimal subject intervention. Our method consists of 3D surface feature detection and articulated motion tracking, which is regularized by a statistical human body model [40]. We also propose the idea of a Consensus Mesh (CMesh), a 3D template of a person created from a single viewpoint. We demonstrate tracking results on challenging poses and argue that using the CMesh along with statistical body models can improve tracking accuracy. Quantitative evaluation of our dense body tracking shows that our method has very little drift, which is further reduced by the use of the CMesh.

We then explore the possibility of improving the quality of the CMesh using RGB images in a post-processing step. For this we propose a pipeline involving Generative Adversarial Networks. We show that the CMesh can be improved from RGB images of the original person by learning corresponding relative normal maps (N_R maps). These N_R maps have the potential to encode the nuances of the CMesh with respect to the ground-truth object. We explore this method in a synthetic setting for static human-like objects and demonstrate quantitatively that the details learned by such a pipeline are invariant to lighting and texture changes. In future work, the generated N_R maps can be used to improve the quality of the CMesh.
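As a rough illustration of the fusion idea behind the CMesh, here is a minimal sketch that averages per-frame vertex estimates, assumed to be already registered to a common template, weighting each vertex by a per-frame confidence. The function name, array shapes, and weighting scheme are hypothetical illustrations, not the thesis implementation.

    import numpy as np

    def consensus_mesh(frames, weights):
        # frames: (F, V, 3) template-aligned vertex positions per frame
        # weights: (F, V) per-vertex visibility/confidence scores
        w = weights[..., None]
        return (frames * w).sum(axis=0) / (w.sum(axis=0) + 1e-8)

    # Toy check: 10 noisy, registered observations of a 4-vertex mesh
    # average back to something close to the underlying template.
    rng = np.random.default_rng(0)
    template = rng.standard_normal((4, 3))
    frames = template[None] + 0.01 * rng.standard_normal((10, 4, 3))
    weights = rng.uniform(0.5, 1.0, size=(10, 4))
    print(np.abs(consensus_mesh(frames, weights) - template).max())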

 

Year of completion: June 2019
Advisors: P J Narayanan and Kiran Varanasi

Related Publications


Downloads

thesis

Adversarial Training for Unsupervised Monocular Depth Estimation

Ishit Mehta

Abstract

The problem of estimating scene depth from a single image has seen great progress recently. It is one of the foundational problems in computer vision and has therefore been studied from various angles. Since the advent of deep learning, most approaches are data-driven: they train high-capacity models with large amounts of data in an end-to-end fashion and rely on ground-truth depth, which is hard to capture and process. Recently, self-supervised methods have been proposed which rely on view supervision as an alternative; these methods learn depth by minimizing a photometric reconstruction error. In this work, we propose a geometry-aware generative adversarial network to generate multiple novel views from a single image. Novel views are generated by learning depth as an intermediate step, and the synthesized views are discerned from real images using discriminative learning. We show the gains of using the adversarial framework over previous methods. Furthermore, we present a structured adversarial training routine that trains the network going from easy examples to difficult ones. The combination of adversarial framework, multi-view learning, and structured training produces state-of-the-art performance on unsupervised depth estimation for monocular images. We also compare our method with human depth perception by conducting a series of experiments, investigating whether monocular depth cues like relative size, occlusion, and height in the visual field exist in artificial vision systems. With quantitative and qualitative experiments, we highlight the shortcomings of artificial depth perception and propose future avenues for research.
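A minimal sketch of the training signal described above: an L1 photometric loss between the target view and a view synthesized by warping the source image with the predicted depth, plus an adversarial term that asks a discriminator to accept the synthesized view as real. The callables depth_net, warp, and D and the weight lam are hypothetical placeholders, not the thesis API.

    import torch
    import torch.nn.functional as F

    def generator_loss(src, tgt, depth_net, warp, D, lam=0.01):
        depth = depth_net(src)         # depth predicted from a single image
        synth = warp(src, depth)       # differentiable warp to the target view
        photo = F.l1_loss(synth, tgt)  # photometric reconstruction error
        logits = D(synth)              # discriminator's verdict on the fake view
        adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
        return photo + lam * adv       # view supervision plus adversarial term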

     

Year of completion: July 2019
Advisor: P J Narayanan

Related Publications

Downloads

thesis

Exploring Binarization and Pruning of Convolutional Neural Networks

Ameya Prabhu

Abstract

Deep learning models have evolved remarkably and are pushing the state of the art in various problems across domains. At the same time, the complexity and the amount of resources these DNNs consume have greatly increased. Today's DNNs are computationally intensive to train and run, especially the Convolutional Neural Networks (CNNs) used for vision applications. They also occupy a large amount of memory and consume a large amount of power during training. This poses a major roadblock to the deployment of such networks, especially in real-time applications or on resource-limited devices. Two methods have shown promise in compressing CNNs: (i) binarization and (ii) pruning. We explore these two methods in this thesis.

The first route to computational and spatial efficiency is to binarize (1-bit quantize) the weights and activations in a network. However, naive binarization results in accuracy drops for most tasks. In this work, we present a Distribution-Aware approach to Binarizing Networks (DABN) that retains the advantages of a binarized network while improving accuracy over binary networks. We also develop efficient implementations of DABN across different architectures. We present a theoretical analysis of DABN to show the effective representational power of the resulting layers, and explore the forms of data they model best. Experiments on popular sketch datasets show that DABN offers better accuracies than naive binarization. We further investigate the question of where to binarize inputs at layer-level granularity and show that selectively binarizing the inputs to specific layers in the network can lead to significant improvements in accuracy while preserving most of the advantages of binarization. We analyze the binarization trade-off using a metric that jointly models the input binarization error and the computational cost, and we introduce an efficient algorithm to select the layers whose inputs are to be binarized. We discuss practical guidelines based on insights obtained from applying the algorithm to a variety of models. Experiments on the ImageNet dataset using AlexNet and ResNet-18 models show a 3-4% improvement in accuracy over fully binarized networks with minimal impact on compression. The improvements are even more substantial on sketch datasets like TU-Berlin, where we match state-of-the-art accuracy with more than an 8% increase in accuracy over binary networks. We further show that our approach can be applied in tandem with other forms of compression that deal with individual layers or overall model compression (e.g., SqueezeNets). In contrast to previous binarization approaches, we are able to binarize the weights in the last layers of a network, which enables us to compress a large fraction of additional parameters.

The second method explored is pruning, which we investigate from a graph-theoretic perspective. Efficient CNN designs like ResNets and DenseNets were proposed to improve accuracy-vs-efficiency trade-offs; they essentially increased connectivity, allowing efficient information flow across layers. Inspired by these techniques, we propose to model the connections between the filters of a CNN using graphs that are simultaneously sparse and well connected: sparsity yields efficiency, while good connectivity preserves the expressive power of the CNN. We use a well-studied class of graphs from theoretical computer science that satisfies both properties, known as expander graphs, to model the connections between filters and design networks called X-Nets. We present two guarantees on the connectivity of X-Nets: (i) each node of a layer influences every node in a layer O(log n) steps away, where n is the number of layers between the two layers, and (ii) the number of paths between two sets of nodes is proportional to the product of their sizes. We also propose efficient training and inference algorithms, making it possible to train deeper and wider X-Nets effectively. Expander-based models give a 4% improvement in accuracy on MobileNet over grouped convolutions, a popular technique which has the same sparsity but worse connectivity. X-Nets give better performance trade-offs than the original ResNet and DenseNet-BC architectures, and we achieve model sizes comparable to state-of-the-art pruning techniques with this simple architecture design, without any pruning.
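To make the connectivity idea concrete, the sketch below wires each output filter of a convolution to a small random subset of input channels; random d-regular bipartite graphs of this kind are expanders with high probability. Masking a dense kernel is only a simulation for clarity (real sparse implementations are needed for actual speed-ups), and the class name and sizes are illustrative, not the thesis code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ExpanderConv2d(nn.Module):
        # Convolution whose filters see a sparse, fixed subset of input channels.
        def __init__(self, in_ch, out_ch, k=3, degree=4):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
            mask = torch.zeros(out_ch, in_ch, 1, 1)
            for o in range(out_ch):
                mask[o, torch.randperm(in_ch)[:degree]] = 1.0  # d inputs per output filter
            self.register_buffer("mask", mask)

        def forward(self, x):
            # Zeroed weights simulate the sparse bipartite connectivity.
            return F.conv2d(x, self.conv.weight * self.mask,
                            self.conv.bias, padding=self.conv.padding)

    x = torch.randn(1, 16, 32, 32)
    print(ExpanderConv2d(16, 32)(x).shape)  # torch.Size([1, 32, 32, 32])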

       

Year of completion: July 2019
Advisor: Anoop M Namboodiri

Related Publications

Downloads

thesis

Retinal Image Quality Improvement via Learning

Sukesh Adiga V

Abstract

Retinal images are widely used to detect and diagnose many diseases such as Diabetic Retinopathy (DR), glaucoma, Age-related Macular Degeneration, Cystoid Macular Edema, coronary heart disease, and so on. These diseases affect vision and lead to irreversible blindness; early image-based screening and monitoring of the patient is a solution. Imaging of the retina is commonly done either through Optical Coherence Tomography (OCT) or fundus photography. OCT captures cross-sectional information about the retinal layers in a 3D volume, whereas fundus imaging projects the retinal tissues onto a 2D imaging plane. Recently, smartphone camera-based fundus imaging has also been explored as a relatively low-cost option. Imaging the retina with these technologies poses challenges due to the physical properties of the light source, the quality of the optics and sensors, or low and uneven lighting conditions. In this thesis, we look at learning-based approaches, namely neural network techniques, to improve the quality of retinal images and thereby aid diagnosis.

The first part of this thesis aims at denoising OCT images, which are corrupted by speckle noise due to the underlying coherence-based imaging technique. We propose a new method for denoising OCT images with a Convolutional Neural Network that learns common features from unpaired noisy and clean OCT images in an unsupervised, end-to-end manner. The proposed method consists of a combination of two autoencoders with shared encoder layers, which we call the Shared Encoder (SE) architecture. The SE is trained to reconstruct noisy and clean OCT images with the respective autoencoders, and a denoised OCT image is obtained using a cross-model prediction. The method can be used for denoising OCT images with or without pathology, from any scanner. The SE architecture was assessed using public datasets and found to perform better than baseline methods, exhibiting a good balance between retaining anatomical integrity and reducing speckle.

The second problem we focus on is the enhancement of fundus images acquired with a smartphone camera (SC). SC imaging is a cost-effective solution for the assessment of the retina, especially in screening. However, imaging at high magnification and low light levels results in loss of details, uneven illumination, noise (particularly in the peripheral region), and flash-induced artefacts. We address these problems by matching the characteristics of images from an SC to those from a regular fundus camera (FC) using either unpaired or paired data. Two mapping solutions are designed with deep learning, one unsupervised and one supervised. The unsupervised architecture, called ResCycleGAN, is based on the CycleGAN with two significant changes: a residual connection is introduced to aid learning only the required correction, and a structural-similarity-based loss function is used to improve the clarity of anatomical structures and pathologies. This method can handle the variations seen in normal and pathological images, acquired even without mydriasis, which is attractive in screening. It produces consistently balanced results and outperforms CycleGAN both qualitatively and quantitatively. Next, a new architecture called SupEnh is proposed, which handles noise removal using paired data and enhances the quality of SC images along with denoising in an end-to-end, supervised manner. Obtaining paired data is challenging; however, it is feasible in fixed clinical settings or a commercial product, as it is required only once for learning. The proposed SupEnh method, based on the U-Net, consists of an encoder and two decoders; the network simplifies the task by learning denoising and the mapping to FC separately with the two decoders. The method handles images with and without pathologies, as well as images acquired without mydriasis. SupEnh was assessed using private datasets and found to perform better than the U-Net, and cross-validation results show that the method is robust to changes in image quality. Enhancement with SupEnh achieves a 5% higher AUC for early-stage DR detection compared with the original images.
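A minimal sketch of the Shared Encoder idea from the first part: two decoders hang off one shared encoder, and denoising is the cross-model prediction of encoding a noisy slice and decoding with the clean branch. Layer sizes and names are illustrative assumptions, not the thesis architecture.

    import torch
    import torch.nn as nn

    class SharedEncoderAE(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(   # layers shared by both autoencoders
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
            self.dec_noisy = nn.Conv2d(32, 1, 3, padding=1)  # reconstructs noisy images
            self.dec_clean = nn.Conv2d(32, 1, 3, padding=1)  # reconstructs clean images

        def forward(self, x, branch="clean"):
            z = self.encoder(x)
            return self.dec_clean(z) if branch == "clean" else self.dec_noisy(z)

    # Cross-model prediction: encode a noisy OCT slice, decode with the clean branch.
    denoised = SharedEncoderAE()(torch.rand(1, 1, 64, 64), branch="clean")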

         

Year of completion: August 2019
Advisor: Jayanthi Sivaswamy

Related Publications

Downloads

thesis

Extending Visual Object Tracking for Long Time Horizons

Abhinav Moudgil

Abstract

Visual object tracking is a fundamental task in computer vision and a key component in a wide range of applications like surveillance, autonomous navigation, video analysis and editing, and augmented reality. Given a target object with a bounding box in the first frame, the goal is to track the given target in the subsequent frames. Although significant progress has been made in this domain to address various challenges like occlusion and scale change, we observe that tracking on a large number of short sequences, as done in previous benchmarks, does not clearly bring out the competence or potential of a tracking algorithm. Moreover, even if a tracking algorithm works well on challenging short sequences but fails on moderately difficult long sequences, it will be of limited practical importance, since many tracking applications rely on precise long-term tracking. Thus, in this thesis we systematically extend the problem of visual object tracking to long time horizons.

First, we introduce a long-term visual object tracking benchmark. We propose a novel large-scale dataset, specifically tailored for long-term tracking. Our dataset consists of high-resolution, densely annotated sequences, encompassing a duration of over 400 minutes (676K frames), making it more than 20-fold larger in average duration per sequence and more than 8-fold larger in total covered duration than existing generic datasets for visual tracking. The proposed dataset paves the way to suitably assess long-term tracking performance and to train better deep learning architectures (avoiding or reducing augmentation, which may not reflect real-world behaviour). We also propose a novel metric for long-term tracking which captures the ability of a tracker to track consistently for a long duration. We benchmark 17 state-of-the-art trackers on our dataset and rank them according to several evaluation metrics and run-time speeds.

Next, we analyze the long-term tracking performance of state-of-the-art trackers in depth. We focus on three key aspects of long-term tracking: re-detection, recovery, and reliability. Specifically, we (a) test the re-detection capability of the trackers in the wild by simulating virtual cuts, (b) investigate the role of chance in the recovery of a tracker after failure, and (c) propose a novel metric allowing visual inference on the contiguous and consistent aspects of tracking. We present several insights derived from an extensive set of quantitative and qualitative experiments.

Lastly, we present a novel fully convolutional, anchor-free Siamese framework for visual object tracking. Previous works utilized anchor-based region proposal networks to improve the performance of Siamese correlation-based trackers while maintaining real-time speed. However, we show that enumerating multiple boxes at each keypoint location in the search region is inefficient and unsuitable for the task of single-object tracking, where we need to locate just one target object. We therefore take an alternate approach and directly regress box offsets and sizes for keypoint locations in the search region. The proposed approach, dubbed SiamReg, is fully convolutional, anchor-free, lighter in weight, and improves target localization. We train our framework end-to-end with a Generalized IoU loss for bounding box regression and a cross-entropy loss for target classification, and we perform several experiments on standard tracking benchmarks to demonstrate the effectiveness of our approach.
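The Generalized IoU loss used for box regression has a standard closed form, sketched below for axis-aligned boxes in (x1, y1, x2, y2) format; this follows the common GIoU definition rather than code from the thesis.

    import torch

    def giou_loss(pred, gt):
        # Intersection of predicted and ground-truth boxes.
        ix1 = torch.max(pred[:, 0], gt[:, 0]); iy1 = torch.max(pred[:, 1], gt[:, 1])
        ix2 = torch.min(pred[:, 2], gt[:, 2]); iy2 = torch.min(pred[:, 3], gt[:, 3])
        inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
        area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
        area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
        union = area_p + area_g - inter
        iou = inter / union.clamp(min=1e-7)
        # Smallest enclosing box penalizes non-overlapping predictions.
        cx1 = torch.min(pred[:, 0], gt[:, 0]); cy1 = torch.min(pred[:, 1], gt[:, 1])
        cx2 = torch.max(pred[:, 2], gt[:, 2]); cy2 = torch.max(pred[:, 3], gt[:, 3])
        c = ((cx2 - cx1) * (cy2 - cy1)).clamp(min=1e-7)
        return (1 - (iou - (c - union) / c)).mean()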

           

Year of completion: September 2019
Advisor: Vineet Gandhi

Related Publications

Downloads

thesis

More Articles …

1. Blending the Past and Present of Automatic Image Annotation
2. Audio-Visual Speech Recognition and Synthesis
3. On Compact Deep Neural Networks for Visual Place Recognition, Object Recognition and Visual Localization
4. Road Topology Extraction from Satellite images by Knowledge Sharing