Retinal Image Quality Improvement via Learning


Sukesh Adiga V

Abstract

Retinal images are widely used to detect and diagnose many diseases such as Diabetic Retinopathy (DR), glaucoma, Age-related Macular Degeneration, Cystoid Macular Edema, and coronary heart disease. These diseases affect vision and can lead to irreversible blindness. Early image-based screening and monitoring of patients is a solution. Imaging of the retina is commonly done either through Optical Coherence Tomography (OCT) or fundus photography. OCT captures cross-sectional information about the retinal layers in a 3D volume, whereas fundus imaging projects retinal tissues onto a 2D imaging plane. Recently, smartphone camera-based fundus imaging is being explored as a relatively low-cost alternative. Imaging the retina with these technologies poses challenges due to the physical properties of the light source, the quality of the optics and sensors used, and low or uneven lighting conditions. In this thesis, we look at learning-based approaches, namely neural network techniques, to improve the quality of retinal images to aid diagnosis.

The first part of this thesis aims at denoising OCT images, which are corrupted by speckle noise due to the underlying coherence-based imaging technique. We propose a new method for denoising OCT images based on a Convolutional Neural Network that learns common features from unpaired noisy and clean OCT images in an unsupervised, end-to-end manner. The proposed method consists of a combination of two autoencoders with shared encoder layers, which we call the Shared Encoder (SE) architecture. The SE is trained to reconstruct noisy and clean OCT images with the respective autoencoders, and the denoised OCT image is obtained using a cross-model prediction. The proposed method can be used for denoising OCT images with or without pathology from any scanner. The SE architecture was assessed using public datasets and found to perform better than baseline methods, exhibiting a good balance between retaining anatomical integrity and reducing speckle.

The second problem we focus on is the enhancement of fundus images acquired with a smartphone camera (SC). SC imaging is a cost-effective solution for the assessment of the retina, especially in screening. However, imaging at high magnification and low light levels results in loss of detail, uneven illumination, noise (particularly in the peripheral region) and flash-induced artefacts. We address these problems by matching the characteristics of images from an SC to those from a regular fundus camera (FC), using either unpaired or paired data. Two mapping solutions are designed using deep learning, one unsupervised and one supervised. The unsupervised architecture, called ResCycleGAN, is based on the CycleGAN with two significant changes: a residual connection is introduced so that only the required correction is learned, and a structural similarity based loss function is used to improve the clarity of anatomical structures and pathologies. This method can handle the variations seen in normal and pathological images, acquired even without mydriasis, which is attractive in screening. The method produces consistently balanced and visually more pleasing results, and outperforms CycleGAN both qualitatively and quantitatively. Next, a new architecture called SupEnh is proposed, which handles noise removal using paired data. The proposed method enhances the quality of SC images along with denoising in an end-to-end, supervised manner. Obtaining paired data is challenging; however, it is feasible in fixed clinical settings or a commercial product, as it is required only once for learning.

The proposed SupEnh method, based on U-net, consists of an encoder and two decoders. The network simplifies the task by learning denoising and mapping to the FC domain separately with the two decoders. The method handles images with and without pathologies, as well as images acquired without mydriasis. SupEnh was assessed using private datasets and found to perform better than U-net. Cross-validation results show that the method is robust to changes in image quality. Enhancement with SupEnh achieves 5% higher AUC for early-stage DR detection compared with the original images.
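A minimal sketch of the shared-encoder idea described above, assuming a simple convolutional autoencoder in PyTorch; the layer sizes, depths and training details are illustrative and not taken from the thesis. Two autoencoders share one encoder, and denoising is obtained by cross-model prediction, i.e. decoding a noisy image with the clean-domain decoder.

```python
import torch
import torch.nn as nn

class SharedEncoderDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        # One encoder shared by both autoencoders (illustrative layer sizes).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # Separate decoders for the noisy and clean image domains.
        self.decoder_noisy = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
        self.decoder_clean = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x_noisy, x_clean):
        # During training, each autoencoder reconstructs its own domain.
        rec_noisy = self.decoder_noisy(self.encoder(x_noisy))
        rec_clean = self.decoder_clean(self.encoder(x_clean))
        return rec_noisy, rec_clean

    def denoise(self, x_noisy):
        # Cross-model prediction: encode a noisy image, decode with the
        # clean-domain decoder to obtain the denoised output.
        return self.decoder_clean(self.encoder(x_noisy))
```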

 

Year of completion: August 2019
Advisor: Jayanthi Sivaswamy

Related Publications


Downloads

thesis

Extending Visual Object Tracking for Long Time Horizons

Abhinav Moudgil

Abstract

Visual object tracking is a fundamental task in computer vision and a key component in a wide range of applications such as surveillance, autonomous navigation, video analysis and editing, and augmented reality. Given a target object with a bounding box in the first frame, the goal in visual object tracking is to track the given target in the subsequent frames. Although significant progress has been made in this domain to address various challenges like occlusion and scale change, we observe that tracking on a large number of short sequences, as done in previous benchmarks, does not clearly bring out the competence or potential of a tracking algorithm. Moreover, even if a tracking algorithm works well on challenging short sequences but fails on moderately difficult long sequences, it will be of limited practical importance, since many tracking applications rely on precise long-term tracking. Thus, in this thesis we systematically extend the problem of visual object tracking to long time horizons.

First, we introduce a long-term visual object tracking benchmark. We propose a novel large-scale dataset, specifically tailored for long-term tracking. Our dataset consists of high-resolution, densely annotated sequences encompassing a duration of over 400 minutes (676K frames), making it more than 20 times larger in average duration per sequence and more than 8 times larger in total covered duration compared to existing generic datasets for visual tracking. The proposed dataset paves the way to suitably assess long-term tracking performance and to train better deep learning architectures (avoiding or reducing augmentation, which may not reflect real-world behaviour). We also propose a novel metric for long-term tracking which captures the ability of a tracker to track consistently for a long duration. We benchmark 17 state-of-the-art trackers on our dataset and rank them according to several evaluation metrics and run-time speeds.

Next, we analyze the long-term tracking performance of state-of-the-art trackers in depth. We focus on three key aspects of long-term tracking: re-detection, recovery and reliability. Specifically, we (a) test the re-detection capability of the trackers in the wild by simulating virtual cuts, (b) investigate the role of chance in the recovery of a tracker after failure, and (c) propose a novel metric allowing visual inference on the contiguous and consistent aspects of tracking. We present several insights derived from an extensive set of quantitative and qualitative experiments.

Lastly, we present a novel fully convolutional, anchor-free Siamese framework for visual object tracking. Previous works utilized anchor-based region proposal networks to improve the performance of Siamese correlation-based trackers while maintaining real-time speed. However, we show that enumerating multiple boxes at each keypoint location in the search region is inefficient and unsuitable for the task of single object tracking, where we need to locate just one target object. Thus, we take an alternate approach by directly regressing box offsets and sizes for keypoint locations in the search region. The proposed approach, dubbed SiamReg, is fully convolutional, anchor-free, lighter in weight and improves target localization. We train our framework end-to-end with a Generalized IoU loss for bounding box regression and a cross-entropy loss for target classification. We perform several experiments on standard tracking benchmarks to demonstrate the effectiveness of our approach.
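As a concrete illustration of the box-regression objective mentioned above, the following is a minimal sketch of a Generalized IoU loss for axis-aligned boxes in (x1, y1, x2, y2) format; the box format and batching are assumptions for illustration, not the exact SiamReg implementation.

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    # Areas of predicted and ground-truth boxes.
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])

    # Intersection and union.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # Smallest enclosing box.
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)

    giou = iou - (enclose - union) / (enclose + eps)
    return (1.0 - giou).mean()
```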

     

Year of completion: September 2019
Advisor: Vineet Gandhi

Related Publications

Downloads

thesis

Blending the Past and Present of Automatic Image Annotation

Ayushi Dutta

Abstract

Real-world images depict varying scenes, actions and multiple objects interacting with each other. We consider the fundamental computer vision problem of image annotation, where an image needs to be automatically tagged with a set of discrete labels that best describe its semantics. As more and more digital images become available, image annotation can help in the automatic archival and retrieval of large image collections. Being at the heart of image understanding, image annotation can also assist other visual learning tasks, such as image captioning, scene recognition and multi-object recognition. With the advent of deep neural networks, recent research has achieved ground-breaking results in single-label image classification. However, for images representing the real world, containing different objects in varying scales and viewpoints, modelling the semantic relationship between images and all of their associated labels remains a challenging problem. Additional challenges are posed by class imbalance, incomplete labelling, label ambiguity and several other issues commonly observed in image annotation datasets. In this thesis, we study the image annotation task from two aspects.

First, we bring to attention some of the core issues in the image annotation domain related to dataset properties and evaluation metrics that inherently affect the annotation performance of existing approaches to a significant extent. To examine these key aspects, we evaluate ten benchmark image annotation techniques on five popular datasets using the same baseline features, and perform thorough empirical analyses. With novel experiments, we explore possible reasons behind variations in per-label versus per-image evaluation criteria and discuss when each of these should be used. We investigate dataset-specific biases and propose new quantitative measures of the degree of image and label diversity in a dataset, which can also be useful in developing new image annotation datasets. We believe the conclusions derived from this analysis will be helpful in making systematic advancements in this domain.

Second, we attempt to address the annotation task with a CNN-RNN framework that jointly models label dependencies in an image while annotating it. We base our approach on the premise that labels corresponding to different visual concepts in an image share rich semantic relationships among them (e.g., “sky” is related to “cloud”). We follow recent works that have explored CNN-RNN style models due to the RNN’s capacity to model higher-order dependencies, but which are limited in that they train the RNN on a pre-defined label-order sequence. We overcome this limitation and propose a new method to learn multiple label prediction paths. We evaluate our proposed method on a number of popular and relevant datasets and achieve superior results compared to existing CNN-RNN based approaches. We also discuss the scope of the CNN-RNN framework in the context of image annotation.
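To make the CNN-RNN setup concrete, here is a minimal PyTorch sketch of one such model: a CNN encodes the image and an LSTM emits one label per step. The backbone, the START-token convention and the greedy decoding loop are illustrative assumptions, not the exact architecture proposed in the thesis.

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnRnnAnnotator(nn.Module):
    def __init__(self, num_labels, hidden=512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # global image feature
        self.init_h = nn.Linear(512, hidden)                       # image feature -> initial LSTM state
        self.embed = nn.Embedding(num_labels + 1, hidden)          # index 0 reserved for a START token
        self.rnn = nn.LSTMCell(hidden, hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, images, max_labels=5):
        feat = self.cnn(images).flatten(1)                         # (B, 512)
        h = torch.tanh(self.init_h(feat))
        c = torch.zeros_like(h)
        token = torch.zeros(images.size(0), dtype=torch.long,
                            device=images.device)                  # START token (assumed id 0)
        scores = []
        for _ in range(max_labels):
            h, c = self.rnn(self.embed(token), (h, c))
            logits = self.classifier(h)                            # per-step label scores
            scores.append(logits)
            token = logits.argmax(dim=1) + 1                       # feed back the predicted label
        return torch.stack(scores, dim=1)                          # (B, max_labels, num_labels)
```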

Year of completion: November 2019
Advisors: Prof. C.V. Jawahar and Yashaswi Verma

Related Publications

Downloads

thesis

ppt

Audio-Visual Speech Recognition and Synthesis

Abhishek Jha

Abstract

Understanding speech in the absence of audio, from the visual perception of lip motion, can aid a variety of computer vision applications. Systems that comprehend ‘silent speech’ present promising potential for low-bandwidth video calling and for speech transmission in acoustically noisy environments, and can aid the hearing impaired. While presenting numerous opportunities, it is highly difficult to model lips in silent speech videos by observing the lip motion of the speaker. Although developments in automatic speech recognition (ASR) have yielded better audio-speech recognition systems in the last two decades, their performance deteriorates drastically in the presence of noise. This calls for a computer vision solution to the speech understanding problem. In this thesis, we present two solutions for modelling lips in silent speech videos.

In the first part of the thesis, we propose a word-spotting solution for searching spoken keywords in silent lip videos. In this work on visual speech recognition our contributions are twofold: 1) we develop a pipeline for recognition-free retrieval and show its performance against recognition-based retrieval on a large-scale dataset and on another set of out-of-vocabulary words; 2) we introduce a query expansion technique using pseudo-relevance feedback and propose a novel re-ranking method based on maximizing the correlation between spatio-temporal landmarks of the query and the top retrieval candidates. The proposed pipeline improves baseline performance by over 35% on the word-spotting task on one of the largest lipreading corpora. We demonstrate the robustness of our method through a series of experiments, investigating domain invariance and out-of-vocabulary prediction and carefully analysing the results on the dataset. We also present qualitative results showing success and failure cases. Finally, we show an application of our method by spotting words in an archaic speech video.

In the second part of our work, we propose a lip-synchronization solution for ‘visually redubbing’ speech videos in a target language. Current methods adapt a native speech video to a foreign language either by placing subtitles in the video, which distracts the viewer, or by redubbing the audio in the target language. The latter leaves the speaker’s lip motion unsynchronized with the redubbed audio, making the video appear unnatural. In this work, we propose two lip-synchronization methods: 1) cross-accent lip-synchronization, for a change in accent of the same-language audio dubbing, and 2) cross-language lip-synchronization, for speech videos dubbed in a different language. Since the visemes remain the same in cross-accent dubbing, we propose a dynamic programming algorithm to align the visual speech from the original video with the accented speech in the target audio. In cross-language dubbing the overall linguistics change, hence we propose a lip-synthesis model conditioned on the redubbed audio. Finally, a user study is conducted, which validates our claim of a better viewing experience in comparison to baseline methods. We present an application of both methods by visually redubbing Andrew Ng’s machine learning tutorial video clips in Indian-accented English and in Hindi, respectively.

In the final part of this thesis, we propose an improved method for 2D lip-landmark localization. We investigated current landmark localization techniques in the facial domain and in human-pose estimation to discover their shortcomings when adapted to the task of lip-landmark localization. Present state-of-the-art methods consider lip landmarks as a subset of facial landmarks and hence do not explicitly optimize for them. In this work we propose a new lip-centric loss formulation on the existing stacked-hourglass architecture, which improves the baseline performance. Finally, we use the 300W and 300VW face datasets to show the performance of our methods and compare them with the baselines.

Overall, in this thesis we examined current methods of lip modelling, investigated their shortcomings and proposed solutions to overcome those challenges. We performed detailed evaluations and ablation studies of our proposed methods and report both success and failure cases. We compare our solutions with current baselines on challenging datasets, reporting quantitative results and demonstrating qualitative performance. Our proposed solutions improve the baseline performance in their respective domains.
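As an illustration of the cross-accent alignment step described above, the following is a small dynamic-programming (DTW-style) sketch that aligns visual speech frames of the original video with frames of the dubbed audio using a pairwise feature distance; the features and the cost function are assumptions for illustration, not the thesis algorithm itself.

```python
import numpy as np

def align_frames(video_feats, audio_feats):
    """video_feats: (N, D), audio_feats: (M, D). Returns a list of (video, audio) index pairs."""
    n, m = len(video_feats), len(audio_feats)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(video_feats[i - 1] - audio_feats[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # skip a video frame
                                 cost[i, j - 1],       # stretch over the audio
                                 cost[i - 1, j - 1])   # advance both streams
    # Backtrack to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```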

         

Year of completion: April 2019
Advisors: C V Jawahar, Vinay P. Namboodiri

Related Publications

• Abhishek Jha, Vinay P. Namboodiri and C.V. Jawahar - Spotting words in silent speech videos: a retrieval-based approach, Machine Vision and Applications, 2019 [PDF]

• Abhishek Jha, Vinay Namboodiri and C.V. Jawahar - Word Spotting in Silent Lip Videos, IEEE Winter Conference on Applications of Computer Vision (WACV 2018), Lake Tahoe, CA, USA, 2018 [PDF]

Downloads

thesis

On Compact Deep Neural Networks for Visual Place Recognition, Object Recognition and Visual Localization

Soham Saha

Abstract

There has been an immense increase in the use of deep neural networks in recent times due to the availability of more data and greater computing power. With their recent success, it has become a trend to use them extensively in real-time applications. However, the size of deep models can render them unusable on memory-constrained devices. In this thesis, we explore several neural network compression techniques for three separate tasks, namely i) visual place recognition, ii) object recognition and iii) visual localization. We explore explicit compression methods for the visual place recognition and object recognition tasks, achieved by modifying the learned weight matrices. Furthermore, we look at compression attained through architectural modifications to the network itself, proposing novel training procedures and new loss functions for object recognition and visual localization.

The task of visual place recognition requires us to correctly identify a place given its image, by finding images of the same place in the world (dataset). Performing this on low-memory devices such as mobile phones and robotic systems is a challenging problem. The state-of-the-art models for this task use deep learning architectures with close to 100 million parameters, which take over 400 MB of memory. This makes these models infeasible to deploy on low-memory devices and gives rise to the need to compress them. Hence, we study the effectiveness of explicit model compression techniques, namely trained quantization and pruning, on one of the most effective visual place recognition models. We show that a compressed network can be created by starting with a pre-trained model and then fine-tuning it via trained pruning and quantization. Through this training method, the compressed model is able to produce the same mAP as the original uncompressed network. We achieve almost 50% parameter reduction through pruning with no loss in mAP, and 70% reduction with close to 2% mAP reduction, while also performing trained 8-bit quantization. Furthermore, together with 5-bit quantization, we achieve about 50% parameter reduction by pruning with only about 3% reduction in mAP. The resulting compressed networks have sizes of around 30 MB and 65 MB, which makes them easily usable on memory-constrained devices.

We next move on to compression through low-rank approximation for the task of image classification. Traditional compression algorithms for deep networks perform low-rank approximation of the learned weight matrices after the training procedure has been completed. We propose instead to perform low-rank approximation during training itself, making the parameters of the approximated matrices learnable too by using a suitable loss function. We show that by using our method we are able to compress a base model providing 89% accuracy by 10x, with some loss in performance: with our compression-based training procedure, the compressed model achieves an accuracy of about 84%.

Next, we focus on developing compressed models for the object recognition task and propose a novel architecture for it. Deep neural networks for image classification typically consist of a convolutional feature extractor followed by a fully connected classifier network. The predicted and ground-truth labels are represented as one-hot vectors. Such a representation assumes that all classes are equally dissimilar. However, classes have visual similarities and often form a hierarchy. We propose an alternate architecture for the classifier network, called the Latent Hierarchy Classifier, which can discover a latent hierarchy of the classes while simultaneously reducing the number of parameters used in the original classifier. We show that, for some of the best performing architectures on the CIFAR and ImageNet datasets, our proposed alternate classifier and training procedure recover the accuracy. Our proposed method also significantly reduces the parameter complexity of the classifier: we achieve a 98% reduction in the number of parameters of the classification layer for CIFAR-100 and a 41% reduction for the ImageNet-1K dataset. We also verify that many visually similar classes are grouped together under the learnt hierarchy.

Finally, we address the problem of visual localization, where the task is to predict the camera orientation and pose of a given input scene. We propose an anchor-point classification based solution for this task using single camera images only. Our proposed three-way branching of the feature extractor into an Anchor Point Classifier, a Relative Offset Regressor and an Absolute Regressor achieves <2 m translation localization and <5° pose localization on the Cambridge Landmarks dataset, while also obtaining state-of-the-art median distance localization for orientation on all 6 scenes. Our method not only uses fewer parameters than previous deep learning based methods, but also improves on memory footprint as well as test time over nearest-neighbour based approaches.
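To illustrate the "low-rank approximation during training" idea described above, here is a minimal PyTorch sketch in which a fully connected layer is replaced by two smaller learnable factors of rank r, reducing the parameter count from roughly m·n to r·(m+n); the sizes and the rank are illustrative assumptions, not the exact configuration used in the thesis.

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        # W (out x in) is replaced by B (out x r) @ A (r x in); both factors are trained.
        self.A = nn.Linear(in_features, rank, bias=False)
        self.B = nn.Linear(rank, out_features)

    def forward(self, x):
        return self.B(self.A(x))

# Example: replacing a 4096 x 4096 classifier layer (~16.8M parameters)
# with a rank-64 factorisation (~0.5M parameters).
layer = LowRankLinear(4096, 4096, rank=64)
```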

         

Year of completion: April 2019
Advisors: C V Jawahar, Girish Varma

Related Publications

Downloads

thesis
