
Retinal Image Synthesis


Anurag Anil Deshmukh

Abstract

Medical imaging has been aiding the diagnosis and treatment of diseases by creating visual representations of the interior of the human body. Experts hand-mark these images for abnormalities and diagnosis, and supplementing experts with such rich visualizations has enabled detailed clinical analysis and rapid medical intervention. However, deep learning-based methods rely on large volumes of data for training. Procuring data for medical imaging applications is especially difficult because abnormal cases are, by definition, rare and the data generally requires experts for labelling. With deep learning algorithms, data with high class imbalance or insufficient variability leads to poor classification performance. Thus, alternative approaches such as generative modelling to artificially create more data have been of interest. Most of these methods are GAN [11] based approaches. While they can help with data imbalance, they still require a lot of data to generate realistic images. Additionally, many of these methods have been demonstrated on natural images, where the images are relatively noise-free and small artifacts are not as damaging. This thesis therefore aims at providing synthesis methods that overcome the limitations of small datasets and noisy image profiles. We do this for two different modalities: fundus imaging and Optical Coherence Tomography (OCT). First, we present a fundus image synthesis method aimed at providing paired image and Optic Cup (OC) data for OC segmentation. The method works well on small datasets by minimising the information the network has to learn: it leverages domain-specific knowledge and provides most of the structural information to the network. We demonstrate this method's advantages over a more direct synthesis method and show how leveraging domain-specific knowledge yields higher-quality images and annotations. Including these generated images and their annotations in the training of an OC segmentation model showed a significant improvement in performance, demonstrating their reliability. Second, we present a novel unpaired image-to-image translation method that can introduce an abnormality (drusen) into OCT images while avoiding artifacts and preserving the noise profile. Comparison with other state-of-the-art image-to-image translation methods shows that our method is significantly better at preserving the noise profile and at generating morphologically accurate structures.
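For context, the unpaired image-to-image translation methods compared against in the second part typically rely on a cycle-consistency objective. The sketch below is a generic, minimal version of such an objective, not the thesis's method; the generator and discriminator modules G_ab, G_ba and D_b are hypothetical placeholders.

import torch
import torch.nn.functional as F

def unpaired_translation_loss(G_ab, G_ba, D_b, real_a, lambda_cyc=10.0):
    # Translate a normal image (domain A) towards the abnormal domain B,
    # then map it back and penalise any loss of content.
    fake_b = G_ab(real_a)                 # A -> B translation
    rec_a = G_ba(fake_b)                  # B -> A reconstruction
    adv = F.mse_loss(D_b(fake_b), torch.ones_like(D_b(fake_b)))  # LSGAN-style generator term
    cyc = F.l1_loss(rec_a, real_a)        # cycle-consistency term preserves content
    return adv + lambda_cyc * cyc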

Year of completion: April 2021
Advisor: Jayanthi Sivaswamy


Leveraging Structural Cues for Better Training and Deployment in Computer Vision


Shyamgopal Karthik

Abstract

With the growing use of computer vision tools in wide-ranging applications, it becomes imperative to understand and resolve issues in computer vision models when they are used in production settings. In particular, it is essential to understand that a model can be wrong quite frequently during deployment, and developing a better understanding of the mistakes it makes can help mitigate and handle them without catastrophic consequences. To investigate the severity of mistakes, we first explore this in a simple classification setting. Even in this setting, the severity of mistakes is difficult to quantify, especially since manually defining pairwise costs does not scale well for large-scale classification datasets. Therefore most works have used class taxonomies/hierarchies, which allow pairwise costs to be defined using graph distances. There has been increasing interest in building deep hierarchy-aware classifiers, aiming to quantify and reduce the severity of mistakes rather than just count the number of errors. However, most of these works require the hierarchy to be available during training and cannot adapt to new hierarchies, or even small modifications to the existing hierarchy, without re-training the model. We explore a different direction for hierarchy-aware classification: amending mistakes by making post-hoc corrections using classical Conditional Risk Minimization (CRM). Surprisingly, we find that this is a far more suitable alternative than deep hierarchy-aware classification; CRM preserves the base model's top-1 accuracy, brings the most likely predictions of the model closer to the ground truth, and provides reliable probability estimates, unlike hierarchy-aware classifiers. We firmly believe that this serves as a very strong and useful baseline for future exploration in this direction. We then turn our attention to a crucial problem in many video processing pipelines: visual (single) object tracking. In particular, we explore the long-term tracking scenario where, given a target in the first frame of the video, the goal is to track the object throughout a (long) video during which the object may undergo occlusion, vary in appearance, or go out of view. The temporal aspect of videos also makes this an ideal scenario to understand the accumulation of errors that would not otherwise be seen if every image were independent. We hypothesize that there are three crucial abilities a tracker must possess to be effective in the long-term tracking scenario, namely Re-Detection, Recovery, and Reliability: the tracker must be able to re-detect the target when it goes out of the scene and returns, must recover from failure, and must track the object contiguously to be of practical utility. We propose a set of novel and comprehensive experiments to examine each of these aspects, giving a thorough understanding of the strengths and limitations of various state-of-the-art tracking algorithms. We finally visit the problem of multi-object tracking. Unlike single-object tracking, where the target is initialized in the first frame, the goal here is to track all objects of a particular category (such as pedestrians, vehicles, animals, etc.). Since this problem does not require user initialization, it has found use in wide-ranging real-time applications such as autonomous driving. The typical multi-object tracking pipeline follows the tracking-by-detection paradigm: an object detector is first used to detect all the objects in the scene, and these detections are linked together to form the final trajectories using a combination of spatio-temporal features and appearance/Re-Identification (ReID) features. The appearance features are extracted using a Convolutional Neural Network (CNN) trained on a corpus of labelled videos. Our central insight is that only the appearance model requires labelled videos; the rest of the pipeline can be trained with just image-level supervision. Inspired by recent successes in unsupervised contrastive learning, which enforces similarity in feature space between an image and its augmented version, we resort to a simple method that leverages the spatio-temporal consistency in videos to generate "natural" augmentations, which are then used as pseudo-labels to train the appearance model. When integrated into the overall tracking pipeline, we find that this unsupervised appearance model can match the performance of its supervised counterparts in reducing the identity switches present in the trajectories, thereby avoiding costly video annotations that are impractical to scale up, without sacrificing performance.
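The post-hoc correction described above can be illustrated with a small sketch. Assuming the pairwise costs have already been derived from a class hierarchy (for example, as graph distances between leaves), CRM simply replaces the argmax decision with a minimum-expected-cost decision over the base model's softmax output; the toy cost matrix and probabilities below are illustrative, not taken from the thesis.

import numpy as np

def crm_predict(probs, cost):
    # Pick argmin_k of R(k) = sum_y cost[k, y] * p(y | x),
    # instead of argmax_k p(k | x) used by the base classifier.
    expected_risk = cost @ probs          # shape: (num_classes,)
    return int(np.argmin(expected_risk))

# Toy example: 3 classes where confusing class 0 with class 2 is very costly.
cost = np.array([[0., 1., 4.],
                 [1., 0., 3.],
                 [4., 3., 0.]])
probs = np.array([0.40, 0.35, 0.25])      # base model's softmax output
print(crm_predict(probs, cost))           # returns 1, whereas argmax(probs) is 0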

Year of completion: April 2021
Advisor: Vineet Gandhi


Human head pose and emotion analysis


Aryaman Gupta

Abstract

Scene analysis has been a topic of great interest in computer vision, and humans are the most important and most complex subjects involved in it. Humans exhibit different forms of expression and behaviour when interacting with their environment. These interactions have been studied for a long time, and various challenges and tasks have been identified to interpret them. We focus on two tasks in particular: head pose estimation and emotion recognition. Head pose is an important means of non-verbal human communication and thus a crucial element in understanding human interaction with the environment. Head pose estimation allows a robot to estimate the region of focus of attention of an individual. It requires learning a model that computes the intrinsic Euler angles of the pose (yaw, pitch, roll) from an input image of the human face. Annotating ground-truth head pose angles for images in the wild is difficult and requires ad-hoc fitting procedures (which provide only coarse and approximate annotations). This highlights the need for approaches that can train on data captured in a controlled environment and generalize to images in the wild (with varying appearance and illumination of the face). Most present-day deep learning approaches, which learn a regression function directly on the input images, fail to do so. To this end, we propose to use a higher-level representation to regress the head pose with deep learning architectures. More specifically, we use uncertainty maps in the form of 2D soft localization heatmap images over five facial keypoints, namely the left ear, right ear, left eye, right eye and nose, and pass them through a convolutional neural network to regress the head pose. We show head pose estimation results on two challenging benchmarks, BIWI and AFLW, and our approach surpasses the state of the art on both datasets. We also propose a synthetically generated dataset for head pose estimation. Emotions are fundamental to human lives and decision-making, and human emotion detection can help in understanding mood, intent, or choice of action. Recognizing emotions from images or video accurately is not easy even for humans, and for machines it is even more challenging, as humans express their emotions in different forms and there is a lack of temporal boundaries between emotions. Facial expression recognition has remained a challenging and interesting problem in computer vision. Despite efforts in developing various methods for facial expression recognition, existing approaches lack generalizability when applied to unseen images or those captured in the wild (i.e. the results are not significant). We propose the use of soft localization heatmap images of facial action units for facial expression recognition. To account for the lack of a large, well-labelled dataset, we propose a method for automated spectrogram annotation in which two modalities used by humans to express emotion (visual and textual) are used to label a third modality (speech) for emotion recognition.
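As an illustration of the input representation described above, the following sketch builds the five soft localization heatmaps from keypoint coordinates. The map size, sigma, and keypoint naming are assumptions for illustration, and the CNN regressor that consumes the stacked maps is omitted.

import numpy as np

KEYPOINTS = ["left_ear", "right_ear", "left_eye", "right_eye", "nose"]

def keypoint_heatmaps(points, size=64, sigma=3.0):
    # points: dict mapping keypoint name -> (x, y) pixel coordinates.
    # Returns a (5, size, size) array of Gaussian uncertainty maps,
    # one channel per facial keypoint.
    ys, xs = np.mgrid[0:size, 0:size]
    maps = np.zeros((len(KEYPOINTS), size, size), dtype=np.float32)
    for i, name in enumerate(KEYPOINTS):
        x, y = points[name]
        maps[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps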

Year of completion: March 2021
Advisor: Vineet Gandhi


Super-resolution of Digital Elevation Models With Deep Learning Solutions


Kubade Ashish Ashokrao

Abstract

Terrain, representing the features of the earth's surface, plays a crucial role in many applications such as simulations, hazard prevention and mitigation planning, route planning, analysis of surface dynamics, computer graphics-based games, entertainment, and films, to name a few. With recent advancements in digital technology, these applications demand high-resolution detail in the terrain. However, currently available public datasets, which provide terrain scans in the form of Digital Elevation Models (DEMs), have low resolution compared with the terrain information available in other modalities such as aerial images. Publicly available DEM datasets for most parts of the world have a resolution of 30 m, whereas aerial or satellite images are available at a resolution of 50 cm. The cost involved in capturing such high-resolution DEMs (HRDEMs) is a major hurdle in making them available in the public domain. This motivates us to provide a software solution for generating high-resolution DEMs from existing low-resolution DEMs (LRDEMs). In the natural image domain, super-resolution has set higher benchmarks by incorporating deep learning-based solutions. Despite this tremendous success in image super-resolution, very few works have applied these powerful systems to DEMs to generate HRDEMs, and some of them require additional modalities such as aerial or satellite images or temporal sequences of DEMs to generate high-resolution terrains. However, the applicability of these methods is highly subject to the available input formats. In this research effort, we explore a new direction in DEM super-resolution by using feedback neural networks. Availing the capability of feedback neural networks to refine the features learned by shallow layers of the network, we design DSRFB, a DEM super-resolution architecture that generates high-resolution DEMs with a super-resolution factor of 8X with minimal input. Our experiments on the Pyrenees and Tyrol mountain range datasets show that DSRFB performs close to the state of the art without using information from any additional modalities such as aerial images. We further study the limitations of DSRFB, which primarily occur for highly degraded low-resolution input, where the major structures are entirely lost and reconstruction becomes challenging. In such cases, it becomes necessary to avail elevation cues from alternate sources of information. To utilize such information from other modalities, we adopt the attention mechanism from the natural language processing (NLP) domain and integrate it into the feedback network to present the Attentional Feedback Module (AFM). Our proposed network, the Attentional Feedback Network (AFN), with AFM as a backbone, outperforms state-of-the-art methods by a margin of up to 7.2%. We also emphasize the reconstruction of structures across patch boundaries: while generating an HRDEM by splitting large DEM tiles into patches, we propose to use overlapping tiles and generate an aggregated response to dilute the artefacts due to structural discontinuities. To summarize, in this research we propose two methods, DSRFB and AFN, to generate a high-resolution DEM from an existing low-resolution DEM. While DSRFB achieves near state-of-the-art performance, coupling DSRFB with the attention mechanism (i.e., AFN) outperforms state-of-the-art methods.
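The overlapped-tile aggregation mentioned above can be sketched as follows; the patch size, stride, and the model callable are illustrative assumptions, not the exact configuration used for DSRFB or AFN.

import numpy as np

def tiled_super_resolution(dem, model, patch=64, stride=32, scale=8):
    # dem: (H, W) low-resolution elevation grid; model maps a (patch, patch)
    # tile to a (patch*scale, patch*scale) tile. Overlapping predictions are
    # averaged so that structures across patch boundaries stay consistent.
    H, W = dem.shape
    out = np.zeros((H * scale, W * scale), dtype=np.float64)
    weight = np.zeros_like(out)
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            sr_tile = model(dem[r:r + patch, c:c + patch])
            rs, cs = r * scale, c * scale
            out[rs:rs + patch * scale, cs:cs + patch * scale] += sr_tile
            weight[rs:rs + patch * scale, cs:cs + patch * scale] += 1.0
    return out / np.maximum(weight, 1e-8)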

Year of completion: March 2021
Advisors: Avinash Sharma, K S Rajan


Enhancing OCR Performance with Low Supervision


Deepayan Das

Abstract

Over the last decade, tremendous emphasis has been laid on the collection and digitization of a vast number of books, leading to the creation of so-called 'Digital Libraries'. Projects like Google Books and Project Gutenberg have made significant progress in digitizing millions of books and making them available to the public. Efforts have also been made for Indic languages, where the task of identifying and recognizing books in several Indian languages has been undertaken by the National Digital Library of India. The advantages of digital libraries are manifold: digitization of ancient manuscripts ensures the preservation of knowledge and promotes research; books in digital libraries are indexed, which facilitates easy search and retrieval; and they are easy to store and require less maintenance effort than their physical counterparts. One of the most important steps in the digitization effort is the recognition and conversion of physical pages into editable text using an OCR. Off-the-shelf OCR systems are available, such as Tesseract and ABBYY FineReader; however, the ability of an OCR to recognize text without committing too many errors depends heavily on the print quality of the pages as well as the font style of the type-written text. A pre-trained OCR will invariably make errors on pages whose distribution differs, in terms of fonts and print quality, from the pages on which it was trained. If the domain gap is too large, the number of error words will be high, requiring significant correction effort. Since the books need to be indexed, one cannot afford too many word errors in the OCR-recognized pages. Thus, a major effort must be spent on correcting the error words misclassified by the OCR, and manually correcting each isolated error word would incur a huge cost and is infeasible. In this thesis, we look at methods to improve OCR accuracy with minimal human involvement, and propose two approaches. In the first approach, we strive to improve OCR performance via an efficient post-processing technique in which similar erroneous words are grouped and corrected simultaneously. We argue that, since a book has a common underlying theme, it will contain many word repetitions. These word co-occurrences can be exploited by grouping similar error words and correcting them in batches. We propose a novel clustering scheme which combines features from both the word images and their text transcriptions to group erroneous word predictions. The grouped error predictions can then be corrected either automatically or with the help of a human annotator. We show via experimental verification that automatic correction of error word batches might not be the most efficient way to correct the errors, and that employing a human annotator to verify the error word clusters is a more systematic way to address the issue. Next, we look at the problem of adapting an OCR to a new dataset without requiring too many annotated pages. Traditional practice dictates fine-tuning the existing OCR on a portion of the target data; however, even annotating a portion of the data to create image-label pairs can be costly. We therefore employ a self-training approach where the OCR is fine-tuned on its own predictions on the target dataset. To curtail the effects of noise present in the predictions, we include in the training set only those samples on which the model is sufficiently confident. We also show that, by employing various regularization strategies, we can outperform traditional fine-tuning without any additional labelled data, and that combining self-training with fine-tuning achieves the maximum gain in OCR accuracy across all the datasets. We furnish thorough empirical evidence to support all our claims.
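A minimal sketch of the confidence-filtered self-training step described above follows; the predict interface returning a (text, confidence) pair and the threshold value are assumptions for illustration, not the thesis's exact implementation.

def build_pseudo_labelled_set(ocr_model, unlabelled_images, threshold=0.9):
    # Keep only (image, predicted_text) pairs on which the model is
    # sufficiently confident, to limit the noise fed back into training.
    pseudo = []
    for image in unlabelled_images:
        text, confidence = ocr_model.predict(image)
        if confidence >= threshold:
            pseudo.append((image, text))
    return pseudo

The retained pairs are then used to fine-tune the OCR, optionally mixed with any available labelled pages, and the loop can be repeated for a few rounds.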

Year of completion: March 2021
Advisor: C V Jawahar

