
Tackling Low Resolution for Better Scene Understanding


Harish Krishna

Abstract

Complete scene understanding has been an aspiration of computer vision since its very early days. It has applications in autonomous navigation, aerial imaging, surveillance, and human-computer interaction, among several other active areas of research. While many methods since the advent of deep learning have taken performance on several scene understanding tasks to respectable levels, the tasks are far from being solved. One problem that plagues scene understanding is low resolution. Convolutional Neural Networks that achieve impressive results on high-resolution images struggle when confronted with low resolution because of the inability to learn hierarchical features and the weakening of the signal with depth. In this thesis, we study low resolution and suggest approaches that can overcome its consequences on three popular tasks: object detection, in-the-wild face recognition, and semantic segmentation.

The popular object detectors were designed for, trained, and benchmarked on datasets that have a strong bias towards medium and large sized objects. When these methods are finetuned and tested on a dataset of small objects, they perform miserably. The most successful detection algorithms follow a two-stage pipeline: the first stage quickly generates regions of interest that are likely to contain the object, and the second classifies these proposal regions. We adapt both stages to the case of small objects: the first by modifying anchor box generation based on theoretical considerations, and the second using a simple-yet-effective super-resolution step.

Motivated by the success in detecting small objects, we study the problem of detecting and recognising objects with huge variations in resolution, in the setting of face recognition in semi-structured scenes. Semi-structured scenes like social settings are more challenging than regular ones: there are many more faces of vastly different scales, there are large variations in illumination, pose and expression, and the existing datasets do not capture these variations. We address the unique challenges of this setting by (i) benchmarking popular methods for the problem of face detection, and (ii) proposing a method based on resolution-specific networks to handle different scales.

Semantic segmentation is a more challenging localisation task where the goal is to assign a semantic class label to every pixel in the image. Solving such a problem is crucial for self-driving cars, where we need sharper boundaries for roads, obstacles and paraphernalia. For want of a higher receptive field and a more global view of the image, CNNs forgo resolution. This results in poor segmentation of complex boundaries and of small and thin objects. We propose prefixing a super-resolution step before semantic segmentation. Through experiments, we show that a performance boost can be obtained on the popular street-view segmentation dataset, CityScapes.
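
The anchor-adaptation idea in the first detection stage lends itself to a short illustration. The sketch below is a minimal, hypothetical Python example of dense anchor generation; the stride, scales and aspect ratios are illustrative assumptions, not the values used in the thesis.

```python
# Hypothetical sketch: tiling a denser, smaller set of anchor boxes so
# that small objects receive well-overlapping proposals. All parameter
# values here are illustrative, not the thesis's actual choices.
import itertools

def generate_anchors(feature_size, stride=8, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (cx, cy, w, h) anchors tiled over a feature map."""
    anchors = []
    h_cells, w_cells = feature_size
    for i, j in itertools.product(range(h_cells), range(w_cells)):
        cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
        for scale, ratio in itertools.product(scales, ratios):
            w = scale * (ratio ** 0.5)
            h = scale / (ratio ** 0.5)
            anchors.append((cx, cy, w, h))
    return anchors

# A finer stride and smaller scales increase the chance that a tiny
# object overlaps some anchor above the positive-matching IoU threshold.
small_object_anchors = generate_anchors((64, 64), stride=4, scales=(4, 8, 16))
```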

Year of completion: July 2018
Advisor: Prof. C V Jawahar



    Downloads

    thesis

Combining Class Taxonomies and Multi Task Learning To Regularize Fine-grained Recognition


Riddhiman Dasgupta

Abstract

Fine-grained classification is an extremely challenging problem in computer vision, made difficult by variations in shape, pose, illumination and appearance, and further compounded by subtle intra-class differences and striking inter-class similarities. While convolutional neural networks have become a versatile jack-of-all-trades tool in modern computer vision, approaches for fine-grained recognition still rely on localization of keypoints and parts to learn discriminative features for recognition. To achieve this, most approaches necessitate copious amounts of expensive manual annotations for bounding boxes and keypoints. As a result, most current methods inevitably end up as complex, multi-stage pipelines with a deluge of tunable knobs, which makes them infeasible to reproduce or deploy in any practical scenario. Since image-level annotation is prohibitively expensive for most fine-grained problems, we look at the problem from a rather different perspective and ask what might be the minimum amount of additional annotation required to obtain an improvement in performance on the challenging task of fine-grained recognition.

To tackle this problem, we aim to leverage the (taxonomic and/or semantic) relationships present among fine-grained classes. The crux of our proposed approach lies in the notion that fine-grained recognition effectively deals with subordinate-level classification, and as such, subordinate classes imply the presence of inter-class and intra-class relationships. These relationships may be taxonomical, such as super-classes, and/or semantic, such as attributes or factors, and are easily obtainable in the sense that domain expertise is needed for each fine-grained label, not for each image separately. We propose to exploit the rich latent knowledge embedded in these inter-class relationships for visual recognition. We pose the problem as multi-task learning, where each label obtained from inter-class relationships is treated as a related yet different task in a comprehensive multi-task model. Additional tasks/labels, which might be super-classes, attributes or factor-classes, act as regularizers and increase the generalization capabilities of the network. Class relationships are almost always a free source of labels that can be used as auxiliary tasks to train a multi-task loss, which is usually a weighted sum of the individual losses. Multiple tasks will try to pull the network in diverging directions, and the network must reach a common minimum by adapting and learning features common to all tasks in its shared layers.

Our main contribution is to utilize the taxonomic/semantic hierarchies among classes, where each level in the hierarchy is posed as a classification problem and solved jointly using multi-task learning. We employ a cascaded multi-task network architecture, where the output of one task feeds into the next, thus enabling transfer of knowledge from the easier tasks to the more difficult ones. To gauge the relative importance of tasks, and to apply appropriate learning rates for each task so that related tasks aid and unrelated tasks do not hamper performance on the primary task, we propose a novel task-wise dynamic coefficient which controls each task's contribution to the global objective function.

We validate our proposed methods for improving fine-grained recognition via multi-task learning using class taxonomies on two datasets: CIFAR-100, which has a simple (albeit somewhat noisy) 2-level hierarchy that we use to estimate how robust our approach is to hyperparameter sensitivities, and CUB-200-2011, which has a 4-level hierarchy and is a more challenging real-world dataset in terms of image size, which we use to see how well our approach transfers to pre-trained networks and fine-tuning. We perform ablation studies on CIFAR-100 to establish the usefulness of multi-task learning with hierarchical labels, and to measure the sensitivity of our architectures to different hyperparameters and design choices in an imperfect 2-level hierarchy. Further experiments on the popular, real-world, large-scale, fine-grained CUB-200-2011 dataset with its 4-level hierarchy re-affirm our claim that employing super-classes in an end-to-end model improves performance compared to methods employing additional expensive annotations, such as keypoints and bounding boxes, and/or multi-stage pipelines. We also demonstrate the improved generalization of our multi-task models by showing how multiple connected tasks act as regularizers, reducing the gap between training and testing errors. Additionally, we show that dynamically estimating auxiliary task relatedness and updating auxiliary task coefficients is more effective than manual hyperparameter tuning for the same purpose.
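
To make the weighted multi-task objective concrete, here is a minimal PyTorch-style sketch, assuming one classification head per hierarchy level. The dynamic-coefficient update shown is a simplified stand-in for the thesis's task-wise coefficient, not its exact formulation.

```python
# Sketch of a multi-task loss as a weighted sum over hierarchy levels,
# with per-task coefficients updated from recent loss trends instead of
# hand tuning. Illustrative only; not the thesis's exact method.
import torch

def multitask_loss(logits_per_task, targets_per_task, coefficients):
    """Weighted sum of per-level cross-entropy losses."""
    ce = torch.nn.functional.cross_entropy
    losses = [ce(logits, target)
              for logits, target in zip(logits_per_task, targets_per_task)]
    total = sum(c * l for c, l in zip(coefficients, losses))
    return total, [l.item() for l in losses]

def update_coefficients(coefficients, losses, prev_losses, lr=0.1):
    """Toy dynamic weighting: raise the weight of tasks whose loss is
    still improving, dampen plateaued ones, then renormalize. A crude
    stand-in for the thesis's task-wise dynamic coefficient."""
    new = [max(0.0, c + lr * (p - l) / max(p, 1e-8))
           for c, l, p in zip(coefficients, losses, prev_losses)]
    s = sum(new) or 1.0
    return [c / s for c in new]
```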

Year of completion: July 2018
Advisor: Prof. Anoop M Namboodiri

Related Publications

• Riddhiman Dasgupta and Anoop Namboodiri - Leveraging multiple tasks to regularize fine-grained classification, 23rd International Conference on Pattern Recognition (ICPR), IEEE, 2016. [PDF]

• Koustav Ghosal, Ameya Prabhu, Riddhiman Dasgupta, Anoop M. Namboodiri - Learning Clustered Sub-spaces for Sketch-based Image Retrieval, Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition, 03-06 Nov 2015, Kuala Lumpur, Malaysia. [PDF]


Downloads

thesis

Cognitive Vision: Examining Attention, Engagement and Cognitive Load via Gaze and EEG


Viral Parekh

Abstract

Gaze and visual attention are closely related. Analysis of gaze and attention can be used for behavior analysis, anticipation, or prediction of a person's engagement level. Collectively, these problems fall into the space of cognitive vision systems. As the name suggests, this is the intersection of two areas: computer vision and cognition. The goal of a cognitive vision system is to understand the principles of human vision and use them as inspiration to improve machine vision systems. In this thesis, we focus on eye gaze and electroencephalogram (EEG) data to understand and analyze attention and cognitive workload, and demonstrate applications such as engagement analysis and image annotation.

With the presence of ubiquitous devices in our daily lives, effectively capturing and managing user attention becomes a critical device requirement. Gaze-lock detection, sensing eye contact with a device, is a useful technique to track a user's interaction with the device. We propose an eye-contact detection method based on a convolutional neural network (CNN) architecture, which achieves superior eye-contact detection performance compared to state-of-the-art methods with minimal data pre-processing; our algorithm is furthermore validated on multiple datasets. Gaze-lock detection is improved by combining head pose and eye-gaze information, consistent with the social attention literature.

Further, we extend our work to analyze the engagement level of persons with dementia via visual attention. Engagement in dementia is typically measured using behavior observational scales (BOS) that are tedious, involve intensive manual labor to annotate, and are therefore not easily scalable. We propose AVEID, a low-cost and easy-to-use video-based engagement measurement tool to determine the level of engagement of a person with dementia (PwD) when interacting with a target object. We show that the objective behavioral measures computed via AVEID correlate well with subjective expert impressions for the popular MPES and OME BOS, confirming its viability and effectiveness. Moreover, AVEID measures can be obtained for a variety of engagement designs, thereby facilitating large-scale studies with PwD populations.

Analysis of the cognitive load imposed by a given user interface is an important measure of its effectiveness or usability. We examine whether EEG-based cognitive load estimation is generalizable across character, spatial pattern, bar graph and pie chart-based visualizations for the n-back task. Cognitive load is estimated via two recent approaches: (1) deep convolutional neural networks and (2) proximal support vector machines. Experiments reveal that cognitive load estimation suffers across visualizations, suggesting that (a) they may inherently induce varied cognitive processes in users, and (b) effective adaptation techniques are needed to benchmark visual interfaces for usability on pre-defined tasks.

Finally, the success of deep learning in computer vision has greatly increased the need for annotated image datasets. We propose an EEG-based image annotation system. While humans can recognize objects in 20-200 milliseconds, the need to manually label images results in a low annotation throughput. Our system employs brain signals captured via a consumer EEG device to achieve an annotation rate of up to 10 images per second. We exploit the P300 event-related potential (ERP) signature to identify target images during a rapid serial visual presentation (RSVP) task. We further perform unsupervised outlier removal to achieve an F1-score of 0.88 on the test set. The proposed system does not depend on category-specific EEG signatures, enabling the annotation of any new image category without any model pre-training.
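
As a rough illustration of the P300-based RSVP idea, here is a minimal NumPy sketch; the epoch shape, sampling rate, response window and outlier threshold are all assumed values for illustration, not the system's actual parameters.

```python
# Minimal sketch of P300-flavored target detection in an RSVP stream,
# assuming pre-epoched EEG of shape (n_images, n_channels, n_samples).
import numpy as np

def detect_targets(epochs, sfreq=250.0, window=(0.25, 0.5), z_thresh=2.0):
    """Flag epochs whose mean amplitude in the post-stimulus window
    (roughly where a P300 peaks) is an outlier across the stream."""
    lo, hi = (int(t * sfreq) for t in window)
    scores = epochs[:, :, lo:hi].mean(axis=(1, 2))  # one score per image
    z = (scores - scores.mean()) / (scores.std() + 1e-8)
    return z > z_thresh                             # boolean target mask

epochs = np.random.randn(100, 32, 256)              # fake 1-second epochs
print(detect_targets(epochs).sum(), "candidate target images")
```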

     

Year of completion: July 2018
Advisors: Prof. C.V. Jawahar and Ramanathan Subramanian

Related Publications

• Viral Parekh, Ramanathan Subramanian, Dipanjan Roy and C.V. Jawahar - An EEG-based Image Annotation System, National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2017. [PDF]


Downloads

thesis

Exemplar based approaches on Face Fiducial Detection and Frontalization


Mallikarjun BR

Abstract

Computer vision solutions such as face detection and recognition, facial reenactment, facial expression analysis and gender detection have seen fruitful applications in domains such as security, surveillance, social media and animation. Many of these solutions share common pre-processing steps such as fiducial detection, appearance modeling and face structure modeling. These steps can be considered fundamental problems to be solved in building any computer vision solution concerning face images. In this thesis, we propose exemplar based approaches to two such fundamental problems: face fiducial detection and face frontalization. Exemplar based approaches have been shown to work in various computer vision problems, such as object detection, image inpainting, object removal, action recognition and gesture recognition. This approach directly utilizes the information residing in the examples to achieve a certain objective, instead of building a single model that represents all the examples, and has been shown to be effective.

Face fiducial detection involves detecting key points on the face such as eye corners, the nose tip and mouth corners. It is one of the main pre-processing steps for face recognition, facial animation, gender detection, gaze identification and expression recognition systems. A number of different approaches, including active shape models, regression based methods, cascaded neural networks, tree based methods and exemplar based approaches, have been proposed in the recent past. Many of these algorithms address only part of the problems in this area. We propose an exemplar based approach which takes advantage of the complementarity of different approaches and obtains consistently superior performance over the state-of-the-art methods. We provide extensive experiments over three popular datasets.

Face frontalization is the process of synthesizing a frontal view of the face given a non-frontal view. Frontalization can be used in intelligent photo editing tools and also aids in improving the accuracy of face recognition systems. Previously proposed methods involve estimating a 3D model of the face or assuming a generic one. Estimating an accurate 3D model of the face is not a completely solved problem, and assuming a generic 3D model results in the loss of crucial shape cues. We propose an exemplar based approach which does not require a 3D model of the face. We show that our method is efficient and performs consistently better than other approaches.
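
The consensus flavor of the exemplar idea can be sketched in a few lines. This toy example, with hypothetical shapes and names, simply takes a robust per-point median over several exemplar-driven hypotheses; it is not the thesis's actual algorithm.

```python
# Toy sketch of exemplar consensus: several detectors (or exemplar-
# matched templates) each predict the same K fiducials, and a robust
# coordinate-wise median forms the consensus landmark set.
import numpy as np

def consensus_fiducials(predictions):
    """predictions: array of shape (n_exemplars, n_points, 2).
    The per-point median resists outlier exemplar hypotheses."""
    return np.median(predictions, axis=0)

preds = np.random.rand(5, 68, 2) * 100   # five 68-point hypotheses
landmarks = consensus_fiducials(preds)   # (68, 2) consensus shape
```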

     

Year of completion: May 2017
Advisor: Prof. C V Jawahar

Related Publications

• Mallikarjun B R, Visesh Chari, C. V. Jawahar, Akshay Asthana - Face Fiducial Detection by Consensus of Exemplars, Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2016. [PDF]

• Mallikarjun B.R., C.V. Jawahar - Efficient Face Frontalization in Unconstrained Images, Proceedings of the Fifth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG 2015), 16-19 Dec 2015, Patna, India.


Downloads

thesis

Visual Perception Based Assistance for Fundus Image Readers


Rangrej Bharat

Abstract

Diabetic Retinopathy (DR) is a condition in which individuals with diabetes develop a disease in the inner wall of the eye known as the retina. DR is a major cause of visual impairment, and early detection can prevent vision loss. The use of automatic systems for DR diagnosis is limited due to their lower accuracy. As an alternative, reading centers are becoming popular in real-world scenarios. A reading center is a facility where retinal images coming from various sources are stored and trained personnel (who might not be experts) analyze them. In this thesis, we look at techniques to increase the efficiency of DR image readers working in reading centers.

The first half of this thesis aims at identifying an efficient image-reading technique that is both fast and accurate. Towards this end, we conducted an eye-tracking study with medical experts while they were reading images for DR diagnosis. The analysis shows that experts employ mainly two types of reading strategies: dwelling and tracing. The dwelling strategy appears to be accurate and faster than the tracing strategy. Eye movements of all the experts are combined in a novel way to extract an optimal image scanning strategy, which can be recommended to image readers for efficient diagnosis.

To increase efficiency further, we propose a technique in which the saliency of lesions is boosted for better visibility. This is named the Assistive Lesion Emphasis System (ALES) and is demonstrated in the second half of the thesis. ALES is developed as a two-stage system: saliency detection and lesion emphasis. Two biologically inspired saliency models, which mimic the human visual system, are designed using unsupervised and supervised techniques. The unsupervised saliency model is inspired by the human visual system and achieved 10% higher recall than other existing saliency models when compared with the average gaze map of 15 retinal experts. The supervised saliency model is a deep learning based implementation of the biologically inspired saliency model proposed by Itti and Koch (Itti, L., Koch, C. and Niebur, E., 1998) and achieves 10% to 20% higher AUC than existing saliency models when compared with manual markings. Saliency maps generated by these models are used to boost the prominence of lesions locally. This is done using two types of selective enhancement techniques: one uses multiscale fusion of the saliency map with the original image, and the other uses spatially varying gamma correction; both increase the contrast-to-noise ratio (CNR) of lesions by 30%. One saliency model and one selective-enhancement technique are combined to illustrate two complete ALES pipelines. DR diagnosis performed by analyzing ALES output with the optimal strategy should presumably be both faster and more accurate.
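
The spatially varying gamma correction admits a compact illustration. The sketch below, with assumed parameter values, maps each pixel through a gamma chosen from its saliency; it is only a stand-in for the thesis's actual formulation.

```python
# Illustrative saliency-driven, spatially varying gamma correction:
# salient pixels get gamma < 1 (brightened/emphasized), non-salient
# pixels stay near gamma = 1. Parameter values are assumptions.
import numpy as np

def saliency_gamma(image, saliency, gamma_min=0.6, gamma_max=1.0):
    """image, saliency: float arrays in [0, 1] with the same H x W shape."""
    gamma = gamma_max - (gamma_max - gamma_min) * saliency  # per-pixel gamma
    return np.clip(image ** gamma, 0.0, 1.0)

img = np.random.rand(128, 128)
sal = np.random.rand(128, 128)            # stand-in saliency map
enhanced = saliency_gamma(img, sal)
```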

     

Year of completion: July 2017
Advisor: Prof. Jayanthi Sivaswamy

Related Publications

• Karthik G, Rangrej S and Jayanthi Sivaswamy - A deep learning framework for segmentation of retinal layers from OCT images, ACPR, Nanjing. [PDF]

• Rangrej Bharat and Jayanthi Sivaswamy - ALES: an assistive system for fundus image readers, Journal of Medical Imaging, Vol. 4, Issue 2, April 2017. [PDF]

• Samrudhdhi B. Rangrej and Jayanthi Sivaswamy - A biologically inspired saliency model for color fundus images, Proceedings of the Tenth Indian Conference on Computer Vision, Graphics and Image Processing, ACM, 2016. [PDF]


Downloads

thesis
