Human head pose and emotion analysis


Aryaman Gupta

Abstract

Scene analysis has long been a topic of great interest in computer vision, and humans are its most important and most complex subjects. Humans exhibit many forms of expression and behaviour in interacting with their environment; these interactions have been studied for a long time, and various challenges and tasks have been identified to interpret them. We focus on two tasks in particular: head pose estimation and emotion recognition. Head pose is an important means of non-verbal human communication and thus a crucial element in understanding how humans interact with their environment. Head pose estimation allows a robot to estimate an individual's region of focus of attention. It requires learning a model that computes the intrinsic Euler angles of the pose (yaw, pitch, roll) from an input image of the human face. Annotating ground-truth head pose angles for images in the wild is difficult and requires ad-hoc fitting procedures, which provide only coarse and approximate annotations. This highlights the need for approaches that can train on data captured in a controlled environment and generalize to images in the wild (with varying facial appearance and illumination). Most present-day deep learning approaches, which learn a regression function directly on the input images, fail to do so. To this end, we propose to regress the head pose from a higher-level representation while still using deep learning architectures. More specifically, we use uncertainty maps in the form of 2D soft localization heatmap images over five facial keypoints, namely the left ear, right ear, left eye, right eye and nose, and pass them through a convolutional neural network to regress the head pose. We show head pose estimation results on two challenging benchmarks, BIWI and AFLW, and our approach surpasses the state of the art on both datasets. We also propose a synthetically generated dataset for head pose estimation.
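The soft localization representation described above can be sketched as follows: each detected facial keypoint is rendered as a 2D Gaussian heatmap, and the five stacked channels replace the raw face image as the CNN input. The heatmap size, the Gaussian spread sigma, and the keypoint ordering below are illustrative assumptions, and the pose-regression CNN itself is omitted.

```python
import math

# One channel per facial keypoint; order is an illustrative assumption.
KEYPOINTS = ["left_ear", "right_ear", "left_eye", "right_eye", "nose"]

def keypoint_heatmap(center, size=64, sigma=3.0):
    """Render one 2D Gaussian 'soft localization' heatmap for a keypoint.

    center: (x, y) pixel location of the keypoint (e.g. the nose tip).
    Returns a size x size grid of floats in (0, 1], peaking at `center`.
    """
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(size)]
            for y in range(size)]

def heatmap_stack(detections, size=64):
    """Stack the five keypoint channels into one CNN input.

    detections: dict mapping keypoint name -> (x, y). Missing keypoints
    (e.g. an ear occluded by a large yaw) get an all-zero channel, which
    itself is a pose cue for the network.
    """
    zero = [[0.0] * size for _ in range(size)]
    return [keypoint_heatmap(detections[k], size) if k in detections else zero
            for k in KEYPOINTS]
```

A pose regressor would then consume this 5-channel stack and output (yaw, pitch, roll); that head is not shown here.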
Emotions are fundamental to human lives and decision-making. Detecting human emotion can help in understanding a person's mood, intent or choice of action. Recognizing emotions accurately from images or video is not easy even for humans, and it is even more challenging for machines, as humans express their emotions in different forms and there are no clear temporal boundaries between emotions. Facial expression recognition has remained a challenging and interesting problem in computer vision. Despite the many methods developed for it, existing approaches generalize poorly to unseen images or to images captured in the wild. We propose using soft localization heatmap images of facial action units for facial expression recognition. To account for the lack of a large, well-labelled dataset, we also propose a method for automated spectrogram annotation, in which two modalities used by humans to express emotion (visual and textual) are used to label a third modality (speech) for emotion recognition.

Year of completion: March 2021
Advisor: Vineet Gandhi

Super-resolution of Digital Elevation Models With Deep Learning Solutions


Kubade Ashish Ashokrao

Abstract

Terrain, representing the features of the earth's surface, plays a crucial role in many applications such as simulations, hazard prevention and mitigation planning, route planning, analysis of surface dynamics, computer-graphics-based games, entertainment and films, to name a few. With recent advancements in digital technology, these applications demand high-resolution detail in the terrain. However, currently available public datasets, which provide terrain scans in the form of Digital Elevation Models (DEMs), have low resolution compared with the terrain information available in other modalities such as aerial images. Publicly available DEM datasets for most parts of the world have a resolution of 30 m, whereas aerial or satellite images are available at a resolution of 50 cm. The cost involved in capturing such high-resolution DEMs (HRDEMs) is a major hurdle to making them available in the public domain. This motivates us to provide a software solution for generating high-resolution DEMs from existing low-resolution DEMs (LRDEMs). In the natural image domain, super-resolution has set higher benchmarks by incorporating deep-learning-based solutions. Despite this tremendous success in image super-resolution, very few works have applied these powerful systems to DEMs to generate HRDEMs. A few of them used additional modalities such as aerial or satellite images, or temporal sequences of DEMs, to generate high-resolution terrains; however, the applicability of these methods depends heavily on the available input formats. In this research effort, we explore a new direction in DEM super-resolution by using feedback neural networks.
Exploiting the capability of feedback neural networks to refine the features learned by the shallow layers of the network, we design DSRFB, a DEM super-resolution architecture that generates high-resolution DEMs with a super-resolution factor of 8x from minimal input. Our experiments on the Pyrenees and Tyrol mountain range datasets show that DSRFB performs close to the state of the art without using information from any additional modalities such as aerial images. DSRFB's limitations arise primarily with highly degraded low-resolution input, where the major structures are entirely lost and reconstruction becomes challenging; in such cases it becomes necessary to draw elevation cues from alternate sources of information. To utilize such information from other modalities, we borrow the attention mechanism from the natural language processing (NLP) domain and integrate it into the feedback network as an Attentional Feedback Module (AFM). Our proposed network, the Attentional Feedback Network (AFN), with AFM as its backbone, outperforms the state-of-the-art methods by a best margin of 7.2%. We also emphasize the reconstruction of structures across patch boundaries: when generating an HRDEM by splitting large DEM tiles into patches, we propose to use overlapping tiles and generate an aggregated response to dilute the artefacts caused by structural discontinuities. To summarize, in this research we propose two methods, DSRFB and AFN, to generate a high-resolution DEM from an existing low-resolution DEM. While DSRFB achieves near state-of-the-art performance, coupling DSRFB with the attention mechanism (i.e., AFN) outperforms the state-of-the-art methods.
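The overlapped-tile aggregation described above can be sketched independently of the network: patches are enhanced one by one, and the overlapping responses are averaged so that patch-boundary artefacts are diluted. The `enhance` callable is a stand-in for the DSRFB/AFN network, which is not reproduced here, and the patch geometry is assumed to tile the input exactly.

```python
def aggregate_overlapping(dem, patch, stride, enhance):
    """Split a 2D DEM into overlapping patches, enhance each patch
    independently, and average the overlapping responses.

    dem: 2D list of elevation values; patch: patch side length;
    stride < patch gives the overlap; enhance: per-patch function
    (placeholder for the super-resolution network). Assumes
    (len(dem) - patch) and (len(dem[0]) - patch) are multiples of
    stride, so every pixel is covered by at least one patch.
    """
    h, w = len(dem), len(dem[0])
    acc = [[0.0] * w for _ in range(h)]   # summed patch responses
    cnt = [[0] * w for _ in range(h)]     # how many patches hit each pixel
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            tile = [row[j:j + patch] for row in dem[i:i + patch]]
            out = enhance(tile)
            for di in range(patch):
                for dj in range(patch):
                    acc[i + di][j + dj] += out[di][dj]
                    cnt[i + di][j + dj] += 1
    # Average: interior pixels blend several overlapping responses,
    # which smooths out discontinuities at patch boundaries.
    return [[acc[i][j] / max(cnt[i][j], 1) for j in range(w)]
            for i in range(h)]
```

With the identity function as `enhance`, the aggregation reproduces the input exactly, which is a quick sanity check that the averaging is unbiased.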

Year of completion: March 2021
Advisor: Avinash Sharma, K S Rajan

Enhancing OCR Performance with Low Supervision


Deepayan Das

Abstract

Over the last decade, tremendous emphasis has been laid on the collection and digitization of vast numbers of books, leading to the creation of so-called 'Digital Libraries'. Projects like Google Books and Project Gutenberg have made significant progress in digitizing millions of books and making them available to the public. Efforts have also been made for Indic languages, where the task of identifying and recognizing books in several Indian languages has been undertaken by the National Digital Library of India. The advantages of digital libraries are manifold: digitization of ancient manuscripts ensures the preservation of knowledge and promotes research; books in digital libraries are indexed, which facilitates easy search and retrieval; and they are easy to store and require far less maintenance effort than their physical counterparts. One of the most important steps in the digitization effort is the recognition and conversion of physical pages into editable text using an OCR. Off-the-shelf OCR systems such as Tesseract and ABBYY FineReader are available; however, an OCR's ability to recognize text without committing too many errors depends heavily on the print quality of the pages as well as the font style of the typewritten text. A pre-trained OCR will invariably make errors on pages whose distribution differs, in terms of fonts and print quality, from the pages on which it was trained. If the domain gap is too large, the number of error words will be high, requiring significant correction effort. Since the books need to be indexed, one cannot afford too many word errors in the OCR-recognized pages; thus, major effort must be spent on correcting the error words misclassified by the OCR. Manually correcting each isolated error word incurs a huge cost and is infeasible. In this thesis, we look at methods to improve OCR accuracy with minimum human involvement.
To this end, we propose two approaches. In the first, we improve OCR performance via an efficient post-processing technique in which we group similar erroneous words and correct them simultaneously. We argue that since a book has a common underlying theme, it will contain many word repetitions; these co-occurrences can be exploited by grouping similar error words and correcting them in batches. We propose a novel clustering scheme that combines features from both the word images and their text transcriptions to group erroneous predictions. The grouped predictions can then be corrected either automatically or with the help of a human annotator. We show experimentally that automatic correction of error-word batches may not be the most efficient strategy, and that employing a human annotator to verify the error-word clusters addresses the issue more systematically. Next, we look at the problem of adapting an OCR to a new dataset without requiring many annotated pages. The traditional approach is to fine-tune the existing OCR on a portion of the target data; however, even annotating a portion of the data to create image-label pairs can be costly. Instead, we employ a self-training approach in which the OCR is fine-tuned on its own predictions on the target dataset. To curtail the effect of noise in the predictions, we include in the training set only those samples on which the model is sufficiently confident. We also show that by employing various regularization strategies we can outperform traditional fine-tuning without any additional labelled data, and that combining self-training with fine-tuning achieves the maximum gain in OCR accuracy across all the datasets. We furnish thorough empirical evidence to support all our claims.
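The confidence-based self-training loop described above can be sketched as follows. The `model_predict` and `finetune` callables are placeholders for the real OCR model and its fine-tuning routine, and the 0.9 threshold is an illustrative assumption, not a value from the thesis.

```python
def select_pseudo_labels(predictions, threshold=0.9):
    """Confidence-based filtering for OCR self-training.

    predictions: list of (image_id, predicted_text, confidence) triples
    produced by the current OCR model on unlabelled target pages.
    Only predictions the model is sufficiently confident about are kept
    as pseudo ground truth for the next fine-tuning round; the rest are
    discarded to curtail the effect of noisy labels.
    """
    return [(img, text) for img, text, conf in predictions if conf >= threshold]

def self_training_round(model_predict, unlabelled, finetune, threshold=0.9):
    """One self-training round: predict on unlabelled pages, filter by
    confidence, and fine-tune the model on the surviving pseudo-labels.

    model_predict(image_id) -> (text, confidence); finetune(pairs) -> model.
    Both are stand-ins for the real OCR system.
    """
    preds = [(img, *model_predict(img)) for img in unlabelled]
    pseudo = select_pseudo_labels(preds, threshold)
    return finetune(pseudo)
```

Repeating this round lets the model gradually absorb the target-domain font and print style without any manually annotated pages.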

Year of completion: March 2021
Advisor: C V Jawahar

Investigation of Different Aspects of Image Quality


Murtuza Bohra

Abstract

Image quality is a fundamental problem in computer vision. A variety of applications, for instance scanning documents, QR codes and bar codes, or algorithms for object detection, recognition, tracking and scene understanding, require images with good contrast, high illumination and sharpness. Similarly, in computer graphics, information visualisation, animations and presentations call for aesthetically pleasing design and good colorization. The definition of image quality therefore depends on the context and application of the image. In this thesis we address various challenges pertaining to image quality: (1) for natural images, we explore a novel approach for predicting the capture quality of images taken in the wild; (2) for graphics designs, we explore aesthetic quality by suggesting multiple aesthetically pleasing colorizations of a design. Owing to their increasing capability and portability, smartphone cameras have become the default choice for capturing images in the wild. However, camera-captured images suffer quality issues, for example due to a lack of stability during capture, which hinders automatic workflows that take such images as input, e.g. Optical Character Recognition (OCR) on document images or face detection/recognition on human images. Part of this thesis is focused on Image Quality Assessment (IQA), where the aim is to quantify degradations such as out-of-focus blur and motion artefacts in a given image. A major challenge in IQA for images captured in the wild is the absence of ground truth against which to measure capture quality. Previous attempts at IQA therefore required a human in the loop to create ground truth for capture quality: large user studies were conducted, and mean human opinion scores were used as the quality measure.
In this work we use a signal-processing-based technique to generate the IQA ground truth, and propose a comprehensive IQA dataset that is a good representative of the real degradations introduced during capture. Further, we propose a deep-learning-based approach to predict the quality of images captured in the wild. Such an IQA algorithm can help either by giving online quality suggestions during capture or by rating the quality afterwards. Another dimension of image quality is aesthetic quality. With the growth of the internet, social media and mobile cameras, photography has become a popular hobby for a large section of people. Even for a well-captured image, people apply a variety of filters and effects after capture to enhance its appearance, e.g. adjusting color temperature or contrast, or even blurring part of the image (the bokeh effect). Capture quality alone is therefore not enough to define the aesthetic quality of an image; one of the major factors is colorization. This is particularly true in the computer graphics domain, where artificially generated images already have well-defined structures, shapes and components, so sharpness and capture quality are less relevant while color quality plays a very important role in the visualization and appearance of the image. In natural images, colors are largely tied to semantics, e.g. the sky is always blue and grass is green, whereas in animations and computer graphics objects are only loosely associated with semantics, which leads to a wider choice of colors. The colorization problem is thus more challenging in the graphics domain, where the overall appearance of an image matters more than its naturalness. Therefore, for the aesthetic quality of graphics images, instead of measuring color quality we propose an algorithm that produces better coloring suggestions for a given graphics image.
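One common signal-processing proxy for capture sharpness is the variance of the image's discrete Laplacian: defocus blur suppresses high-frequency edge response, so the variance drops. The sketch below is shown only as an illustration of the kind of reference-free measure such a pipeline could use; it is not necessarily the exact technique used in the thesis.

```python
def laplacian_variance(img):
    """Reference-free sharpness proxy: variance of the discrete Laplacian.

    img: 2D list of grey values. The 4-neighbour Laplacian responds
    strongly at edges, so a sharp image yields a high-variance response
    while a blurred or flat image yields a low one. Border pixels are
    skipped so the 3x3 stencil always fits.
    """
    h, w = len(img), len(img[0])
    resp = [img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
            - 4 * img[y][x]
            for y in range(1, h - 1) for x in range(1, w - 1)]
    mean = sum(resp) / len(resp)
    return sum((r - mean) ** 2 for r in resp) / len(resp)
```

Ranking a set of captures of the same page by this score gives an ordinal blur ground truth without any human opinion scores.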

Year of completion: March 2021
Advisor: Vineet Gandhi

Deep Learning Frameworks for Human Motion Analysis


Neeraj Battan

Abstract

The human body is a complex structure consisting of multiple organs, muscles and bones. For modeling human motion, however, information about a fixed set of joint locations (with fixed bone-length constraints) is sufficient to express the temporal evolution of poses; thus, a human motion or activity can be represented as a temporally evolving skeleton sequence. In this thesis, we explore two key problems in modeling human motion: efficient indexing and retrieval of human motion sequences, and automated generation/synthesis of novel human motion sequences given an input class prior. 3D human motion indexing and retrieval is an interesting problem owing to the rise of several data-driven applications aimed at analyzing and/or re-utilizing 3D human skeletal data, such as data-driven animation, analysis of sports biomechanics and human surveillance. Spatio-temporal articulations of humans, noisy or missing data, and varying speeds of the same motion make it challenging, and several existing state-of-the-art methods use hand-crafted features together with optimization-based or histogram-based comparison to perform retrieval; further, they demonstrate results only on very small datasets with few classes. We make a case for a learned representation that both recognizes the motion and enforces a discriminative ranking. To that end, we propose a 3D human motion descriptor learned using a deep network. Our learned embedding is generalizable and applicable to real-world data, addressing the aforementioned challenges, and further enables sub-motion search in its embedding space using another network. Our model exploits inter-class similarity using trajectory cues and performs far better in a self-supervised setting. State-of-the-art results on all these fronts are shown on two large-scale 3D human motion datasets: NTU RGB+D and HDM05.
For the second research problem, human motion generation/synthesis, we aim at long-term human motion synthesis, which can aid human-centric video generation [9] with potential applications in augmented reality, 3D character animation, pedestrian trajectory prediction, etc. Long-term human motion synthesis is challenging owing to multiple factors: long-term temporal dependencies among poses, cyclic repetition across poses, bi-directional and multi-scale dependencies among poses, variable speed of actions, and a large as well as partially overlapping space of temporal pose variations across multiple classes of human activities. This thesis addresses these challenges to synthesize long-term (> 6000 ms) human motion trajectories across a large variety of human activity classes (> 50). We propose a two-stage motion synthesis method: the first stage learns the long-term global pose dependencies in activity sequences by learning to synthesize a sparse motion trajectory, while the second stage synthesizes dense motion trajectories from the output of the first stage. We demonstrate the superiority of the proposed method over state-of-the-art methods using various quantitative evaluation metrics on publicly available datasets. In summary, this thesis addresses two key research problems, namely efficient human motion indexing and retrieval, and long-term human motion synthesis. In doing so, we explore machine learning techniques that analyze human motion in a sequence of poses (frames), generate human motion, detect human pose from an RGB image, and index human motion.
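Retrieval with a learned motion descriptor reduces to nearest-neighbour ranking in the embedding space. The sketch below assumes each skeleton sequence has already been encoded to a fixed-length vector by the (omitted) deep descriptor network; cosine similarity is an illustrative choice of metric, not necessarily the one used in the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two motion embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, database, top_k=5):
    """Rank stored motion sequences by similarity to a query embedding.

    database: list of (sequence_id, embedding) pairs, where each
    embedding would come from the learned deep descriptor (a
    placeholder here). Returns the top_k ids, most similar first.
    A discriminative embedding makes same-class motions rank highest.
    """
    ranked = sorted(database,
                    key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [seq_id for seq_id, _ in ranked[:top_k]]
```

Sub-motion search would apply the same ranking to embeddings of sequence fragments rather than whole sequences.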

Year of completion: March 2021
Advisor: Anoop M Namboodiri, Madhava Krishna
