
Deep Learning Frameworks for Human Motion Analysis


Neeraj Battan

Abstract

The human body is a complex structure consisting of multiple organs, muscles and bones. However, for modeling human motion, the information about a fixed set of joint locations (with a fixed bone-length constraint) is sufficient to express the temporal evolution of poses. Thus, one can represent a human motion/activity as a temporally evolving skeleton sequence. In this thesis, we primarily explore two key problems related to modeling of human motion: the first is efficient indexing & retrieval of human motion sequences, and the second is automated generation/synthesis of novel human motion sequences given an input class prior. 3D human motion indexing & retrieval is an interesting problem due to the rise of several data-driven applications aimed at analyzing and/or re-utilizing 3D human skeletal data, such as data-driven animation, analysis of sports biomechanics, human surveillance, etc. Spatio-temporal articulations of humans, noisy/missing data, different speeds of the same motion, etc. make it challenging, and several existing state-of-the-art methods use hand-crafted features along with optimization-based or histogram-based comparison in order to perform retrieval. Further, they demonstrate it only for very small datasets and few classes. We make a case for using a learned representation that should recognize the motion as well as enforce a discriminative ranking. To that end, we propose a 3D human motion descriptor learned using a deep network. Our learned embedding is generalizable and applicable to real-world data, addressing the aforementioned challenges, and further enables sub-motion searching in its embedding space using another network. Our model exploits the inter-class similarity using trajectory cues and performs far better in a self-supervised setting. State-of-the-art results on all these fronts are shown on two large-scale 3D human motion datasets: NTU RGB+D and HDM05.

In regard to the second research problem, human motion generation/synthesis, we aim at long-term human motion synthesis that can aid human-centric video generation [9], with potential applications in augmented reality, 3D character animation, pedestrian trajectory prediction, etc. Long-term human motion synthesis is a challenging task due to multiple factors like long-term temporal dependencies among poses, cyclic repetition across poses, bi-directional and multi-scale dependencies among poses, variable speed of actions, and a large as well as partially overlapping space of temporal pose variations across multiple classes/types of human activities. This thesis aims to address these challenges to synthesize a long-term (> 6000 ms) human motion trajectory across a large variety of human activity classes (> 50). We propose a two-stage motion synthesis method to achieve this goal, where the first stage deals with learning the long-term global pose dependencies in activity sequences by learning to synthesize a sparse motion trajectory, while the second stage addresses the synthesis of dense motion trajectories taking the output of the first stage as input. We demonstrate the superiority of the proposed method over SOTA methods using various quantitative evaluation metrics on publicly available datasets. In summary, this thesis successfully addresses two key research problems, namely, efficient human motion indexing & retrieval and long-term human motion synthesis.
In doing so, we explore different machine learning techniques that analyze human motion in a sequence of poses (also called frames), generate human motion, detect human pose from an RGB image, and index human motion.
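To make the retrieval setup concrete, the following is a minimal PyTorch sketch of a learned skeleton-sequence descriptor trained with a ranking objective; the GRU encoder, embedding size and triplet loss are illustrative assumptions, not the exact architecture used in the thesis.

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Encodes a skeleton sequence (T frames x J joints x 3 coordinates) into a fixed-size descriptor."""
    def __init__(self, num_joints=25, embed_dim=256):
        super().__init__()
        self.gru = nn.GRU(input_size=num_joints * 3, hidden_size=embed_dim, batch_first=True)

    def forward(self, skeletons):            # skeletons: (B, T, J, 3)
        B, T, J, C = skeletons.shape
        x = skeletons.view(B, T, J * C)      # flatten joints per frame
        _, h = self.gru(x)                   # h: (1, B, embed_dim)
        return nn.functional.normalize(h.squeeze(0), dim=-1)  # unit-norm motion descriptor

# Triplet ranking: pull motions of the same class together, push others apart.
encoder = MotionEncoder()
triplet = nn.TripletMarginLoss(margin=0.2)
anchor = encoder(torch.randn(8, 60, 25, 3))  # e.g. 60-frame clips, 25 joints (NTU RGB+D layout)
pos    = encoder(torch.randn(8, 60, 25, 3))
neg    = encoder(torch.randn(8, 60, 25, 3))
loss = triplet(anchor, pos, neg)
```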

Year of completion:  March 2021
 Advisors: Anoop M Namboodiri, Madhava Krishna

Related Publications


    Downloads

    thesis

    Behaviour Detection of Vehicles Using Graph Convolutional Networks with Attention


    Sravan Mylavarapu

    Abstract

    Autonomous driving has evolved with the advent of deep learning and increased computational capacity over the past two decades. While many tasks have evolved that help autonomous navigation, understanding on-road vehicle behaviour from a continuous sequence of sensor data (camera or radar) is an important and useful task. Determining the state of other vehicles and their motion helps in decision making for the ego-vehicle (self/our vehicle). Such systems can be a part of driver assistance systems or even path planners for vehicles to avoid obstacles. Instances of such decision making could be deciding to apply brakes in case an external agent is moving into our lane or coming towards us, or keeping distance from an aggressive driver who changes lanes and overtakes frequently. Specifically, our proposed methods classify the behaviours of on-road vehicles into one of the following categories: {Moving Away, Moving Towards Us, Parked/Stationary, Lane Change, Overtake}. Many current methods leverage 3D depth information using expensive equipment, such as LiDAR, and complex algorithms. In this thesis, we propose a simpler pipeline for understanding vehicle behaviour from a monocular (single camera) image sequence or video. This allows easier installation of the equipment and helps in driver assistance systems. A simple video, along with camera parameters and scene semantics (detected objects in the images), is used to obtain information about the objects of interest (vehicles) and other static objects like lanes/poles in the scene. We consider detecting objects across frames as the pre-processing pipeline and propose two main methods, (1) Spatio-temporal MR-GCN (Chapter 3) and (2) Relational-Attentive GCN (Chapter 4), for behaviour detection of vehicles. We divide our process of identifying behaviours into learning positional information of vehicles first, and then observing them across time to make a prediction. The positional information is encoded by a Multi-Relational Graph Convolutional Network (MR-GCN) in both methods. Temporal information is encoded by recurrent networks in Method 1, while it is formulated inside the graph in Method 2. We also showcase through our experiments how attention is an important aspect of both our methods. The proposed frameworks can classify a variety of vehicle behaviours with high fidelity on datasets that are diverse and include European, Chinese and Indian on-road scenes. The framework also provides for seamless transfer of models across datasets without entailing re-annotation, retraining, or even fine-tuning. We provide comparative performance gains over baselines and detail a variety of ablations to showcase the efficacy of the framework.
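    The following is a minimal sketch of a multi-relational graph convolution layer of the kind described above, where each spatial relation between objects gets its own adjacency and weight matrix; the relation set, feature sizes and normalisation are illustrative assumptions rather than the thesis's exact formulation.

```python
import torch
import torch.nn as nn

class MultiRelationalGCNLayer(nn.Module):
    """One multi-relational graph convolution: each relation type (e.g. 'behind',
    'left-of', 'same-lane') has its own adjacency matrix and weight matrix."""
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.rel_weights = nn.ModuleList(nn.Linear(in_dim, out_dim, bias=False)
                                         for _ in range(num_relations))
        self.self_loop = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adjacency):
        # node_feats: (N, in_dim); adjacency: (R, N, N), one row-normalised matrix per relation
        out = self.self_loop(node_feats)
        for r, lin in enumerate(self.rel_weights):
            out = out + adjacency[r] @ lin(node_feats)   # aggregate relation-specific neighbours
        return torch.relu(out)

# Toy example: 6 scene objects (vehicles, lanes, poles), 16-d features, 4 spatial relations.
layer = MultiRelationalGCNLayer(in_dim=16, out_dim=32, num_relations=4)
feats = torch.randn(6, 16)
adj = torch.rand(4, 6, 6)
adj = adj / adj.sum(-1, keepdim=True)   # crude row normalisation
node_embeddings = layer(feats, adj)     # per-frame embeddings; a recurrent net (Method 1) or a
                                        # temporal graph (Method 2) would aggregate these over time
```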

    Year of completion:  February 2021
     Advisors: Anoop M Namboodiri, Madhava Krishna

    Related Publications


      Downloads

      thesis

      Computer Vision for Atmospheric Turbulence: Generation, Restoration, and its Applications


      Shyam Nandan Rai

      Abstract

      Real-world images often suffer from variations in weather conditions such as rain, fog, snow, and temperature. These variations in the atmosphere adversely affect the performance of computer vision models in real-world scenarios. This problem can be bypassed by collecting and annotating images for each weather condition. However, collecting and annotating images in such conditions is an extremely tedious task, which is time-consuming as well as expensive. So, in this work, we address the aforementioned problems. Among all the weather conditions, we focus on the distortions in the image caused by high temperature, also known as atmospheric turbulence. These distortions introduce geometrical deformations around the boundaries of objects in an image, which cause vision algorithms to perform poorly and pose a major concern. Hence, in this thesis, we address the problem of artificially generating atmospheric turbulence and restoring images from it. In the first part of our work, we attempt to model atmospheric turbulence, since such models are critical to extending computer vision solutions developed in the laboratory to real-world use cases, and simulating atmospheric turbulence using statistical models or computer graphics is often computationally expensive. To overcome this problem, we train a generative adversarial network (GAN) that outputs an atmospheric turbulent image while utilizing fewer computational resources than traditional methods. We propose a novel loss function to efficiently learn the atmospheric turbulence at a finer level. Experiments show that by using the proposed loss function, our network outperforms the existing state-of-the-art image-to-image translation networks in turbulent image generation. In the second part of the thesis, we address the ill-posed problem of restoring images degraded by atmospheric turbulence. We propose a deep adversarial network to recover images distorted by atmospheric turbulence and show the applicability of the restored images in several tasks. Unlike previous methods, our approach neither uses any prior knowledge about atmospheric turbulence conditions at inference time nor requires the fusion of multiple images to get a single restored image. To train our models, we synthesized turbulent images by following a series of efficient 2D operations. Thereafter, using our trained models, we run inference on real and synthesized turbulent images. Our final restoration models, DT-GAN+ and DTD-GAN+, qualitatively and quantitatively outperform the general state-of-the-art image-to-image translation models. The improved performance of our model is due to the use of optimized residual structures along with channel attention and a sub-pixel mechanism, which exploits the information between the channels and removes atmospheric turbulence at a finer level. We also perform extensive experiments on restored images by utilizing them for downstream tasks such as classification, pose estimation, semantic keypoint estimation, and depth estimation. In the third part of our work, we study the problem of adapting semantic segmentation models to hot-climate cities. This issue can be circumvented by collecting and annotating images in such weather conditions and training segmentation models on those images. But the task of semantically annotating images for every environment is painstaking and expensive.
Hence, we propose a framework that improves the performance of semantic segmentation models without explicitly creating an annotated dataset for such adverse weather variations. Our framework consists of two parts: a restoration network to remove the geometrical distortions caused by hot weather, and an adaptive segmentation network that is trained with an additional loss to adapt to the statistics of the ground-truth segmentation map. We train our framework on the Cityscapes dataset, which showed a total IoU gain of 12.707 over standard segmentation models. In the last part of our work, we improve the performance of our joint restoration and segmentation network via a feedback mechanism. In the previous approach, the restoration network does not learn directly from the errors of the segmentation network; in other words, the restoration network is not task-aware. Hence, we propose a semantic feedback learning approach, which improves the task of semantic segmentation by giving a feedback response to the restoration network. This response works as an attend-and-fix mechanism by focusing on those areas of an image where restoration needs improvement. We also propose two loss functions, Iterative Focal Loss (iFL) and Class-Balanced Iterative Focal Loss (CB-iFL), which are specifically designed to improve the performance of the feedback network. These losses focus more on those samples that are continuously misclassified over successive iterations. Our approach gives a gain of 17.41 mIoU over the standard segmentation model, including an additional gain of 1.9 mIoU with CB-iFL, on the Cityscapes dataset.
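      The exact 2D operations used to synthesize turbulent training images are not spelled out above; the sketch below shows one common, inexpensive approximation (a smoothed random warp followed by a mild blur), purely as an illustration of how such data can be generated with cheap image-space operations.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def simulate_turbulence(image, strength=4.0, smoothness=8.0, blur=1.0):
    """Approximate heat-haze distortion on a 2-D grayscale array (apply per channel for colour).
    This is an assumed, illustrative recipe, not the thesis's exact synthesis pipeline."""
    h, w = image.shape
    # Random displacement fields, low-pass filtered so nearby pixels move together.
    dx = gaussian_filter(np.random.randn(h, w), smoothness) * strength
    dy = gaussian_filter(np.random.randn(h, w), smoothness) * strength
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    warped = map_coordinates(image, [ys + dy, xs + dx], order=1, mode="reflect")
    return gaussian_filter(warped, blur)   # slight blur mimics scintillation
```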

      Year of completion:  December 2020
       Advisors: C V Jawahar, Vineeth Balasubramanian, Anbumani Subramanian

      Related Publications


        Downloads

        thesis

        Monocular 3D Human Body Reconstruction


        Abbhinav Venkat

        Abstract

        Monocular 3D human reconstruction is a very relevant problem due to numerous applications in the entertainment industry, e-commerce, health care, mobile-based AR/VR platforms, etc. However, it is severely ill-posed due to self-occlusions from complex body poses and shapes, clothing obstructions, lack of surface texture, background clutter, single view, etc. Conventional approaches address these challenges by using different sensing systems: marker-based, marker-less multi-view cameras, inertial sensors, and 3D scanners. Although effective, such methods are often expensive and have limited wide-scale applicability. In an attempt to produce scalable solutions, a few works have focused on fitting statistical body models to monocular images, but these are susceptible to a costly optimization process. Recent efforts focus on using data-driven algorithms such as deep learning to learn priors directly from data. However, they focus on template model recovery, rigid object reconstruction, or propose paradigms that don’t directly extend to recovering personalized models. To predict accurate surface geometry, our first attempt was VolumeNet, which predicted a 3D occupancy grid from a monocular image and was the first model of its kind for non-rigid human shapes at that time. To circumvent the ill-posed nature of this problem (aggravated by an unbounded 3D representation), we follow the ideology of providing maximal training priors with our unique training paradigms, to enable testing with minimal information. As we did not impose any body-model-based constraint, we were able to recover deformations induced by free-form clothing. Further, we extended VolumeNet to PoShNet by decoupling pose and shape, in which we learn the volumetric pose first and use it as a prior for learning the volumetric shape, thereby recovering a more accurate surface. Although volumetric regression enables recovering a more accurate surface, it does so without an animatable skeleton. Further, such methods yield reconstructions of low resolution at higher computational cost (regression over the cubic voxel grid) and often suffer from an inconsistent topology via broken or partial body parts. Hence, statistical body models become a natural choice to offset the ill-posed nature of this problem. Although theoretically they are low-dimensional, learning such models has been challenging due to the complex non-linear mapping from the image to the relative axis-angle representation. Hence, most solutions rely on different projections of the underlying mesh (2D/3D keypoints, silhouettes, etc.). To simplify the learning process, we propose the CR framework that uses classification as a prior for guiding the regression’s learning process. Although recovering personalized models with high-resolution meshes isn’t a possibility in this space, the framework shows that learning such template models can be difficult without additional supervision. As an alternative to directly learning parametric models, we propose HumanMeshNet to learn an “implicitly structured point cloud”, in which we make use of the mesh topology as a prior to enable better learning. We hypothesize that instead of learning the highly non-linear SMPL parameters, learning the corresponding point cloud (although high-dimensional) and enforcing the same parametric template topology on it is an easier task. This proposed paradigm can theoretically learn local surface deformations that the body-model-based PCA space can’t capture.
Further, going ahead, attempting to produce high-resolution meshes (with accurate geometry details) is a natural extension that is easier in 3D space than in the parametric one. In summary, in this thesis, we attempt to address several of the aforementioned challenges and empower machines with the capability to interpret a 3D human body model (pose and shape) from a single image in a manner that is non-intrusive, inexpensive and scalable. In doing so, we explore different 3D representations that are capable of producing accurate surface geometry, with a long-term goal of recovering personalized 3D human models.
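        As an illustration of the “implicitly structured point cloud” idea, the sketch below regresses the vertices of a fixed template mesh and uses the template’s edge connectivity as a surface prior; the edge-length term and its weighting are assumptions for illustration, and `template_edges` stands in for the topology of a body template such as SMPL.

```python
import torch

def mesh_regression_loss(pred_vertices, gt_vertices, template_edges, w_edge=0.1):
    """Per-vertex L2 error plus an edge-length term that exploits the fixed template topology.
    pred_vertices, gt_vertices: (B, V, 3); template_edges: LongTensor (E, 2) of vertex-index pairs
    taken from the template mesh (assumed given)."""
    vertex_loss = torch.mean((pred_vertices - gt_vertices) ** 2)
    i, j = template_edges[:, 0], template_edges[:, 1]
    pred_len = torch.norm(pred_vertices[:, i] - pred_vertices[:, j], dim=-1)
    gt_len = torch.norm(gt_vertices[:, i] - gt_vertices[:, j], dim=-1)
    edge_loss = torch.mean((pred_len - gt_len) ** 2)   # penalise stretched or broken edges
    return vertex_loss + w_edge * edge_loss
```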

        Year of completion:  November 2020
         Advisor: Avinash Sharma

        Related Publications


          Downloads

          thesis

          Neural and Multilingual Approaches to Machine Translation for Indian Languages and its Applications


          Jerin Philip

          Abstract

          Neural Machine Translation (NMT), together with multilingual formulations, has arisen as the de-facto standard for translating a sentence from a source language to a target language. However, unlike many western languages, the available resources, such as training data of parallel sentences or trained models which can be used to build and demonstrate applications in other domains, are limited for the languages of the Indian subcontinent. This work takes a major step towards closing this gap. In this work, we describe the development of state-of-the-art translation solutions for 10 Indian languages and English. We do this in four parts, described below:
          1. Considering the Hindi-English language pair, we successfully develop an NMT solution for a narrow domain, demonstrating its application in translating cricket commentary.
          2. Through heavy data augmentation, we extend the above to the general domain and build a state-of-the-art MT system for the Hindi-English language pair. Further, we extend to five more languages by taking advantage of multiway formulations.
          3. We demonstrate the application of NMT in contributing more resources to the already resource-scarce field, expanding to 11 languages, and its application in the multimodal task of translating a talking face to a target language with lip synchronization.
          4. Next, we iteratively improve both the data situation and the performance of machine translation for 11 Indian languages, placing our models in a standardized, comparable set of metrics for future advances in the space to comprehensively evaluate and compare against.
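          As a small illustration of the multiway formulation mentioned above, a common convention (assumed here, not necessarily the exact scheme used in this work) is to train one shared encoder-decoder on all language pairs and to select the output language with a tag prepended to the source sentence:

```python
def to_multiway_sample(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language tag so a single shared model can translate
    into any of the supported languages (illustrative convention only)."""
    return f"__{tgt_lang}__ {src_sentence}"

# The same model then serves every translation direction, e.g.:
print(to_multiway_sample("The batsman drives it through the covers.", "hi"))  # English -> Hindi
print(to_multiway_sample("The batsman drives it through the covers.", "ta"))  # English -> Tamil
```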

          Year of completion:  August 2020
           Advisors: C V Jawahar, Vinay P. Namboodiri

          Related Publications


            Downloads

            thesis
