
Deep Neural Models for Generalized Synthesis of Multi-Person Actions


Debtanu Gupta

Abstract

The ability to synthesize novel and diverse human motion at scale is indispensable not only to computer vision but also to allied fields such as animation, human-computer interaction, robotics and human-robot interaction. Over the years, various approaches have been proposed, including physics-based simulation, key-framing and database methods. But ever since the renaissance of deep learning and the rapid development of computing, deep learning-based methods for generating synthetic human motion have received significant attention. Apart from pixel-based video data, the availability of reliable motion capture systems has enabled pose-based human action synthesis. Much of this is owed to the development of frugal motion capture systems, which enabled the curation of large-scale skeleton action datasets. In this thesis, we focus on skeleton-based human action generation. To begin with, we study an approach for large-scale skeleton-based action generation. In doing so, we introduce MUGL, a novel deep neural model for large-scale, diverse generation of single and multi-person pose-based action sequences with locomotion. Our controllable approach enables variable-length generations customizable by action category, across more than 100 categories. To enable intra/inter-category diversity, we model the latent generative space using a Conditional Gaussian Mixture Variational Autoencoder. To enable realistic generation of actions involving locomotion, we decouple the local pose and global trajectory components of the action sequence. We incorporate duration-aware feature representations to enable variable-length sequence generation. We use a hybrid pose sequence representation with 3D pose sequences sourced from videos and 3D Kinect-based sequences from NTU-RGBD120. To enable principled comparison of generation quality, we employ suitably modified strong baselines during evaluation. Although smaller and simpler than the baselines, MUGL provides better-quality generations, paving the way for practical and controllable large-scale human action generation. Further, we study methods that generalize across datasets with varying properties, as well as methods for dense skeleton action generation. In this backdrop, we introduce DSAG, a controllable deep neural framework for action-conditioned generation of full-body, multi-actor, variable-duration actions. To compensate for incompletely detailed finger joints in existing large-scale datasets, we introduce full-body dataset variants with detailed finger joints. To overcome shortcomings in existing generative approaches, we introduce dedicated representations for encoding finger joints. We also introduce novel spatiotemporal transformation blocks with multi-head self-attention and specialized temporal processing. These design choices enable generations over a large range of body joint counts (24-52), frame rates (13-50), global body movement (in-place, locomotion) and action categories (12-120), across multiple datasets (NTU-120, HumanAct12, UESTC, Human3.6M). Our experimental results demonstrate DSAG's significant improvements over the state-of-the-art, its suitability for action-conditioned generation at scale, and also for the challenging task of long-term motion prediction.
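As a rough illustration of two ideas mentioned above, an action-conditioned Gaussian-mixture latent space and a decoder that decouples local pose from global trajectory, the following Python sketch may help. It is hypothetical: the module names, dimensions and single-step decoding are assumptions made for exposition, not MUGL's actual architecture (which also handles duration-aware, variable-length sequences).

    import torch
    import torch.nn as nn

    class ConditionalGMVAE(nn.Module):
        """Hypothetical sketch: one Gaussian latent component per action category,
        with a decoder that emits local pose and global trajectory separately."""
        def __init__(self, num_actions=120, latent_dim=64, pose_dim=72, traj_dim=3):
            super().__init__()
            self.prior_mu = nn.Embedding(num_actions, latent_dim)      # per-action mixture means
            self.prior_logvar = nn.Embedding(num_actions, latent_dim)  # per-action mixture variances
            self.decode_pose = nn.Linear(latent_dim, pose_dim)         # local joint configuration
            self.decode_traj = nn.Linear(latent_dim, traj_dim)         # global root trajectory

        def sample(self, action_ids):
            mu = self.prior_mu(action_ids)
            std = torch.exp(0.5 * self.prior_logvar(action_ids))
            z = mu + std * torch.randn_like(std)                        # reparameterized sample
            return self.decode_pose(z), self.decode_traj(z)

    model = ConditionalGMVAE()
    pose, traj = model.sample(torch.tensor([7]))                        # one sample for action id 7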

Year of completion: December 2022
Advisor: Ravi Kiran Sarvadevabhatla

Related Publications


Downloads

thesis

Visual Grounding for Multi-modal Applications


Kanishk Jain

Abstract

The task of Visual Grounding lies at the intersection of computer vision and natural language processing. The Visual Grounding (VG) task requires spatially localizing an entity in a visual scene based on its linguistic description. The capability to ground language in the visual domain is of significant importance for many real-world applications, especially for human-machine interaction. One such application is language-guided navigation, where the navigation of autonomous vehicles is modulated using a linguistic command. The VG task is intimately linked with the task of vision-language navigation (VLN), as both tasks require reasoning about the linguistic command and the visual scene simultaneously. Existing approaches to VG can be divided into two categories based on the type of localization performed: (1) bounding-box/proposal-based localization and (2) pixel-level localization. This work focuses on pixel-level localization, where the segmentation mask corresponding to the entity/region referred to by the linguistic expression is predicted. The research in this thesis focuses on a novel modeling strategy for the visual and linguistic modalities for the VG task, followed by the first visual grounding-based approach to the VLN task. We first present a novel architecture for the task of pixel-level localization, also known as Referring Image Segmentation (RIS). The architecture is based on the hypothesis that both intra-modal (word-word and pixel-pixel) and inter-modal (word-pixel) interactions are required to identify the referred entity successfully. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions synchronously in a single step. We validate our hypothesis empirically against existing methods and achieve state-of-the-art results on RIS benchmarks. Finally, we propose the novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on the linguistic command. RNR differs from RIS, which grounds an object referred to by the natural language expression rather than a navigable region. We additionally introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2Car dataset with segmentation masks for the regions described by the linguistic commands. We present extensive ablations and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework.
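One simple way to realize "all three interactions in a single step" is joint self-attention over a concatenated sequence of word and pixel tokens, so word-word, pixel-pixel and word-pixel interactions are computed together. The Python sketch below illustrates only this idea; the dimensions, token counts and use of a single attention layer are assumptions, not the architecture proposed in the thesis.

    import torch
    import torch.nn as nn

    # Hypothetical sketch: one attention pass over the joint word+pixel token sequence.
    d_model = 256
    attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

    words = torch.randn(1, 20, d_model)          # 20 word embeddings
    pixels = torch.randn(1, 14 * 14, d_model)    # 14x14 visual feature map, flattened
    tokens = torch.cat([words, pixels], dim=1)   # single joint token sequence

    fused, _ = attn(tokens, tokens, tokens)      # intra- and inter-modal interactions at once
    pixel_feats = fused[:, 20:, :]               # per-pixel features for mask prediction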

Year of completion: December 2022
Advisor: Vineet Gandhi

Related Publications


Downloads

thesis

I-Do, You-Learn: Techniques for Unsupervised Procedure Learning using Egocentric Videos


Siddhant Bansal

Abstract

Consider an autonomous agent capable of observing multiple humans making a pizza and then making one itself the next time! Motivated to contribute towards creating systems capable of understanding and reasoning about instructions at the human level, in this thesis we tackle procedure learning. Procedure learning involves identifying the key-steps and determining their logical order to perform a task. The first portion of this thesis focuses on the datasets curated for procedure learning. Existing datasets commonly consist of third-person videos for learning the procedure, making the manipulated object small in appearance and often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action. To this end, for studying procedure learning from egocentric videos, we propose the EgoProceL dataset. However, procedure learning from egocentric videos is challenging because the camera view undergoes extreme changes due to the wearer's head motion, introducing unrelated frames. Due to this, the assumption of current state-of-the-art methods that the actions occur at approximately the same time and are of the same duration does not hold. Instead, we propose to use the signal provided by the temporal correspondences between key-steps across videos. To this end, we present a novel self-supervised Correspond and Cut (CnC) framework that identifies and utilizes the temporal correspondences between key-steps across multiple videos to learn the procedure. We perform experiments on the benchmark ProceL and CrossTask datasets and achieve state-of-the-art results. In the second portion of the thesis, we look at various approaches to generate the signal for learning the embedding space. Existing approaches use only one or a couple of videos for this purpose. However, we argue that this makes key-step discovery challenging, as the algorithms lack an inter-videos perspective. To this end, we propose an unsupervised Graph-based Procedure Learning (GPL) framework. GPL consists of the novel UnityGraph, which represents all the videos of a task as a graph to obtain both intra-video and inter-videos context. Further, to obtain similar embeddings for the same key-steps, the embeddings of UnityGraph are updated in an unsupervised manner using the Node2Vec algorithm. Finally, to identify the key-steps, we cluster the embeddings using k-means. We test GPL on the benchmark ProceL, CrossTask and EgoProceL datasets and achieve an average improvement of 2% on third-person datasets and 3.6% on EgoProceL over the state-of-the-art. We hope this work motivates future research on procedure learning from egocentric videos. Furthermore, the unsupervised approaches proposed in the thesis will help create scalable systems and drive future research toward creative solutions.
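To make the graph-embed-cluster pipeline concrete, here is a minimal Python sketch: segments become graph nodes, Node2Vec produces node embeddings, and k-means groups them into candidate key-steps. The toy graph, the dimensions, and the use of the pip `node2vec` package as a stand-in for the thesis' Node2Vec step are all assumptions, not the GPL implementation.

    import networkx as nx
    import numpy as np
    from node2vec import Node2Vec            # assumed stand-in for the Node2Vec step
    from sklearn.cluster import KMeans

    # Hypothetical sketch: one node per video segment; edges connect temporally
    # adjacent segments within a video and similar segments across videos.
    graph = nx.Graph()
    graph.add_edges_from([(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (2, 5)])

    n2v = Node2Vec(graph, dimensions=32, walk_length=10, num_walks=50)
    model = n2v.fit(window=5, min_count=1)    # gensim Word2Vec over random walks
    embeddings = np.array([model.wv[str(n)] for n in graph.nodes()])

    # Cluster segment embeddings; each cluster id acts as a discovered key-step.
    key_steps = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)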

Year of completion: February 2023
Advisors: C V Jawahar, Chetan Arora

Related Publications


Downloads

thesis

Continual and Incremental Learning in Computer-aided Diagnosis Systems


Prathyusha Akundi

Abstract

Deep Neural Networks (DNNs) have shown remarkable performance in a broad range of computer vision tasks, including in the medical domain. With the advent of DNNs, the medical community has witnessed significant developments in segmentation, classification, and detection. But this success comes at the cost of heavy reliance on abundant data. Medical data, however, is often highly limited in volume and quality due to sparsity of patient contact, variability in medical care, and privacy concerns. Hence, to train large networks we seek data from different sources. In such a scenario, it is of interest to design a model that learns continuously and adapts to datasets or tasks as and when they are available. However, one of the important steps to achieve such a never-ending learning process is to overcome Catastrophic Forgetting (CF) of previously seen data or tasks. CF refers to the significant degradation in performance on old tasks/datasets when a model is trained on new ones. To avoid confusion, we call a training regime Continual Learning (CL) when CAD systems have to handle a sequence of datasets collected over time from different sites with different imaging parameters/populations. Similarly, Incremental Learning (IL) is when CAD systems have to learn new classes as and when new annotations are made available. The work described in this thesis addresses core aspects of both CL and IL and has been compared against state-of-the-art methods. In this thesis, we assume that access to the data belonging to previously trained datasets or tasks is not available, which makes both the CL and IL processes even more challenging. We start with developing a CL system that learns sequentially on different datasets and handles CF using an uncertainty mechanism. The system consists of an ensemble of models which are trained or finetuned on each dataset, and it considers the prediction from the model which has the least uncertainty. We then investigate a new way to tackle CF in CL by manifold learning, inspired by defense mechanisms against adversarial attacks. Our method uses a ‘Reformer’, essentially a denoising autoencoder that ‘reforms’, or brings, the data from all the datasets towards a common manifold. These reformed samples are then passed to the network to learn the desired task. Towards IL, we propose a novel approach that ensures that a model remembers the causal factor behind its decisions on the old classes while incrementally learning new classes. We introduce a common auxiliary task during the course of incremental training, whose hidden representations are shared across all the classification heads. All the experiments for both CL and IL are conducted on multiple datasets and show significant gains over state-of-the-art methods.
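As a rough illustration of "prediction from the model with the least uncertainty", the Python sketch below queries every dataset-specific model and keeps the output whose softmax distribution has the lowest entropy. The entropy-based measure and the function shape are assumptions for illustration, not the exact uncertainty mechanism used in the thesis.

    import torch

    def least_uncertain_prediction(models, x):
        """Hypothetical sketch: pick the ensemble member whose prediction
        has the lowest entropy, i.e. the most confident model."""
        best_probs, best_entropy = None, float("inf")
        for model in models:
            with torch.no_grad():
                probs = torch.softmax(model(x), dim=-1)
            entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
            if entropy.item() < best_entropy:
                best_probs, best_entropy = probs, entropy.item()
        return best_probs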

Year of completion: April 2023
Advisor: Jayanthi Sivaswamy

Related Publications


Downloads

thesis

3D Interactive Solution for Neuroanatomy Education


Mythri V

Abstract

Typically, anatomy is taught through dissection, 2D images, presentations, and cross-sections. While these methods are convenient, they are non-interactive and fail to capture the spatial relationships and functional aspects of anatomy. New visualization technologies such as virtual reality and 3D can compensate for these shortcomings and provide better understanding while captivating the students. With recent advances in the industry, methods to provide a 3D experience have become economical. In this thesis, we introduce a low-cost 3D-interactive anatomy system designed for an audience of typical medical college students in an anatomy class. The setup used to achieve 3D visualization is dual-projector polarization. While there are other ways to achieve 3D visualization, such as alternate-frame sequencing and virtual reality, this technique can target a large audience and requires minimal accessories, making it a low-cost solution for an immersive 3D experience. The 3D interactive neuroanatomy solution is an end-to-end framework capable of designing anatomy lessons and visualizing them as 3D stereoscopic projections. To ensure superior comprehension by students, we incorporate each teacher's unique teaching approach by providing the ability to create their own lessons. We have created anatomy lessons based on the human brain, which is a vital organ with a complex anatomy. Our aim is to help medical students understand the complexity of organ systems not just from an anatomical perspective but also from a radiological perspective. We use annotations on clinical case data such as MRI, MRA, etc., to create 3D models for anatomy visualization, incorporating clinical information and illustrating real cases. Annotations for structures of interest are done using manual, automatic, and semi-automatic segmentation methods. Manual delineation of structure boundaries is very tedious and time-consuming, whereas automatic segmentation is quick and convenient. However, for the 3D anatomy viewer, small and complex structures were annotated manually due to substandard automatic segmentation, so there is a need to improve automatic segmentation performance for these structures. While segmentation is an essential step in 3D modeling, it also plays a critical role in diagnosing many neurological diseases, which are associated with degradation in the sub-cortical region. Therefore, accurate algorithms are needed for sub-cortical structure segmentation. Variance in the size of structures is significant, which introduces a performance bias towards larger structures in many deep learning approaches. In this part of the thesis, we aim to remove size bias in sub-cortical structure segmentation. The proposed method addresses this problem with a pre-training step that learns tissue characteristics and an ROI extraction step that aids in focusing on the local context; using structure ROIs elevates the influence of smaller structures in the network.
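To illustrate why ROI extraction can reduce size bias, the Python sketch below crops a fixed-size region around each structure from a coarse segmentation, so small and large structures contribute equally sized samples to the network. The ROI size, the centring strategy and the function itself are assumptions for illustration, not the method proposed in the thesis.

    import numpy as np

    def extract_structure_roi(volume, coarse_mask, label, roi_size=(32, 32, 32)):
        """Hypothetical sketch: crop a fixed-size ROI centred on one labelled
        structure so every structure yields a comparably sized training sample."""
        coords = np.argwhere(coarse_mask == label)
        if coords.size == 0:
            return None
        centre = coords.mean(axis=0).astype(int)
        # Clamp the crop so it stays inside the volume.
        starts = [max(0, min(c - s // 2, dim - s))
                  for c, s, dim in zip(centre, roi_size, volume.shape)]
        slices = tuple(slice(st, st + s) for st, s in zip(starts, roi_size))
        return volume[slices]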

Year of completion: April 2023
Advisor: Jayanthi Sivaswamy

Related Publications


Downloads

thesis
