Deep Neural Models for Generalized Synthesis of Multi-Person Actions

Debtanu Gupta

Abstract

The ability to synthesize novel and diverse human motion at scale is indispensable not only to the umbrella field of computer vision but also to many allied fields such as animation, human-computer interaction, robotics and human-robot interaction. Over the years, various approaches have been proposed, including physics-based simulation, key-framing and database methods. Since the renaissance of deep learning and the rapid advances in computing, however, the generation of synthetic human motion using deep learning based methods has received significant attention. Apart from pixel-based video data, the availability of reliable motion capture systems has enabled pose-based human action synthesis. Much of this is owed to the development of low-cost motion capture systems, which enabled the curation of large-scale skeleton action datasets. In this thesis, we focus on skeleton-based human action generation. To begin with, we study an approach for large-scale skeleton-based action generation. In doing so, we introduce MUGL, a novel deep neural model for large-scale, diverse generation of single-person and multi-person pose-based action sequences with locomotion. Our controllable approach enables variable-length generations customizable by action category, across more than 100 categories. To enable intra-category and inter-category diversity, we model the latent generative space using a Conditional Gaussian Mixture Variational Autoencoder. To enable realistic generation of actions involving locomotion, we decouple the local pose and global trajectory components of the action sequence. We incorporate duration-aware feature representations to enable variable-length sequence generation. We use a hybrid pose sequence representation with 3D pose sequences sourced from videos and 3D Kinect-based sequences of NTU-RGBD120. To enable principled comparison of generation quality, we employ suitably modified strong baselines during evaluation.
Although smaller and simpler than the baselines, MUGL provides better quality generations, paving the way for practical and controllable large-scale human action generation. Further, we study methods that generalize across datasets with varying properties, as well as methods for dense skeleton action generation. Against this backdrop, we introduce DSAG, a controllable deep neural framework for action-conditioned generation of full-body, multi-actor, variable-duration actions. To compensate for the lack of detailed finger joints in existing large-scale datasets, we introduce full-body dataset variants with detailed finger joints. To overcome shortcomings in existing generative approaches, we introduce dedicated representations for encoding finger joints. We also introduce novel spatiotemporal transformation blocks with multi-head self-attention and specialized temporal processing. These design choices enable generation over a wide range of body joint counts (24 to 52), frame rates (13 to 50), global body movement (in-place, locomotion) and action category counts (12 to 120), across multiple datasets (NTU-120, HumanAct12, UESTC, Human3.6M). Our experimental results demonstrate DSAG’s significant improvements over the state of the art, and its suitability both for action-conditioned generation at scale and for the challenging task of long-term motion prediction.
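
The decoupling of local pose and global trajectory mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common convention, in which the global trajectory is the per-frame position of a root joint and the local pose is every joint expressed relative to that root. The abstract does not specify how MUGL performs the split, so the function names `decouple` and `recombine`, the choice of joint 0 as the root, and the toy data are all hypothetical and purely illustrative.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) position of one joint
Frame = List[Joint]                  # all joints of one skeleton in one frame


def decouple(sequence: List[Frame], root: int = 0) -> Tuple[List[Frame], List[Joint]]:
    """Split a pose sequence into root-relative local poses and a root trajectory.

    Assumption: the root joint index (default 0) is an illustrative choice.
    """
    local_poses: List[Frame] = []
    trajectory: List[Joint] = []
    for frame in sequence:
        rx, ry, rz = frame[root]
        trajectory.append((rx, ry, rz))
        # Express every joint relative to the root joint of the same frame.
        local_poses.append([(x - rx, y - ry, z - rz) for x, y, z in frame])
    return local_poses, trajectory


def recombine(local_poses: List[Frame], trajectory: List[Joint]) -> List[Frame]:
    """Invert decouple(): add the root trajectory back onto the local poses."""
    return [
        [(x + rx, y + ry, z + rz) for x, y, z in frame]
        for frame, (rx, ry, rz) in zip(local_poses, trajectory)
    ]


# Toy two-frame, two-joint sequence that translates one unit along x.
walk = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
]
local, traj = decouple(walk)
```

In this toy sequence the local pose is identical in both frames while the trajectory carries all of the movement, which is the property that lets a model treat in-place articulation and locomotion as separate components.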

Year of completion:	December 2022
Advisor:	Ravi Kiran Sarvadevabhatla