Estimating 3D Human Pose, Shape, and Correspondences from Monocular Input
Amogh Tiwari
Abstract
In recent years, advances in computer vision have opened up multiple applications in virtual reality, healthcare, robotics, and many other domains. One crucial problem in computer vision, which has been a key research focus lately, is estimating 3D human pose, shape, and correspondences from monocular input. This problem has applications in industries such as fashion, entertainment, and healthcare. However, it is also highly challenging due to large variations in human pose, shape, appearance, and clothing, external and self-occlusions, and the difficulty of ensuring consistency, among other factors. In this thesis, we tackle two key problems related to 3D human pose, shape, and correspondence estimation. First, we focus on temporally consistent 3D human pose and shape estimation from monocular videos. Next, we focus on dense correspondence estimation across images of different (or the same) humans. We show that, despite significant research attention, existing methods for these tasks still perform sub-optimally in many challenging scenarios and have significant scope for improvement. We aim to overcome some of the limitations of existing methods and advance state-of-the-art (SOTA) solutions to these problems.

First, we propose a novel method for temporally consistent 3D human pose and shape estimation from a monocular video. Instead of the generic ResNet-like features used by prior work, our method uses a body-aware feature representation and an independent per-frame pose and camera initialization over a temporal window, followed by a novel spatio-temporal feature aggregation that combines self-similarity and self-attention over the body-aware features and the per-frame initializations. Together, these yield an enhanced spatio-temporal context for every frame by taking the remaining past and future frames into account. These features are used to predict the pose and shape parameters of the human body model, which are further refined using an LSTM.

Next, we expand our focus to dense correspondence estimation between humans, which requires understanding the relations between different body regions (represented using dense correspondences), including clothing details, of the same or different human(s). We present Continuous Volumetric Embeddings (ConVol-E), a novel, robust representation for dense correspondence matching across RGB images of different human subjects in arbitrary poses and appearances under non-rigid deformations. Unlike existing representations, ConVol-E captures the deviation from the underlying parametric body model by choosing suitable anchor/key points on the parametric body surface and then representing any point in the surrounding volume by its Euclidean relationship to these anchors. This allows us to represent an arbitrary point around the parametric body (clothing details, hair, etc.) by an embedding vector. Subsequently, given a monocular RGB image of a person, we learn to predict per-pixel ConVol-E embeddings, which carry a similar meaning across different subjects and are invariant to pose and appearance, thereby acting as descriptors for establishing robust, dense correspondences across different images of humans. We thoroughly evaluate our methods on publicly available benchmark datasets and show that they outperform the existing SOTA.
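To make the spatio-temporal aggregation described above more concrete, the following is a minimal PyTorch sketch, not the exact thesis architecture: the feature dimension, the number of body-model parameters, and the module choices are placeholder assumptions. It combines a self-similarity summary with self-attention over a window of per-frame body-aware features, regresses body-model parameters per frame, and refines them with an LSTM.

```python
# Illustrative sketch only (assumed dimensions and module choices), not the thesis implementation.
import torch
import torch.nn as nn

class SpatioTemporalAggregator(nn.Module):
    def __init__(self, feat_dim=256, n_heads=4, n_body_params=85):
        super().__init__()
        # Self-attention over the temporal window of body-aware features.
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        # Fuses the attended features with a per-frame self-similarity summary.
        self.fuse = nn.Linear(feat_dim + 1, feat_dim)
        # Per-frame regressor for body-model parameters, refined by an LSTM.
        self.regress = nn.Linear(feat_dim, n_body_params)
        self.refiner = nn.LSTM(n_body_params, n_body_params, batch_first=True)

    def forward(self, feats):
        # feats: (B, T, D) body-aware features for a window of T frames.
        sim = torch.softmax(feats @ feats.transpose(1, 2), dim=-1)     # (B, T, T) self-similarity
        sim_summary = sim.mean(dim=-1, keepdim=True)                   # (B, T, 1)
        attended, _ = self.attn(feats, feats, feats)                   # (B, T, D)
        fused = self.fuse(torch.cat([attended, sim_summary], dim=-1))  # (B, T, D)
        init_params = self.regress(fused)                              # (B, T, P) per-frame estimate
        refined, _ = self.refiner(init_params)                         # (B, T, P) LSTM-refined
        return refined

# Toy usage: a window of 8 frames with 256-D features.
model = SpatioTemporalAggregator()
print(model(torch.randn(2, 8, 256)).shape)  # torch.Size([2, 8, 85])
```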
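Similarly, the core ConVol-E idea can be sketched as follows, under the assumption that a set of anchor points has already been chosen on the parametric body surface; the function name, anchor count, and toy coordinates below are purely illustrative. Each 3D point is embedded by its vector of Euclidean distances to the anchors, so points playing the same role relative to the body receive similar embeddings across subjects.

```python
# Illustrative sketch only (hypothetical anchors and shapes), not the thesis implementation.
import numpy as np

def convol_e_embedding(points, anchors):
    """Embed each 3D point by its Euclidean distances to the body-surface anchors.

    points  : (N, 3) arbitrary 3D points around the body (e.g., clothing, hair).
    anchors : (K, 3) anchor/key points chosen on the parametric body surface.
    returns : (N, K) embeddings; rows carry a similar meaning across subjects
              because the anchors are defined on the shared body model.
    """
    diffs = points[:, None, :] - anchors[None, :, :]   # (N, K, 3) pairwise offsets
    return np.linalg.norm(diffs, axis=-1)              # (N, K) Euclidean distances

# Toy usage: 4 anchors, 2 query points.
anchors = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.2]])
queries = np.array([[0.1, 0.4, 0.05], [0.0, 0.9, 0.1]])
print(convol_e_embedding(queries, anchors).round(3))   # (2, 4) embedding matrix
```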
Finally, we provide a summary of our contributions and discuss potential future research directions in this problem domain. We believe that this thesis advances the research landscape for 3D human pose, shape, and correspondence estimation and helps accelerate progress in this direction.
Year of completion: June 2024
Advisor: Avinash Sharma