Surface Reconstruction


Recovering a 3D human body shape from a monocular image is an ill-posed problem in computer vision, yet one of great practical importance for many applications, including virtual and augmented reality platforms, the animation industry, the e-commerce domain, etc.

[Video]

HumanMeshNet: Polygonal Mesh Recovery of Humans

Abstract:

3D human body reconstruction from a monocular image is an important problem in computer vision with applications in virtual and augmented reality platforms, the animation industry, the e-commerce domain, etc. While several of the existing works formulate it as volumetric or parametric learning with complex and indirect reliance on re-projections of the mesh, we focus on implicitly learning the mesh representation. To that end, we propose a novel model, HumanMeshNet, that regresses a template mesh's vertices, as well as receives regularization from the 3D skeletal locations, in a multi-branch, multi-task setup. The image-to-mesh vertex regression is further regularized by the neighborhood constraint imposed by the mesh topology, ensuring smooth surface reconstruction. The proposed paradigm can theoretically learn local surface deformations induced by body shape variations and can therefore learn high-resolution meshes going ahead. We show performance comparable to the state of the art (in terms of surface and joint error) with far lower computational complexity and modeling cost, and therefore real-time reconstruction, on three publicly available datasets. We also show the generalizability of the proposed paradigm to the similar task of predicting hand mesh models. Given these initial results, we would like to exploit the mesh topology in an explicit manner going ahead.

Method

In this paper, we attempt to work in between a generic point cloud and a mesh - i.e., we learn an "implicitly structured" point cloud. Producing high-resolution meshes is then a natural extension that is easier in 3D space than in the parametric one. We present an initial solution in that direction - HumanMeshNet - which simultaneously performs shape estimation by regressing to template mesh vertices (by minimizing a surface loss) and receives body pose regularization from a parallel branch in a multi-task setup. Ours is a relatively simple model compared to the majority of existing methods for volumetric and parametric model prediction (e.g., BodyNet). This makes it efficient in terms of network size as well as feed-forward time, yielding significantly higher frame-rate reconstructions. At the same time, our simpler network achieves accuracy comparable, in terms of surface and joint error, to the majority of state-of-the-art techniques on three publicly available datasets. The proposed paradigm can theoretically learn local surface deformations induced by body shape variations, which the PCA space of parametric body models can't capture. In addition to predicting the body model, we also show the generalizability of our proposed idea to a similar task with a different structure - non-rigid hand mesh reconstruction from a monocular image.

Pipeline

We propose HumanMeshNet, a novel multi-task 3D human mesh reconstruction pipeline. Given a monocular RGB image (a), we first extract a body part-wise segmentation mask using DensePose (b). Then, using a joint embedding of both the RGB image and the segmentation mask (c), we predict the 3D joint locations (d) and the 3D mesh (e) in a multi-task setup. The 3D mesh is predicted by first applying a mesh regularizer on the predicted point cloud. Finally, the loss is minimized on both branches (d) and (e).
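As a rough illustration, the multi-task objective sketched above - a surface loss on template-mesh vertices, a joint loss from the parallel branch, and a neighborhood smoothness term from the mesh topology - could look like the following. The loss weights and the exact form of each term are our assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def humanmeshnet_style_loss(pred_verts, gt_verts, pred_joints, gt_joints,
                            neighbors, w_surface=1.0, w_joints=1.0, w_smooth=0.5):
    """Hypothetical sketch of the multi-task objective: a per-vertex surface
    loss, a 3D joint loss from the parallel branch, and a neighborhood term
    pulling each predicted vertex toward the mean of its mesh neighbors."""
    # Surface loss: mean squared distance between predicted and GT vertices.
    surface = np.mean(np.sum((pred_verts - gt_verts) ** 2, axis=1))
    # Joint loss: same, on the 3D skeletal locations from the other branch.
    joints = np.mean(np.sum((pred_joints - gt_joints) ** 2, axis=1))
    # Smoothness: neighbors[i] holds the vertex indices adjacent to vertex i
    # in the fixed template topology.
    smooth = np.mean([np.sum((pred_verts[i] - pred_verts[nbrs].mean(axis=0)) ** 2)
                      for i, nbrs in enumerate(neighbors)])
    return w_surface * surface + w_joints * joints + w_smooth * smooth
```

The smoothness term is what makes the output an "implicitly structured" point cloud rather than a free-form one: each vertex is tied to its template neighbors.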

Contributions:

  • We propose a simple end-to-end multi-branch, multi-task deep network that exploits a "structured point cloud" to recover a smooth and fixed topology mesh model from a monocular image.
  • The proposed paradigm can theoretically learn local surface deformations induced by body shape variations which the PCA space of parametric body models can't capture.
  • The simplicity of the model makes it efficient in terms of network size as well as feed forward time yielding significantly high frame-rate reconstructions, while simultaneously achieving comparable accuracy in terms of surface and joint error, as shown on three publicly available datasets.
  • We also show the generalizability of our proposed paradigm for a similar task of reconstructing the hand mesh models from a monocular image.

Related Publication:

  • Abbhinav Venkat, Chaitanya Patel, Yudhik Agrawal, Avinash Sharma - HumanMeshNet: Polygonal Mesh Recovery of Humans, (ICCV-3DRW 2019). [pdf], [Video], [Code]

Towards Automatic Face-to-Face Translation


ACM Multimedia 2019

[Code]   [Data]

F2FT banner

Given a speaker speaking in a language L$_A$ (Hindi in this case), our fully-automated system generates a video of the speaker speaking in L$_B$ (English). Here, we illustrate a potential real-world application of such a system where two people can engage in a natural conversation in their own respective languages.

Abstract

In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach to what we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact in multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages.


Paper

  • Towards Automatic Face-to-Face Translation

    Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Jerin Philip, Abhishek Jha, Vinay Namboodiri and C.V. Jawahar
    ACM Multimedia, 2019.
    [PDF]

    @InProceedings{prajwal2019f2ft,
      author = {Prajwal Renukanand and Rudrabha Mukhopadhyay and Jerin Philip and Abhishek Jha and Vinay Namboodiri and C. V. Jawahar},
      title = {Towards Automatic Face-to-Face Translation},
      booktitle = {ACM Multimedia (ACMMM)},
      publisher = {ACM},
      year = {2019}
    }

Speech-to-Speech Translation

Our system can be broadly divided into two sub-systems. We perform speech-to-speech translation by combining ASR, NMT and TTS. We first use a publicly available ASR to obtain the text transcript: DeepSpeech for English, and suitable publicly available ASR systems for other languages such as Hindi and French. We train our own NMT system for different Indian languages using Facebook AI Research's publicly available codebase, and finally train a TTS for each language of our choice. We also curate datasets for Indian languages.
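The cascade above can be sketched as a simple composition of three modules. The `asr`, `nmt` and `tts` callables here are hypothetical placeholders for the actual components (e.g., DeepSpeech, the FAIR NMT codebase and the trained TTS), not their real APIs:

```python
def speech_to_speech(audio, src_lang, tgt_lang, asr, nmt, tts):
    """Sketch of the cascaded speech-to-speech pipeline.

    asr(audio, lang)        -> source-language transcript
    nmt(text, src, tgt)     -> translated text
    tts(text, lang)         -> synthesized speech in the target language
    """
    transcript = asr(audio, lang=src_lang)                 # speech -> text
    translated = nmt(transcript, src=src_lang, tgt=tgt_lang)  # text -> text
    return tts(translated, lang=tgt_lang)                  # text -> speech
```

Each stage is swappable, which is why a publicly available ASR can be paired with an in-house NMT and TTS. The translated audio from this cascade is what feeds the LipGAN visual module.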

SplineNet: B-spline neural network for efficient classification of 3D data


ICVGIP motivation

Abstract:

The majority of recent deep learning pipelines for 3D shapes use a volumetric representation, extending the concept of 2D convolution to the 3D domain. Nevertheless, the volumetric representation poses a serious computational disadvantage, as most of the voxel grid is empty, resulting in redundant computation. Moreover, a 3D shape is determined by its surface, and hence performing convolutions on the voxels inside the shape is a waste of computation. In this paper, we focus on constructing a novel, fast and robust characterization of 3D shapes that accounts for local geometric variations as well as global structure. We build upon the learning scheme of the Field Probing Neural Network [FPNN] by introducing sets of B-spline surfaces instead of point filters, in order to sense complex geometrical structures (large curvature variations). The locations of these surfaces are initialized over the voxel space and are learned during the training phase. We propose SplineNet, a deep network consisting of B-spline surfaces for the classification of input 3D data represented in a volumetric grid. We derive analytical solutions for the updates of the B-spline surfaces during backpropagation.

Method

In this paper, we focus on constructing a novel, fast and robust characterization of 3D shapes that accounts for local information as well as global geometry. We build upon the learning scheme of [FPNN] by introducing sets of B-spline surfaces instead of point filters, in order to sense complex geometrical structures (large curvature variations). The locations of these surfaces are initialized randomly over the voxel space and are learned during the training phase. We modify the dot-product layer of [FPNN] to aggregate local sensing and provide a global characterization of the input data.
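As a minimal sketch (not the paper's code), evaluating a point on a bicubic B-spline patch from its 4x4 control points looks like the following. The position depends analytically on the control points - dS/dP_ij = B_i(u) B_j(v) - which is the kind of closed-form gradient that makes backpropagation updates of the surfaces possible:

```python
import numpy as np

def cubic_bspline_basis(t):
    """Uniform cubic B-spline basis functions; they sum to 1 for any t in [0, 1]."""
    return np.array([(1 - t) ** 3,
                     3 * t ** 3 - 6 * t ** 2 + 4,
                     -3 * t ** 3 + 3 * t ** 2 + 3 * t + 1,
                     t ** 3]) / 6.0

def bspline_surface_point(ctrl, u, v):
    """Evaluate one point on a bicubic B-spline patch.

    ctrl: (4, 4, 3) control points; u, v in [0, 1].
    S(u, v) = sum_ij B_i(u) * B_j(v) * P_ij, a bilinear form in the
    control points, hence trivially differentiable w.r.t. them.
    """
    Bu, Bv = cubic_bspline_basis(u), cubic_bspline_basis(v)
    return np.einsum('i,j,ijk->k', Bu, Bv, ctrl)
```

Because the basis functions form a partition of unity, a patch whose control points coincide collapses to that single point - a useful sanity check when learning the control-point locations.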

ICVGIP motivation

Figure: Overview of our SplineNet architecture. Input shapes represented as volumetric fields are fed to the SplinePatch layer for effective local sensing, which is then optionally passed to a Gaussian layer to retain values near surface boundaries. The LocalAggregation layer accumulates local sensing to give a local-geometry-aware global characterization of the input shapes. The resulting characterization is fed to Fully Connected (FC) layers, from which the class label is predicted.

Contributions:

  • We propose SplineNet, a deep network consisting of B-spline surfaces for the classification of input 3D data represented in volumetric fields. To the best of our knowledge, parametric curves and surfaces have not previously been used in a deep-network learning setup for classification.
  • We derive analytical solutions for the updates of the B-spline surfaces during backpropagation.

Related Publication:

  • Sai Sagar Jinka, Avinash Sharma - SplineNet: B-spline neural network for efficient classification of 3D data, Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP2018). [pdf]

Point Cloud Analysis


In our research group, we are working on LiDAR scans, especially those captured from airborne scanners, and we are attempting to semantically segment them. As this data is far larger than the indoor scans in datasets such as Matterport, S3DIS or ScanNet, we are exploring methods based on graph convolutions, superpoint graphs, etc. Use cases of this segmentation task include the inspection and maintenance of power-line cables and towers. Given the lack of precisely labelled data for deep learning based solutions, we are using statistical properties for segmentation. This lack of data is also motivating us to simulate LiDAR scans that could be used for the generation of virtual 3D scenes.
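As an illustration of segmentation from statistical properties, one common approach (a generic technique we assume here, not necessarily our exact method) computes eigenvalue-based shape features over local point neighborhoods; points on cables score high on linearity, points on ground or walls on planarity:

```python
import numpy as np

def eigen_features(points):
    """Statistical shape features of a local point neighborhood.

    points: (N, 3) array of a point's neighbors.
    Returns (linearity, planarity, scattering), computed from the sorted
    eigenvalues l1 >= l2 >= l3 of the neighborhood's 3x3 covariance.
    """
    cov = np.cov(points.T)                       # 3x3 covariance matrix
    l1, l2, l3 = sorted(np.linalg.eigvalsh(cov), reverse=True)
    linearity = (l1 - l2) / l1                   # ~1 for wire-like neighborhoods
    planarity = (l2 - l3) / l1                   # ~1 for flat neighborhoods
    scattering = l3 / l1                         # ~1 for volumetric clutter
    return linearity, planarity, scattering
```

Thresholding or clustering these per-point features gives a label-free segmentation signal, which is attractive precisely when annotated airborne LiDAR data is scarce.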

We are working on formulating a novel approach towards the generation of realistic 3D environments by simulating LiDAR scans, initially targeted towards scenes from forest environments involving trees, mountains, low vegetation, grasslands, etc. The generation of such realistic scenes finds applications in AR/VR, smart-city planning, asset management, distribution systems, etc. The major problem we are solving is the generation of semantically coherent sets of points that comprise the large objects in the scene. Please watch this space for more updates on this work.

LIDAR2

A sample LiDAR scan of a street from the Semantic3D dataset

Leveraging the power of deep learning, we want to come up with a strong deep-learning counterpart of the B-SHOT descriptor. Engineered keypoint descriptors have evolved over the last two decades, but in the deep learning era such descriptors are becoming obsolete for images. Analogously, for 3D, with large annotated datasets such as Matterport, S3DIS or ScanNet, the research community has started exploring deep learning versions of existing 3D keypoint descriptors. Here, we are exploring unsupervised methods that simultaneously learn keypoint detection and the generation of robust descriptors.

multiple pcls

Indoor point clouds captured with two different sensors, from the ScanNet dataset

3D Human body estimation


Reconstructing textured 3D models of human shapes from images is a challenging task, as the geometry of non-rigid human shapes evolves over time, yielding a large space of complex body poses as well as shape variations. In addition, there are several other challenges such as self-occlusions by body parts, obstructions due to free-form clothing, background clutter (in a non-studio setup), a sparse set of cameras with non-overlapping fields of view, sensor noise, etc., as shown in the figure below.

BMVC Motivation
Challenges in non-rigid reconstruction. (a) Complex poses (b) Clothing obstructions (c) Shape variations (d) Background clutter

Deep Textured 3D reconstruction of human bodies

Abstract:

Recovering textured 3D models of non-rigid human body shapes is challenging due to self-occlusions caused by complex body poses and shapes, clothing obstructions, lack of surface texture, background clutter, a sparse set of cameras with non-overlapping fields of view, etc. Further, a calibration-free environment adds additional complexity to both reconstruction and texture recovery. In this paper, we propose a deep learning based solution for textured 3D reconstruction of human body shapes from a single-view RGB image. This is achieved by first recovering the volumetric grid of the non-rigid human body given a single-view RGB image, followed by orthographic texture-view synthesis using the respective depth projection of the reconstructed (volumetric) shape and the input RGB image. We propose to co-learn the depth information readily available with affordable RGBD sensors (e.g., Kinect) while showing multiple views of the same object during the training phase. We show superior reconstruction performance, in terms of quantitative and qualitative results, on both publicly available datasets (by simulating the depth channel with a virtual Kinect) and real RGBD data collected with our calibrated multi-Kinect setup.

Method

In this work, we propose a deep learning based solution for textured 3D reconstruction of human body shapes given an input RGB image, in a calibration-free environment. Given a single-view RGB image, both reconstruction and texture generation are ill-posed problems. Thus, we propose to co-learn depth cues (using depth images obtained from affordable sensors like Kinect) with RGB images while training the network. This helps the network learn the space of complex body poses, which otherwise is difficult with just the 2D content of RGB images. Although we train the reconstruction network with multi-view RGB and depth images (shown one at a time during training), co-learning them with shared filters enables us to recover 3D volumetric shapes from just a single RGB image at test time. Apart from the challenge of non-rigid poses, the depth information also helps in addressing the challenges caused by cluttered backgrounds, shape variations and free-form clothing. Our texture recovery network uses a variational autoencoder to generate orthographic texture images of the reconstructed body models, which are subsequently back-projected to recover a textured 3D mesh model.

BMVC Pipeline

Proposed end-to-end deep learning pipeline for reconstructing textured non-rigid 3D human body models. Using a single view perspective RGB image (a), we perform a voxelized 3D reconstruction (c) using the reconstruction network (b). Then, to add texture to the generated 3D model, we first convert the voxels to a mesh representation using Poisson’s surface reconstruction algorithm [9] and capture its four orthographic depth maps (d). These are fed as an input to the texture recovery network (f), along with the perspective RGB views used for reconstruction (a). The texture recovery network produces orthographic RGB images (g), that are back-projected onto the reconstructed model, to obtain the textured 3D model (h).
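A minimal sketch (under our own simplifying assumptions, not the paper's code) of how one orthographic depth map in step (d) could be extracted from a binary voxel grid: for each (x, y) pixel, the depth is the index of the first occupied voxel along the viewing axis:

```python
import numpy as np

def orthographic_depth(voxels):
    """Front-view orthographic depth map from a binary voxel grid.

    voxels: (X, Y, Z) occupancy grid, with Z as the viewing axis.
    Returns an (X, Y) map holding the index of the first occupied voxel
    along Z, or -1 where the viewing ray hits nothing. The other three
    orthographic views follow by flipping/transposing the grid first.
    """
    occupied = voxels.astype(bool)
    first = np.argmax(occupied, axis=2)   # index of the first True along Z
    hit = occupied.any(axis=2)            # mask of rays that hit the shape
    return np.where(hit, first, -1)
```

Such depth maps, paired with the input RGB view, are what the texture recovery network consumes to synthesize the orthographic RGB images in step (g).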

Contributions:

  • We introduce a novel deep learning pipeline to obtain textured 3D models of non-rigid human body shapes from a single image. To the best of our knowledge, reconstructing non-rigid shapes in a volumetric form (whose advantages we demonstrate) has not yet been attempted in the literature. Further, this is an initial effort towards single-view non-rigid reconstruction and texture recovery in an end-to-end manner.
  • We demonstrate the importance of depth cues (used only at train time) for the task of non-rigid reconstruction. This is achieved by our novel training methodology of alternating RGB and D in order to capture the large space of pose and shape deformation.
  • We show that our model can partially handle non-rigid deformations induced by free form clothing, as we do not impose any model constraint while training the volumetric reconstruction network.
  • We propose to use depth cues for texture recovery in a variational autoencoder setup. This is the first attempt to do so in the texture-synthesis literature.

Related Publication:

  • Abhinav Venkat, Sai Sagar Jinka, Avinash Sharma - Deep Textured 3D reconstruction of human bodies, British Machine Vision Conference (BMVC2018). [pdf]