Towards Automatic Face-to-Face Translation

ACM Multimedia 2019

[Code]   [Data]

F2FT banner

Given a speaker speaking in a language L$_A$ (Hindi in this case), our fully-automated system generates a video of the speaker speaking in L$_B$ (English). Here, we illustrate a potential real-world application of such a system where two people can engage in a natural conversation in their own respective languages.


In light of recent breakthroughs in automatic machine translation systems, we propose a novel approach to what we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact in multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language processing. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages.


  • Face-to-Face Paper
    Towards Automatic Face-to-Face Translation

    Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Jerin Philip, Abhishek Jha, Vinay Namboodiri and C.V. Jawahar
    Towards Automatic Face-to-Face Translation, ACM Multimedia, 2019.
    [PDF] |


Speech-to-Speech Translation

Our system can be broadly divided into two sub-systems. We perform speech-to-speech translation by combining ASR, NMT and TTS. We first use a publicly available ASR system to obtain the text transcript: DeepSpeech for transcribing English from audio, and suitable publicly available ASR systems for other languages such as Hindi and French. We train our own NMT systems for different Indian languages using Facebook AI Research's publicly available codebase. Finally, we train a TTS model for each language of our choice. We also curate datasets for Indian languages.
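The ASR → NMT → TTS chain described above can be sketched as a simple function composition. The wrapper functions below are hypothetical stand-ins, not the actual DeepSpeech, fairseq or TTS interfaces used in our system:

```python
def speech_to_speech(audio, asr_model, nmt_model, tts_model):
    """Chain the three stages of the speech-to-speech sub-system.
    Each *_model argument is any callable with the right signature;
    in practice these would wrap DeepSpeech (or another ASR), a
    trained NMT model, and a trained TTS model for the target language."""
    text_src = asr_model(audio)      # ASR: source audio  -> source text
    text_tgt = nmt_model(text_src)   # NMT: source text   -> target text
    return tts_model(text_tgt)       # TTS: target text   -> target audio

# Toy demonstration with dummy models standing in for the real ones:
asr = lambda audio: "namaste duniya"            # pretend Hindi transcript
nmt = lambda text: "hello world"                # pretend Hindi->English NMT
tts = lambda text: "<audio:" + text + ">"       # pretend English TTS
assert speech_to_speech(b"raw-audio", asr, nmt, tts) == "<audio:hello world>"
```

In the full Face-to-Face pipeline, the synthesized target-language audio is then passed to LipGAN together with the source video frames to produce the lip-synced output.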

Point Cloud Analysis

In our research group, we work on LiDAR scans, especially those captured from airborne scanners, which we aim to segment semantically. Since this data is far larger than the indoor scans in datasets such as Matterport, S3DIS or ScanNet, we are exploring methods based on graph convolutions, superpoint graphs, etc. Use cases of this segmentation task include the inspection and maintenance of power-line cables and towers. Given the lack of precisely labelled data for deep-learning-based solutions, we rely on statistical properties for segmentation. This lack of data also motivates us to work on simulating LiDAR scans that could be used for the generation of virtual 3D scenes.

We are working on formulating a novel approach to generating realistic 3D environments by simulating LiDAR scans, initially targeted towards scenes from forest environments involving trees, mountains, low vegetation, grasslands, etc. Generation of such realistic scenes finds applications in AR/VR, smart-city planning, asset management, distribution systems, etc. The major problem we are solving is the generation of a semantically coherent set of points comprising the large objects in the scene. Please watch this space for more updates on this work.


A sample LiDAR scan of a street from the Semantic3D dataset

Leveraging the power of deep learning, we want to come up with a descriptor that is, in effect, a deep learning version of the BShot descriptor. Engineered keypoint descriptors have evolved over the last two decades, but with the deep learning era, such descriptors for images are becoming obsolete. Analogously, for 3D, with large annotated datasets such as Matterport, S3DIS and ScanNet, the research community has started exploring deep learning versions of existing 3D keypoint descriptors. Here, we are exploring unsupervised methods to jointly learn keypoint detection and the generation of robust descriptors.

multiple pcls

Indoor point clouds captured with two different sensors, from the ScanNet dataset

3D Human body estimation

Reconstructing textured 3D models of human shapes from images is a challenging task, as the geometry of non-rigid human shapes evolves over time, yielding a large space of complex body poses as well as shape variations. In addition, there are several other challenges such as self-occlusions by body parts, obstructions due to free-form clothing, background clutter (in a non-studio setup), sparse sets of cameras with non-overlapping fields of view, sensor noise, etc., as shown in the figure below.

BMVC Motivation
Challenges in non-rigid reconstruction. (a) Complex poses (b) Clothing obstructions (c) Shape variations (d) Background clutter

Deep Textured 3D reconstruction of human bodies


Recovering textured 3D models of non-rigid human body shapes is challenging due to self-occlusions caused by complex body poses and shapes, clothing obstructions, lack of surface texture, background clutter, sparse sets of cameras with non-overlapping fields of view, etc. Further, a calibration-free environment adds additional complexity to both reconstruction and texture recovery. In this paper, we propose a deep learning based solution for textured 3D reconstruction of human body shapes from a single view RGB image. This is achieved by first recovering the volumetric grid of the non-rigid human body given a single view RGB image, followed by orthographic texture view synthesis using the respective depth projection of the reconstructed (volumetric) shape and the input RGB image. We propose to co-learn the depth information readily available with affordable RGBD sensors (e.g., Kinect) while showing multiple views of the same object during the training phase. We show superior reconstruction performance in terms of quantitative and qualitative results on both publicly available datasets (by simulating the depth channel with a virtual Kinect) as well as real RGBD data collected with our calibrated multi-Kinect setup.


In this work, we propose a deep learning based solution for textured 3D reconstruction of human body shapes given an input RGB image, in a calibration-free environment. Given a single view RGB image, both reconstruction and texture generation are ill-posed problems. Thus, we propose to co-learn depth cues (using depth images obtained from affordable sensors like Kinect) with RGB images while training the network. This helps the network learn the space of complex body poses, which is otherwise difficult with just the 2D content of RGB images. Although we train the reconstruction network with multi-view RGB and depth images (shown one at a time during training), co-learning them with shared filters enables us to recover 3D volumetric shapes using just a single RGB image at test time. Apart from the challenge of non-rigid poses, the depth information also helps in addressing the challenges caused by cluttered backgrounds, shape variations and free-form clothing. Our texture recovery network uses a variational autoencoder to generate orthographic texture images of reconstructed body models, which are subsequently backprojected to recover a textured 3D mesh model.

BMVC Pipeline

Proposed end-to-end deep learning pipeline for reconstructing textured non-rigid 3D human body models. Using a single view perspective RGB image (a), we perform a voxelized 3D reconstruction (c) using the reconstruction network (b). Then, to add texture to the generated 3D model, we first convert the voxels to a mesh representation using Poisson’s surface reconstruction algorithm [9] and capture its four orthographic depth maps (d). These are fed as an input to the texture recovery network (f), along with the perspective RGB views used for reconstruction (a). The texture recovery network produces orthographic RGB images (g), that are back-projected onto the reconstructed model, to obtain the textured 3D model (h).
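To make step (d) concrete, an orthographic depth map of a binary voxel grid simply records, for each pixel of a view, the first occupied voxel along the viewing axis. The function below is a minimal pure-Python sketch with our own naming and data layout; it is not the paper's implementation, which computes four such views of the reconstructed model:

```python
def orthographic_depth(voxels):
    """Front-facing orthographic depth map of a binary voxel grid.
    voxels[x][y][z] is 1 if occupied; for every (x, y) pixel we record
    the smallest z index with an occupied voxel, or -1 for empty rays."""
    nx, ny, nz = len(voxels), len(voxels[0]), len(voxels[0][0])
    depth = [[-1] * ny for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                if voxels[x][y][z]:
                    depth[x][y] = z   # first hit along the viewing axis
                    break
    return depth

# Tiny 4x4x4 example with a single occupied voxel at (2, 3, 1):
grid = [[[0] * 4 for _ in range(4)] for _ in range(4)]
grid[2][3][1] = 1
d = orthographic_depth(grid)
assert d[2][3] == 1 and d[0][0] == -1
```

The side, back and top views are obtained the same way by iterating along the other axes (or reversing the scan direction), giving the four depth maps fed to the texture recovery network.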


  • We introduce a novel deep learning pipeline to obtain textured 3D models of non-rigid human body shapes from a single image. To the best of our knowledge, reconstruction of non-rigid shapes in a volumetric form (whose advantages we demonstrate) has not yet been attempted in the literature. Further, this is an initial effort in the direction of single view non-rigid reconstruction and texture recovery in an end-to-end manner.
  • We demonstrate the importance of depth cues (used only at train time) for the task of non-rigid reconstruction. This is achieved by our novel training methodology of alternating RGB and D inputs in order to capture the large space of pose and shape deformation.
  • We show that our model can partially handle non-rigid deformations induced by free-form clothing, as we do not impose any model constraint while training the volumetric reconstruction network.
  • We propose to use depth cues for texture recovery in the variational autoencoder setup. This is the first attempt to do so in the texture synthesis literature.

Related Publication:

  • Abhinav Venkat, Sai Sagar Jinka, Avinash Sharma - Deep Textured 3D reconstruction of human bodies, British Machine Vision Conference (BMVC2018). [pdf]

SplineNet: B-spline neural network for efficient classification of 3D data

ICVGIP motivation


The majority of recent deep learning pipelines for 3D shapes use a volumetric representation, extending the concept of 2D convolution to the 3D domain. Nevertheless, the volumetric representation poses a serious computational disadvantage, as most of the voxel grid is empty, resulting in redundant computation. Moreover, a 3D shape is determined by its surface, and hence performing convolutions on the voxels inside the shape is a sheer waste of computation. In this paper, we focus on constructing a novel, fast and robust characterization of 3D shapes that accounts for local geometric variations as well as global structure. We build upon the learning scheme of the Field Probing Neural Network [FPNN] by introducing sets of B-spline surfaces instead of point filters, in order to sense complex geometrical structures (large curvature variations). The locations of these surfaces are initialized over the voxel space and are learned during the training phase. We propose SplineNet, a deep network consisting of B-spline surfaces for classification of input 3D data represented in a volumetric grid. We derive analytical solutions for the updates of the B-spline surfaces during back-propagation.


In this paper, we focus on constructing a novel, fast and robust characterization of 3D shapes that accounts for local information as well as global geometry. We build upon the learning scheme of [FPNN] by introducing sets of B-spline surfaces instead of point filters, in order to sense complex geometrical structures (large curvature variations). The locations of these surfaces are initialized randomly over the voxel space and are learned during the training phase. We modify the dot product layer of [FPNN] to aggregate local sensing and provide a global characterization of the input data.
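The B-spline surfaces above are standard tensor-product B-splines. As background, here is a minimal sketch of evaluating such a surface via the Cox–de Boor recursion; this illustrates the parametric form only, not the SplineNet implementation, where the surface locations are the learned parameters:

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: i-th B-spline basis function of degree k."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + k] - knots[i]
    if d1 > 0:
        left = (t - knots[i]) / d1 * bspline_basis(i, k - 1, t, knots)
    d2 = knots[i + k + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + k + 1] - t) / d2 * bspline_basis(i + 1, k - 1, t, knots)
    return left + right

def surface_point(P, u, v, degree, ku, kv):
    """Tensor-product B-spline surface: S(u,v) = sum_ij N_i(u) N_j(v) P[i][j].
    P is an m x n grid of scalar control values; ku, kv are knot vectors."""
    s = 0.0
    for i in range(len(P)):
        Ni = bspline_basis(i, degree, u, ku)
        if Ni == 0.0:
            continue  # basis has local support; skip inactive rows
        for j in range(len(P[0])):
            s += Ni * bspline_basis(j, degree, v, kv) * P[i][j]
    return s

# Degree-1 patch over [0,1]^2 reduces to bilinear interpolation:
P = [[0.0, 1.0], [2.0, 3.0]]
ku = kv = [0.0, 0.0, 1.0, 1.0]
assert abs(surface_point(P, 0.5, 0.5, 1, ku, kv) - 1.5) < 1e-9
```

During training, SplineNet back-propagates through this evaluation to update the surface locations, which is why closed-form derivatives of the basis functions are useful.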

ICVGIP motivation

Figure: Overview of our SplineNet architecture. Input shapes represented as volumetric fields are fed to the SplinePatch layer for effective local sensing, the output of which is optionally passed to a Gaussian layer to retain values near surface boundaries. The LocalAggregation layer accumulates local sensing to give a local geometry aware global characterization of the input shapes. The resulting characterization is fed to Fully Connected (FC) layers, from which the class label is predicted.


  • We propose SplineNet, a deep network consisting of B-spline surfaces for classification of input 3D data represented as volumetric fields. To the best of our knowledge, parametric curves and surfaces have not previously been used in a deep neural network learning setup for classification applications.
  • We derive analytical solutions for the updates of the B-spline surfaces during back-propagation.

Related Publication:

  • Sai Sagar Jinka, Avinash Sharma - SplineNet: B-spline neural network for efficient classification of 3D data, Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP2018). [pdf]

3D Computer Vision


We are interested in computer vision and machine learning with a focus on 3D scene understanding, reconstruction, etc. In particular, we deal with problems where the human body is reconstructed from 2D images and analysed in 3D, as well as registration of point clouds of indoor scenes captured from commodity sensors and of large outdoor scenes captured from LiDAR scanners. As 3D data is represented in various formats, such as meshes, point clouds and volumetric representations to name a few, we also design novel algorithms for making 3D data compatible with deep learning frameworks. We investigate how complex prior knowledge can be incorporated into computer vision algorithms to make them robust to variations in our complex 3D world. The links below give detailed descriptions of our individual works.

3D Human body estimation

People Involved: Dr. Avinash Sharma, Sai Sagar J, Abbhinav Venkat, Chaitanya Patal, Anubhab Sen, Saurabh Rajguru, Neeraj Bhattan, Yudhik Aggarwal, Himansh Sheron

Humans are often the subject of photographs, and detecting them and analyzing their shape and pose in 3D is significant for applications such as AR/VR, motion capture, etc. Our research objective in this work is to recover the 3D human body from a single RGB image. [See More] optimize1 video2

Classification of 3D data

People Involved: Dr. Avinash Sharma, Sai Sagar J and Raj Manvar

Apart from the acquisition of 3D shapes, analyzing these shapes is also an important task. There are several shape analysis tasks, such as retrieval, classification, dense and sparse correspondence between two shapes, segmentation, etc. At the core of 3D shape analysis lie 3D shape descriptors. Descriptor construction is usually application dependent. One expects a descriptor to be discriminative, invariant to some transformations or noise, and compact, i.e. low dimensional. Recently, a few deep learning models for shape descriptors have appeared. Nevertheless, many existing deep learning works do not exploit the fact that 3D shapes are boundary based, i.e. much of the information essential for classification hinges on the boundary, and performing convolution in volumetric space hinders computational efficiency. In the following work, we address these particular drawbacks and construct a novel descriptor that provides a local geometry aware global characterization of rigid 3D shapes. [See More]
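As a classic example of the descriptor properties discussed above (discriminative, transformation-invariant, compact), consider the D2 shape distribution of Osada et al.: a histogram of distances between random point pairs on the shape. The sketch below illustrates that idea only; it is not the descriptor proposed in our work:

```python
import math
import random

def d2_descriptor(points, n_pairs=2000, n_bins=16, seed=0):
    """D2 shape distribution: a normalized histogram of pairwise distances
    between randomly sampled points. Rotation- and translation-invariant
    (distances are unchanged) and compact (n_bins numbers per shape)."""
    rng = random.Random(seed)  # fixed seed -> deterministic sampling
    dists = [math.dist(rng.choice(points), rng.choice(points))
             for _ in range(n_pairs)]
    dmax = max(dists) or 1.0   # normalizing by dmax also removes scale
    hist = [0] * n_bins
    for d in dists:
        hist[min(int(d / dmax * n_bins), n_bins - 1)] += 1
    return [h / n_pairs for h in hist]

# Unit-cube corners: the descriptor is unchanged under translation.
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
shifted = [(x + 5, y - 2, z + 7) for (x, y, z) in cube]
assert d2_descriptor(cube) == d2_descriptor(shifted)
```

Hand-crafted descriptors of this kind are exactly what the deep-learned characterization in the work below aims to improve upon.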

mesh pcl

Point Cloud Analysis

People Involved: Dr. Avinash Sharma, Ashish Kubade and Nikhilendra

Working with point clouds involves tackling inherent challenges such as sparseness, lack of order among points, invariance to viewpoints, etc. For LiDAR scans, we additionally have to handle the scale of the data, as a typical outdoor LiDAR scan has points in the order of millions. Breaking such a scene into smaller chunks might be an initial guess; however, working on such small chunks ultimately results in the loss of the global semantics of the scene. [See More]

plc 02

LiDAR scan of a valley