PeeledHuman: Robust Shape Representation for Textured 3D Human Body Reconstruction


Sai Sagar Jinka, Rohan Chacko  Avinash Sharma  P.J. Narayanan  

IIIT Hyderabad

    [Demo Video]   

International Conference on 3D Vision, 2020

Abstract

We introduce PeeledHuman - a novel shape representation of the human body that is robust to self-occlusions. PeeledHuman encodes the human body as a set of Peeled Depth and RGB maps in 2D, obtained by performing ray tracing on the 3D body model and extending each ray beyond its first intersection. This formulation allows us to handle self-occlusions efficiently compared to other representations. Given a monocular RGB image, we learn these Peeled maps in an end-to-end generative adversarial fashion using our novel framework - PeelGAN. We train PeelGAN using a 3D Chamfer loss and other 2D losses to generate multiple depth values per-pixel and a corresponding RGB field per-vertex in a dual-branch setup. In our simple non-parametric solution, the generated Peeled Depth maps are back-projected to 3D space to obtain a complete textured 3D shape. The corresponding RGB maps provide vertex-level texture details. We compare our method with current parametric and non-parametric methods in 3D reconstruction and find that we achieve state-of-the-art results. We demonstrate the effectiveness of our representation on the publicly available BUFF and MonoPerfCap datasets as well as on loose clothing data collected by our calibrated multi-Kinect setup.

motivation

Our proposed representation encodes a human body as a set of Peeled Depth & RGB maps from a given view. These maps are backprojected to 3D space in the camera coordinate frame to recover the 3D human body.

Method

In this paper, we tackle the problem of textured 3D human reconstruction from a single RGB image by introducing a novel shape representation, called PeeledHuman. Our proposed solution derives inspiration from the classical ray tracing approach in computer graphics. We estimate a fixed number of ray intersection points with the human body surface in the canonical view volume for every pixel in an image, yielding a multi-layered shape representation called PeeledHuman. PeeledHuman encodes a 3D shape as a set of depth maps called Peeled Depth maps. We further extend this layered representation to recover texture by capturing a discrete sampling of the continuous surface texture called Peeled RGB maps. Such a layered representation of the body shape addresses severe self-occlusions caused by complex body poses and viewpoint variations. Our representation is similar to depth peeling used in computer graphics for order-independent transparency. The proposed shape representation allows us to recover multiple 3D points that project to the same pixel in the 2D image plane. Thus, we reformulate the solution to the monocular textured 3D body reconstruction task as predicting a set of Peeled Depth & RGB maps. To achieve this dual-prediction task, we propose PeelGAN, a dual-task generative adversarial network that generates a set of depth and RGB maps in two different branches of the network. These predicted peeled maps are then back-projected to 3D space to obtain a point cloud. Our proposed representation enables an end-to-end, non-parametric and differentiable solution for textured 3D body reconstruction. It is important to note that our representation is not restricted only to human body models but can generalize well to any 3D shapes/scenes, given specific training data prior.
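The back-projection step described above can be sketched as follows. This is a minimal illustration only, assuming a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy) and assuming that a depth value of 0 marks a ray with no intersection at that layer; the function name and data layout are ours and are not taken from the PeelGAN code.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
DepthMap = List[List[float]]  # depth_map[v][u]; 0.0 is assumed to mean "no surface hit"


def backproject_peeled_depth(
    layers: List[DepthMap], fx: float, fy: float, cx: float, cy: float
) -> List[Point]:
    """Lift every valid pixel of every peeled layer to a 3D point in the camera frame."""
    points: List[Point] = []
    for depth_map in layers:
        for v, row in enumerate(depth_map):
            for u, z in enumerate(row):
                if z <= 0.0:  # this ray has no intersection at this layer
                    continue
                # Pinhole model (assumed): pixel (u, v) at depth z maps to (x, y, z).
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points


# Two 2x2 layers: pixel (u=0, v=1) sees a front surface and a second surface behind it.
layers = [
    [[0.0, 0.0], [2.0, 0.0]],  # first intersection along each ray
    [[0.0, 0.0], [2.5, 0.0]],  # second intersection along the same rays
]
cloud = backproject_peeled_depth(layers, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(cloud)
```

Because each layer contributes its own depth for the same pixel, two 3D points are recovered along a single camera ray, which is what lets the representation keep self-occluded surfaces.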

 pipeline

PeelGAN overview: The dual-branch network generates Peeled Depth (D) and RGB (R) maps from an input image. The generated maps are each fed to a discriminator: one for RGB and one for Depth maps. The generated maps are backprojected to obtain the 3D human body represented as a point cloud (p̂) in the camera coordinate frame. We employ a Chamfer loss between the reconstructed point cloud (p̂) and the ground-truth point cloud (p), along with several other 2D losses on the Peeled maps.
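For readers unfamiliar with the Chamfer loss, the sketch below shows one common symmetric variant (mean squared nearest-neighbour distance in both directions). The exact variant and weighting used in PeelGAN are not specified here, so treat this as an assumption; it is also a brute-force, non-differentiable illustration, whereas training would rely on a tensor-library implementation.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def chamfer_distance(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance, both ways."""
    if not pred or not target:
        raise ValueError("both point clouds must be non-empty")
    # For each predicted point, distance to its closest ground-truth point.
    pred_to_target = sum(
        min(squared_distance(p, q) for q in target) for p in pred
    ) / len(pred)
    # For each ground-truth point, distance to its closest predicted point.
    target_to_pred = sum(
        min(squared_distance(q, p) for p in pred) for q in target
    ) / len(target)
    return pred_to_target + target_to_pred


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(pred, target))
```

The two directions matter: the first term penalises predicted points that lie far from the ground-truth surface, and the second penalises ground-truth regions that the prediction fails to cover.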


Contributions

  • We introduce PeeledHuman - a novel shape representation of the human body encoded as a set of Peeled Depth and RGB maps, that is robust to severe self-occlusions.
  • Our proposed representation is efficient both in encoding 3D shapes and in feed-forward time, yielding higher-quality reconstructions with faster inference.
  • We propose PeelGAN - a complete end-to-end pipeline to reconstruct a textured 3D human body from a single RGB image using an adversarial approach. 
  • We introduce a challenging 3D dataset consisting of multiple human action sequences with variations in shape and pose, draped in loose clothing. We intend to release this data along with our code for academic use.

Related Publication

  • Rohan Chacko, Sai Sagar Jinka, Avinash Sharma, P.J. Narayanan - PeeledHuman: Robust Shape Representation for Textured 3D Human Body Reconstruction, International Conference on 3D Vision (3DV), 2020

Visual Speech Enhancement Without a Real Visual Stream


Sindhu Hegde*, Prajwal Renukanand*   Rudrabha Mukhopadhyay*   Vinay Namboodiri   C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

[Code]   [Paper]   [Demo Video]   [Test Sets]

WACV, 2021

We propose a novel approach to enhance the speech by hallucinating the visual stream for any given noisy audio. In contrast to the existing audio-visual methods, our approach works even in the absence of a reliable visual stream, while also performing better than audio-only works in unconstrained conditions due to the assistance of generated lip movements.

Abstract

In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods. However, these methods cannot be used for several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is close (< 3% difference) to the case of using real lips. This implies that we can exploit the advantages of using lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as qualitative human evaluations. Additional ablation studies and a demo video in the supplementary material containing qualitative comparisons and results clearly illustrate the effectiveness of our approach.


Paper

  • Sindhu Hegde*, Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    Visual Speech Enhancement Without A Real Visual Stream, WACV, 2021.
    [PDF]

    Updated Soon

Demo

Please click on this link: https://www.youtube.com/watch?v=y_oP9t7WEn4&feature=youtu.be


Contact

  1. Sindhu Hegde
  2. Prajwal K R
  3. Rudrabha Mukhopadhyay

Improving Word Recognition using Multiple Hypotheses and Deep Embeddings


Siddhant Bansal   Praveen Krishnan   C.V. Jawahar  

ICPR 2020

We propose to fuse recognition-based and recognition-free approaches for word recognition using learning-based methods. For this purpose, results obtained using a text recognizer and deep embeddings (generated using an End2End network) are fused. To further improve the embeddings, we propose EmbedNet, which uses a triplet loss for training and learns an embedding space where the embedding of the word image lies closer to the embedding of its corresponding text transcription. This updated embedding space helps in choosing the correct prediction with higher confidence. To further improve the accuracy, we propose a plug-and-play module called Confidence based Accuracy Booster (CAB). It takes in the confidence scores obtained from the text recognizer and the Euclidean distances between the embeddings, and generates an updated distance vector. This vector has lower distance values for the correct words and higher distance values for the incorrect words. We rigorously evaluate our proposed method on a collection of books in the Hindi language. Our method achieves an absolute improvement of around 10% in terms of word recognition accuracy.
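As a rough illustration of the triplet loss mentioned above, the sketch below uses the standard margin-based formulation. We assume the word image embedding acts as the anchor, the embedding of the correct transcription as the positive, and the embedding of an incorrect transcription as the negative; the margin value and function names are hypothetical and not taken from the EmbedNet code.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, for illustration only
) -> float:
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


image_embedding = [0.0, 0.0]    # anchor: word image
correct_text = [0.0, 1.0]       # positive: matching transcription
far_wrong_text = [3.0, 4.0]     # negative that is already far away
near_wrong_text = [0.0, 1.5]    # negative that is still too close

print(triplet_loss(image_embedding, correct_text, far_wrong_text))
print(triplet_loss(image_embedding, correct_text, near_wrong_text))
```

Only the second triplet produces a non-zero loss, so training effort concentrates on incorrect transcriptions that still sit close to the word image in the embedding space.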

For generating the textual transcription, we pass the word image through the CRNN and the End2End network (E2E) simultaneously. The CRNN generates multiple (K) textual transcriptions for the input image, whereas the E2E network generates the word image's embedding. The K textual transcriptions generated by the CRNN are passed through the E2E network to generate their embeddings. We pass these embeddings through the EmbedNet proposed in this work. The EmbedNet projects the input embeddings to an updated Euclidean space, from which we obtain an updated word image embedding and updated embeddings for the K transcriptions. We calculate the Euclidean distance between the word image embedding and the embedding of each of the K textual transcriptions. We then pass the distance values through the novel Confidence based Accuracy Booster (CAB), which uses them and the confidence scores from the CRNN to generate an updated list of Euclidean distances, which helps in selecting the correct prediction.
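The final selection step of the pipeline above can be sketched as a nearest-neighbour search over the K candidate transcriptions. This is a simplified illustration under our own assumptions: the embeddings are toy 2D vectors, the candidate strings are placeholders, and the CAB re-weighting is omitted because its internals are not described on this page.

```python
import math
from typing import List, Sequence, Tuple


def select_transcription(
    image_embedding: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> Tuple[str, float]:
    """Return the candidate text (and its distance) nearest to the word image embedding."""
    if not candidates:
        raise ValueError("at least one candidate transcription is required")
    scored = [
        (text, math.dist(image_embedding, text_embedding))
        for text, text_embedding in candidates
    ]
    # The smallest Euclidean distance wins.
    return min(scored, key=lambda item: item[1])


image_embedding = [0.9, 0.1]
candidates = [
    ("cat", [0.0, 1.0]),  # K = 3 hypothetical transcriptions with toy embeddings
    ("car", [1.0, 0.0]),
    ("cap", [0.5, 0.5]),
]
best_text, best_distance = select_transcription(image_embedding, candidates)
print(best_text, round(best_distance, 4))
```

In the full method, the distance list would first be updated using the recognizer's confidence scores before this minimum is taken.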

 


 

Paper

  • ArXiv: PDF
  • ICPR: Coming soon!

 

Please consider citing if you make use of this work and/or the corresponding code:
 
@misc{bansal2020improving,
      title={Improving Word Recognition using Multiple Hypotheses and Deep Embeddings}, 
      author={Siddhant Bansal and Praveen Krishnan and C. V. Jawahar},
      year={2020},
      eprint={2010.14411},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Code

This work is implemented using the PyTorch neural network framework. Code is available in the project's GitHub repository.

Dear Commissioner, please fix these: A scalable system for inspecting road infrastructure


Abstract

Inspecting and assessing the quality of traffic infrastructure (such as the state of the signboards or road markings) is challenging for humans due to (i) the massive length of roads that countries have and (ii) the regular frequency at which this needs to be done. In this paper, we demonstrate a scalable system that uses computer vision for automatic inspection of road infrastructure from a simple video captured from a moving vehicle. We validated our method on 1500 km of roads captured in and around the city of Hyderabad, India. Qualitative and quantitative results demonstrate the feasibility, scalability and effectiveness of our solution.

 


Related Publication

  • Raghava Modhugu, Ranjith Reddy and C. V. Jawahar - Dear Commissioner, please fix these: A scalable system for inspecting road infrastructure, National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2019 [pdf]

Dataset

We release a dataset of traffic sign boards on Indian roads with 51 different classes. [Dataset]

DGAZE Dataset for driver gaze mapping on road


Isha Dua   Thrupthi John   Riya Gupta   C.V. Jawahar

Mercedes Benz   IIIT Hyderabad   IIIT Hyderabad   IIIT Hyderabad

Accepted at IROS 2020

 

road view



Dataset Overview

DGAZE is a new dataset for mapping the driver's gaze onto the road. Currently, driver gaze datasets are collected using eye-tracking hardware, which is expensive and cumbersome, and thus unsuited for use during testing. Our dataset is therefore designed so that no costly equipment is required at test time. Models trained using our dataset require only a dashboard-mounted mobile phone during deployment, as our data is collected using mobile phones. We collect the data in a lab setting with a video of a road projected in front of the driver. We overcome the limitation of not using eye trackers by annotating points on the road video and asking the drivers to look at them. For more details, please refer to our paper.
[ paper ]
We collected road videos using mobile phones mounted on the dashboards of cars driven in the city. We combined the road videos to create a single 18-minute video that has a good mix of road, lighting, and traffic conditions. The road images have varied illumination, as they were captured from morning to evening in real cars on actual roads. For each frame, we annotated a single object belonging to one of the classes: car, bus, motorbike, pedestrian, auto-rickshaw, traffic signal and sign board. We also marked the center of each bounding box to serve as the ground truth for the point detection challenge. We annotated objects that typically take up a driver's attention, such as relevant signage, pedestrians, and intercepting vehicles.

Use Cases

The task of driver eye-gaze mapping can be solved as point-wise prediction or object-wise prediction. We provide annotations so that our dataset can be used for point-wise as well as object-wise prediction. Both types of eye-gaze prediction are useful. Predicting the object which the driver is looking at is useful for higher-level ADAS. This may be done by getting object candidates using an object detection algorithm and using the eye gaze to predict which object is being observed. Object prediction can be used to determine, for example, whether a driver is focusing on a pedestrian or has noticed a signboard. Point-wise predictions are much more fine-grained and are more useful for nearby objects, as they show which part of the object is being focused on. They can be used to determine the saccade patterns of the eyes or to create a heatmap of the driver's attention. They may even be converted into object-wise predictions. Our dataset allows both types of analyses to be conducted.
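One possible way to convert a point-wise gaze prediction into an object-wise one is sketched below. It is an illustration under our own assumptions, not the procedure used in the paper: a gaze point is assigned to a detected bounding box that contains it, and overlaps are resolved by choosing the box whose center is nearest. The Box type, its fields and the pixel coordinates are hypothetical.

```python
import math
from typing import List, NamedTuple, Optional, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in road-image pixel coordinates (assumed layout)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self) -> Tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def gaze_point_to_object(gaze: Tuple[float, float], boxes: List[Box]) -> Optional[Box]:
    """Return the box containing the gaze point; overlaps go to the nearest center."""
    hits = [box for box in boxes if box.contains(*gaze)]
    if not hits:
        return None  # the driver is not looking at any detected object
    return min(hits, key=lambda box: math.dist(gaze, box.center()))


boxes = [
    Box("car", 100, 200, 300, 400),
    Box("pedestrian", 250, 220, 320, 420),
]
looked_at = gaze_point_to_object((280.0, 300.0), boxes)
print(looked_at.label if looked_at else None)
```

Here the gaze point falls inside both boxes, and the pedestrian box is selected because its center is closer to the gaze point than the center of the car box.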

Citation

All documents and papers that report research results obtained using the DGAZE dataset should cite the below paper:
Citation: Isha Dua, Thrupthi Ann John, Riya Gupta and C. V. Jawahar. DGAZE: Driver Gaze Mapping on Road. In IROS 2020.


Bibtex

 @inproceedings{isha2020iros, 
title={DGAZE: Driver Gaze Mapping on Road},
author={Dua, Isha and John, Thrupthi Ann and Gupta, Riya and Jawahar, CV},
booktitle={2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020},
year={2020}
}

Download Dataset

The DGAZE dataset is available free of charge for research purposes only. Please use the link below to download the dataset (password-protected zip file). We request you to send a mail to Thrupthi Ann John and Isha Dua (contact details below) with the duly filled dataset agreement form, upon which we will send you the password.
Disclaimer: While every effort has been made to ensure accuracy, the DGAZE database owners cannot accept responsibility for errors or omissions.
[Download Dataset Agreement ] [Download Dataset] [Download README]


Acknowledgements

We would like to thank Akshay Uttama Nambi and Venkat Padmanabhan from Microsoft Research for providing us with resources to collect the DGAZE dataset. This work is partly supported by DST through the IMPRINT program. Thrupthi Ann John is supported by the Visvesvaraya Ph.D. fellowship.


Contact

For any queries about the dataset, please contact the authors below:
Isha Dua
Thrupthi Ann John