Point Cloud Analysis

In our research group, we work on LiDAR scans, especially those captured by airborne scanners, and we aim to segment these scans semantically. Because this data is far larger than indoor scans such as those in the Matterport, S3DIS or ScanNet datasets, we are exploring methods based on graph convolutions, superpoint graphs, etc. Use cases of this segmentation task include the inspection and maintenance of power line cables and towers. Given the lack of precisely labelled data for training deep learning based solutions, we use statistical properties of the points for segmentation. This lack of data also motivates us to simulate LiDAR scans, which could then be used to generate virtual 3D scenes.
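As a rough illustration of segmenting with statistical properties, one widely used family of hand-crafted point features is built from the eigenvalues λ1 ≥ λ2 ≥ λ3 of each point's neighbourhood covariance: linearity (λ1 - λ2)/λ1, planarity (λ2 - λ3)/λ1 and scattering λ3/λ1. The sketch below is an assumption about what such features could look like, not necessarily the exact features we use; the function name `eigen_features` and the neighbourhood size `k` are hypothetical. Power line conductors should score high on linearity, roofs and ground on planarity, and vegetation on scattering.

```python
import numpy as np

def eigen_features(points: np.ndarray, k: int = 10) -> np.ndarray:
    """Per-point [linearity, planarity, scattering] from the covariance of the
    k nearest neighbours. points: (N, 3). Returns an (N, 3) array.

    Uses a brute-force O(N^2) neighbour search, so it is only suitable for
    small illustrative inputs.
    """
    diff = points[:, None, :] - points[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)       # pairwise squared distances
    knn = np.argsort(d2, axis=1)[:, :k]             # k nearest (includes the point itself)
    feats = np.zeros((len(points), 3))
    for i, idx in enumerate(knn):
        cov = np.cov(points[idx].T)                 # 3 x 3 neighbourhood covariance
        l3, l2, l1 = np.linalg.eigvalsh(cov)        # ascending order, so l1 is the largest
        l1 = max(l1, 1e-12)                         # guard against degenerate neighbourhoods
        feats[i] = [(l1 - l2) / l1, (l2 - l3) / l1, l3 / l1]
    return feats
```

The brute-force neighbour search is only for illustration; for scans with millions of points a KD-tree (for example `scipy.spatial.cKDTree`) would be needed.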

We are formulating a novel approach to generating realistic 3D environments by simulating LiDAR scans, initially targeted towards scenes
from forest environments involving trees, mountains, low vegetation, grasslands, etc. Generating such realistic scenes has applications in AR/VR, smart city
planning, asset management, distribution systems, etc. The major problem we are addressing is the generation of semantically coherent sets of points that
make up the large objects in a scene. Please watch this space for more updates on this work.
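To make the idea of simulating scans concrete, the sketch below casts rays from an airborne sensor sweeping across-track and keeps the nearest hit against a ground plane and spheres standing in for tree crowns. It is a toy example under stated assumptions (flat ground at z = 0, spherical crowns, optional Gaussian range noise), not our simulator; `simulate_scanline` and `ray_sphere` are hypothetical names. A useful property of simulation is that every returned point carries its semantic label for free.

```python
import numpy as np

def ray_sphere(origin, direction, center, radius):
    """Distance along a unit `direction` to the first hit with a sphere, or inf.
    Assumes the ray origin lies outside the sphere."""
    oc = origin - center
    b = np.dot(oc, direction)
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - c
    if disc < 0:
        return np.inf
    t = -b - np.sqrt(disc)
    return t if t > 0 else np.inf

def simulate_scanline(sensor_pos, angles_deg, trees, noise_std=0.0, seed=0):
    """Sweep rays across-track (in the x-z plane) from an airborne sensor.
    trees: list of (center, radius) spheres standing in for tree crowns.
    Returns (N, 3) hit points and a per-point semantic label."""
    rng = np.random.default_rng(seed)
    origin = np.asarray(sensor_pos, dtype=float)
    points, labels = [], []
    for a in np.deg2rad(angles_deg):
        d = np.array([np.sin(a), 0.0, -np.cos(a)])  # unit vector, 0 deg = straight down
        best_t, best_label = np.inf, None
        if d[2] < 0:  # hypothetical flat ground plane at z = 0
            best_t, best_label = -origin[2] / d[2], "ground"
        for center, radius in trees:
            t = ray_sphere(origin, d, np.asarray(center, dtype=float), radius)
            if t < best_t:
                best_t, best_label = t, "tree"
        if np.isfinite(best_t):
            t = best_t + rng.normal(0.0, noise_std)  # optional Gaussian range noise
            points.append(origin + t * d)
            labels.append(best_label)
    return np.array(points), labels
```

For example, `simulate_scanline([0, 0, 100], np.linspace(-30, 30, 61), trees=[((0, 0, 10), 5)])` returns crown hits near nadir and ground hits further out. A realistic simulator would also need to model beam divergence, multiple returns and the flight trajectory.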


A sample LiDAR scan of a street from the Semantic3D dataset

Leveraging the power of deep learning networks, we want to develop a descriptor that serves as a strong deep learning counterpart of the B-SHOT descriptor. Engineered keypoint descriptors have evolved over the last two decades, but in the deep learning era such descriptors are becoming obsolete for images.
Analogously, for 3D, with large annotated datasets such as Matterport, S3DIS or ScanNet, the research community has started exploring deep learning versions of existing 3D keypoint descriptors. Here, we are exploring unsupervised methods that learn keypoint detection and robust descriptor generation simultaneously.
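B-SHOT is a binary version of the SHOT descriptor, so its descriptors are compared with the Hamming distance, which is what makes matching fast. A learned descriptor that is binarised could be matched the same way. The sketch below shows generic mutual-nearest-neighbour matching of binary descriptors; it does not reproduce the B-SHOT encoding itself, and the function names are hypothetical.

```python
import numpy as np

def hamming_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """a: (N, B), b: (M, B) boolean descriptors -> (N, M) Hamming distances."""
    return (a[:, None, :] != b[None, :, :]).sum(axis=2)

def mutual_nn_matches(a: np.ndarray, b: np.ndarray) -> list:
    """Keep pairs (i, j) where j is i's nearest neighbour in b and vice versa."""
    d = hamming_matrix(a, b)
    a_to_b = d.argmin(axis=1)
    b_to_a = d.argmin(axis=0)
    return [(i, int(j)) for i, j in enumerate(a_to_b) if b_to_a[j] == i]
```

Production code would usually pack the bits into machine words and count differing bits with a popcount instead of comparing boolean arrays element by element.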


Indoor point clouds from the ScanNet dataset, captured with two different sensors

SplineNet: B-spline neural network for efficient classification of 3D data



The majority of recent deep learning pipelines for 3D shapes use a volumetric representation, extending the concept of 2D convolution to the 3D domain. However, the volumetric representation is computationally expensive: most voxels in the grid are empty, which results in redundant computation. Moreover, a 3D shape is determined by its surface, so performing convolutions on the voxels inside the shape wastes computation. In this paper, we focus on constructing a novel, fast and robust characterization of 3D shapes that accounts for local geometric variations as well as global structure. We build upon the learning scheme of the Field Probing Neural Network [FPNN] by introducing sets of B-spline surfaces instead of point filters, in order to sense complex geometric structures (large curvature variations). The locations of these surfaces are initialized over the voxel space and are learned during the training phase. We propose SplineNet, a deep network consisting of B-spline surfaces for classification of input 3D data represented in a volumetric grid. We derive analytical solutions for the updates of the B-spline surfaces during backpropagation.
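The following sketch illustrates why the updates of a B-spline surface can be derived analytically. It assumes a uniform bicubic patch with a 4 × 4 grid of control points P_ij, so that a surface point is S(u, v) = Σ_ij N_i(u) N_j(v) P_ij and ∂S/∂P_ij = N_i(u) N_j(v). By the chain rule, the gradient of a response F = Σ f(S(u, v)) over sampled (u, v) is ∂F/∂P_ij = Σ ∇f(S) N_i(u) N_j(v). The patch degree, the smooth analytic `field` standing in for the voxelized input, and all function names are assumptions for illustration, not the SplineNet implementation.

```python
import numpy as np

def cubic_bspline_basis(t: float) -> np.ndarray:
    """Uniform cubic B-spline basis values N0..N3 at t in [0, 1] (they sum to 1)."""
    return np.array([
        (1 - t) ** 3 / 6.0,
        (3 * t ** 3 - 6 * t ** 2 + 4) / 6.0,
        (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6.0,
        t ** 3 / 6.0,
    ])

def bspline_patch(ctrl: np.ndarray, u: float, v: float):
    """Evaluate a bicubic patch with control points ctrl (4, 4, 3) at (u, v).
    Returns the surface point S and weights W, where dS/dP_ij = W[i, j] * I."""
    W = np.outer(cubic_bspline_basis(u), cubic_bspline_basis(v))  # (4, 4)
    S = np.einsum("ij,ijk->k", W, ctrl)
    return S, W

def patch_response_and_grad(ctrl, field, field_grad, n=8):
    """Sum of a scalar field sampled on an n x n (u, v) grid of the patch,
    together with its analytical gradient w.r.t. every control point."""
    total = 0.0
    grad = np.zeros_like(ctrl)
    for u in np.linspace(0.0, 1.0, n):
        for v in np.linspace(0.0, 1.0, n):
            S, W = bspline_patch(ctrl, u, v)
            total += field(S)
            # chain rule: dF/dP_ij += grad f(S) * N_i(u) * N_j(v)
            grad += W[:, :, None] * field_grad(S)[None, None, :]
    return total, grad
```

Comparing the analytical gradient with a finite difference on one control point is a quick sanity check of such a derivation.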


The B-spline surfaces are initialized randomly over the voxel space and their locations are learned during training, so that together they capture local information as well as global geometry. We also modify the dot product layer of [FPNN] to aggregate the local sensing and provide a global characterization of the input data.


Figure: Overview of our SplineNet architecture. Input shapes represented as volumetric fields are fed to
the SplinePatch layer for effective local sensing, whose output is then optionally passed to a Gaussian layer to
retain values near surface boundaries. The LocalAggregation layer accumulates the local sensing to give a
local-geometry-aware global characterization of the input shapes. The resulting characterization is
fed to fully connected (FC) layers, from which the class label is predicted.
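A minimal forward pass following the layer order in the figure might look like the sketch below. Everything about it is an assumption for illustration: the probe points on each patch are given as a (K, M, 3) array in voxel coordinates (in SplineNet they would come from the B-spline surfaces), the field is sampled by nearest-voxel lookup rather than interpolation, the Gaussian layer is taken to be exp(-x²/2σ²) applied to a distance field, LocalAggregation is modelled as a per-patch dot product with learned weights, and a single FC layer stands in for the FC stack.

```python
import numpy as np

def sample_field(volume: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Nearest-voxel lookup of a (D, D, D) field at points (..., 3) in voxel coordinates."""
    D = volume.shape[0]
    idx = np.clip(np.rint(pts).astype(int), 0, D - 1)
    return volume[idx[..., 0], idx[..., 1], idx[..., 2]]

def splinenet_forward(volume, patch_pts, agg_w, fc_w, fc_b, sigma=1.0, use_gaussian=True):
    """volume: (D, D, D) distance field; patch_pts: (K, M, 3) probe points on K patches;
    agg_w: (K, M) aggregation weights; fc_w: (C, K); fc_b: (C,). Returns (C,) logits."""
    sensed = sample_field(volume, patch_pts)                 # SplinePatch sensing, (K, M)
    if use_gaussian:
        sensed = np.exp(-sensed ** 2 / (2.0 * sigma ** 2))   # peaks where distance ~ 0
    local = np.einsum("km,km->k", sensed, agg_w)             # LocalAggregation, (K,)
    return fc_w @ local + fc_b                               # one FC layer -> class logits
```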


  • We proposed SplineNet, a deep network consisting of B-spline surfaces for classification of input 3D data represented in volumetric fields. To the best of our knowledge, parametric curves and surfaces have not previously been used in a deep neural network learning setup for classification applications.
  • We derived analytical solutions for updates of B-spline surfaces during back propagation.

Related Publication:

  • Sai Sagar Jinka, Avinash Sharma - SplineNet: B-spline neural network for efficient classification of 3D data, Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP2018). [pdf]

3D Computer Vision


We are interested in computer vision and machine learning, with a focus on 3D scene understanding, reconstruction, etc. In particular, we deal with problems such as reconstructing the human body from 2D images and analysing it in 3D, and registering point clouds of indoor scenes captured with commodity sensors as well as of large outdoor scenes captured with LiDAR scanners. As 3D data is represented in various formats, such as meshes, point clouds and volumetric representations, we also design novel algorithms that make 3D data compatible with deep learning frameworks. We investigate how complex prior knowledge can be incorporated into computer vision algorithms to make them robust to the variations of our complex 3D world. The links below give detailed descriptions of our individual works.

Human Body Capture

Surface Reconstruction

People Involved: Abbhinav Venkat, Chaitanya Patel, Yudhik Agrawal, Avinash Sharma

Recovering a 3D human body shape from a monocular image is an ill-posed problem in computer vision with great practical importance for many applications, including virtual and augmented reality platforms, animation industry, e-commerce domain, etc. [See More]

PeeledHuman: Robust Shape Representation for Textured 3D Human Body Reconstruction

People Involved: Sai Sagar Jinka*, Rohan Chacko*, Avinash Sharma, P.J. Narayanan

We introduce PeeledHuman, a novel shape representation of the human body that is robust to self-occlusions. PeeledHuman encodes the human body as a set of Peeled Depth and RGB maps in 2D, obtained by performing ray tracing on the 3D body model and extending each ray beyond its first intersection. This formulation allows us to handle self-occlusions more efficiently than other representations. Given a monocular RGB image, we learn these Peeled maps in an end-to-end generative adversarial fashion using our novel framework, PeelGAN. We train PeelGAN using a 3D Chamfer loss and other 2D losses to generate multiple depth values per pixel and a corresponding RGB field per vertex in a dual-branch setup. In our simple non-parametric solution, the generated Peeled Depth maps are back-projected to 3D space to obtain a complete textured 3D shape, and the corresponding RGB maps provide vertex-level texture details. We compare our method with current parametric and non-parametric methods for 3D reconstruction and find that we achieve state-of-the-art results. We demonstrate the effectiveness of our representation on the publicly available BUFF and MonoPerfCap datasets, as well as on loose clothing data collected with our calibrated multi-Kinect setup. [See More]
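The back-projection step described above can be sketched as follows, assuming a pinhole camera with intrinsics fx, fy, cx, cy and peeled depth maps stored as an (L, H, W) array in which 0 marks pixels whose ray has no hit at that layer. A symmetric Chamfer distance between point sets, of the kind used as a 3D loss, is included as well. Both are illustrative sketches with hypothetical names, not the PeelGAN code.

```python
import numpy as np

def backproject_peeled(depth_layers, fx, fy, cx, cy):
    """depth_layers: (L, H, W) peeled depth maps, 0 where a ray has no hit at that layer.
    Returns an (N, 3) array of camera-space points under a pinhole camera model."""
    _, H, W = depth_layers.shape
    v, u = np.mgrid[0:H, 0:W]                 # pixel row (v) and column (u) indices
    pts = []
    for layer in depth_layers:                # one layer per ray-surface intersection
        valid = layer > 0
        z = layer[valid]
        x = (u[valid] - cx) * z / fx
        y = (v[valid] - cy) * z / fy
        pts.append(np.stack([x, y, z], axis=1))
    return np.concatenate(pts, axis=0)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance, both ways."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

The all-pairs distance matrix in `chamfer_distance` is quadratic in memory, so for full body scans a batched or KD-tree based nearest-neighbour search would be preferable.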

Volumetric Reconstruction

People Involved: Dr. Avinash Sharma, Sai Sagar J, Abbhinav Venkat, Chaitanya Patal, Anubhab Sen, Saurabh Rajguru, Neeraj Bhattan, Yudhik Agrawal, Himansh Sheron

Since humans are often the subject of photographs, detecting them and analyzing their shape and pose in 3D is important for applications such as AR/VR and motion capture. Our research objective in this work is to recover the 3D human body from a single RGB image. [See More]


Classification of 3D data

People Involved: Dr. Avinash Sharma, Sai Sagar J and Raj Manvar

Apart from acquiring 3D shapes, analyzing them is also an important task. Shape analysis tasks include retrieval, classification, dense and sparse correspondence between two shapes, segmentation, etc. At the core of 3D shape analysis lie 3D shape descriptors. Descriptor construction is usually application dependent: one expects a descriptor to be discriminative, invariant to certain transformations or noise, and compact, i.e. low dimensional. Recently, a few deep learning models for shape descriptors have appeared. Nevertheless, many existing deep learning works do not exploit the fact that 3D shapes are boundary based, i.e. much of the information essential for classification is concentrated on the boundary, and performing convolutions over the whole volumetric space hinders computational efficiency. In the following work, we address these drawbacks and construct a novel descriptor that provides a local-geometry-aware global characterization of rigid 3D shapes. [See More]


3D Action Retrieval

People Involved: Neeraj Battan, Abbhinav Venkat, Avinash Sharma

3D human action indexing and retrieval is an interesting problem due to the rise of several data-driven applications that analyze and/or reuse 3D human skeletal data, such as data-driven animation, analysis of sports biomechanics, human surveillance, etc. [See More]


Point Cloud Analysis

People Involved: Dr. Avinash Sharma, Ashish Kubade and Nikhilendra

Working on point clouds involves tackling inherent challenges such as sparsity, the lack of order among points, and invariance to viewpoint. For LiDAR scans, we additionally have to handle the scale of the data, as a typical outdoor LiDAR scan has points on the order of millions. Breaking such a scene into smaller chunks would be a natural first approach; however, working on such small chunks ultimately results in a loss of the global semantics of the scene. [See More]
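The chunking strategy mentioned above can be sketched as splitting the scan into overlapping square tiles in the xy plane. The block size and stride below are hypothetical values in the units of the scan, and `tile_xy` is an illustrative name.

```python
import numpy as np

def tile_xy(points: np.ndarray, block: float = 50.0, stride: float = 25.0) -> list:
    """Split an (N, 3) point cloud into overlapping square xy tiles.
    Returns a list of index arrays, one per non-empty tile."""
    xy = points[:, :2]
    lo = xy.min(axis=0)
    extent = xy.max(axis=0) - lo
    # number of tile origins per axis so that the tiles cover the full extent
    n = np.ceil(np.maximum(extent - block, 0.0) / stride).astype(int) + 1
    tiles = []
    for i in range(n[0]):
        for j in range(n[1]):
            x0, y0 = lo + np.array([i, j]) * stride
            mask = ((xy[:, 0] >= x0) & (xy[:, 0] <= x0 + block)
                    & (xy[:, 1] >= y0) & (xy[:, 1] <= y0 + block))
            if mask.any():
                tiles.append(np.flatnonzero(mask))
    return tiles
```

Each tile can then be processed independently, which keeps memory bounded, but a network only ever sees one tile at a time. An object that spans many tiles, such as a long stretch of power line, is split across them, which is the loss of global semantics described above.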


LiDAR scan of a valley


3D Human body estimation

Reconstructing a textured 3D model of a human from images is a challenging task, as the geometry of non-rigid human shapes evolves over time, yielding a large space of complex body poses as well as shape variations. In addition, there are several other challenges, such as self-occlusions by body parts, obstructions due to free-form clothing, background clutter (in a non-studio setup), a sparse set of cameras with non-overlapping fields of view, and sensor noise, as shown in the figure below.

Challenges in non-rigid reconstruction. (a) Complex poses (b) Clothing obstructions (c) Shape variations (d) Background clutter

Deep Textured 3D reconstruction of human bodies


Recovering textured 3D models of non-rigid human body shapes is challenging due to self-occlusions caused by complex body poses and shapes, clothing obstructions, lack of surface texture, background clutter, a sparse set of cameras with non-overlapping fields of view, etc. Further, a calibration-free environment adds complexity to both reconstruction and texture recovery. In this paper, we propose a deep learning based solution for textured 3D reconstruction of human body shapes from a single view RGB image. This is achieved by first recovering the volumetric grid of the non-rigid human body from a single view RGB image, followed by orthographic texture view synthesis using the respective depth projections of the reconstructed (volumetric) shape and the input RGB image. We propose to co-learn the depth information readily available from affordable RGBD sensors (e.g., Kinect) while showing multiple views of the same object during the training phase. We show superior reconstruction performance, both quantitatively and qualitatively, on publicly available datasets (by simulating the depth channel with a virtual Kinect) as well as on real RGBD data collected with our calibrated multi-Kinect setup.


In this work, we propose a deep learning based solution for textured 3D reconstruction of human body shapes given an input RGB image, in a calibration-free environment. Given a single view RGB image, both reconstruction and texture generation are ill-posed problems. We therefore propose to co-learn depth cues (using depth images obtained from affordable sensors like Kinect) with RGB images while training the network. This helps the network learn the space of complex body poses, which is otherwise difficult with just the 2D content of RGB images. Although we train the reconstruction network with multi-view RGB and depth images (shown one at a time during training), co-learning them with shared filters enables us to recover 3D volumetric shapes from just a single RGB image at test time. Apart from the challenge of non-rigid poses, the depth information also helps address the challenges caused by cluttered backgrounds, shape variations and free-form clothing. Our texture recovery network uses a variational autoencoder to generate orthographic texture images of the reconstructed body models, which are subsequently back-projected to recover a textured 3D mesh model.


Proposed end-to-end deep learning pipeline for reconstructing textured non-rigid 3D human body models. Using a single view perspective RGB image (a), we perform a voxelized 3D reconstruction (c) using the reconstruction network (b). Then, to add texture to the generated 3D model, we first convert the voxels to a mesh representation using the Poisson surface reconstruction algorithm [9] and capture its four orthographic depth maps (d). These are fed as input to the texture recovery network (f), along with the perspective RGB views used for reconstruction (a). The texture recovery network produces orthographic RGB images (g), which are back-projected onto the reconstructed model to obtain the textured 3D model (h).
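The step of capturing four orthographic depth maps can be approximated directly on the voxel grid. The sketch below assumes a boolean occupancy grid indexed (x, y, z), records for each pixel the index of the first occupied voxel along the viewing axis, and marks empty pixels with -1. This is a simplification: the pipeline renders the depth maps from the Poisson-reconstructed mesh, and the view names and axis conventions here are hypothetical.

```python
import numpy as np

def first_hit(occ: np.ndarray, axis: int) -> np.ndarray:
    """Index of the first occupied voxel along `axis`, or -1 where the ray misses."""
    hit = occ.any(axis=axis)
    depth = np.argmax(occ, axis=axis).astype(float)  # argmax of booleans = first True
    depth[~hit] = -1.0
    return depth

def orthographic_depth_maps(occ: np.ndarray) -> dict:
    """Four orthographic depth maps (in voxel units) of an (X, Y, Z) occupancy grid."""
    return {
        "front": first_hit(occ, axis=2),                   # looking along +z, shape (X, Y)
        "back": first_hit(np.flip(occ, axis=2), axis=2),   # looking along -z, shape (X, Y)
        "left": first_hit(occ, axis=0),                    # looking along +x, shape (Y, Z)
        "right": first_hit(np.flip(occ, axis=0), axis=0),  # looking along -x, shape (Y, Z)
    }
```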


  • We introduce a novel deep learning pipeline to obtain textured 3D models of non-rigid human body shapes from a single image. To the best of our knowledge, reconstructing non-rigid shapes in a volumetric form (whose advantages we demonstrate) has not yet been attempted in the literature. Further, this is an initial effort towards single view non-rigid reconstruction and texture recovery in an end-to-end manner.
  • We demonstrate the importance of depth cues (used only at training time) for the task of non-rigid reconstruction. This is achieved by our novel training methodology of alternating RGB and depth (D) inputs in order to capture the large space of pose and shape deformations.
  • We show that our model can partially handle non-rigid deformations induced by free-form clothing, as we do not impose any model constraints while training the volumetric reconstruction network.
  • We propose to use depth cues for texture recovery in a variational autoencoder setup. This is the first attempt to do so in the texture synthesis literature.

Related Publication:

  • Abhinav Venkat, Sai Sagar Jinka, Avinash Sharma - Deep Textured 3D reconstruction of human bodies, British Machine Vision Conference (BMVC2018). [pdf]

Bringing Semantics in Word Image Representation


In the domain of document images, the need for a holistic word-level representation, popularly used for the task of word spotting, has been appreciated for decades. However, past representations respect only the word form, while completely ignoring its meaning. In this work, we attempt to bridge this gap by encoding the notion of semantics through two novel forms of word image representation. The first form learns an inflection-invariant representation, thereby focusing on the root of the word, while the second is built along the lines of recent textual word embedding techniques such as Word2Vec and has a much broader scope for semantic relationships. We observe that such representations are useful for traditional word spotting and also enrich the search results by accounting for the semantic nature of the task.




  • We propose a normalized word representation that is invariant to word form inflections. The representation achieves state-of-the-art performance compared to a conventional verbatim-based representation.
  • We introduce a novel semantic representation for word images that respects both form and meaning, thereby reducing the vocabulary gap between a query and its retrieved results.
  • We demonstrate a semantic word spotting task and evaluate the proposed representation on standard IR measures such as word analogy and word similarity.
  • The proposed representation is successfully demonstrated on both historical and modern document collections in printed and handwritten domains across Latin and Indic scripts.
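Word similarity and word analogy are usually evaluated with cosine similarity and vector arithmetic on the embeddings: the analogy a : b :: c : ? is answered by the word whose vector is closest to b - a + c. The sketch below shows this evaluation on toy two-dimensional vectors; in our setting the embeddings would be predicted from word images, and the vectors and function names here are hypothetical.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(emb: dict, a: str, b: str, c: str) -> str:
    """Answer a : b :: c : ? with the word maximising cos(w, b - a + c),
    excluding the three query words themselves."""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))
```

With toy vectors where "woman" - "man" points along the second axis, `analogy(emb, "man", "woman", "king")` picks the vector of "king" shifted along that same axis.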

Code and Dataset:

Will be uploaded soon.
Please contact us by email.

Qualitative Results


Related Papers