A Rich 3D Dataset of Scanned Humans





The SHARP-3DHumans dataset provides around 250 meshes of people with diverse body shapes, wearing garments of various styles and sizes. We cover a wide variety of clothing styles, ranging from loose robed clothing, like the saree (a typical South-Asian dress), to relatively tight-fitting clothing, like shirts and trousers. The dataset consists of around 150 male and 50 female unique subjects. In total, there are around 180 male scans and around 70 female scans. In terms of regional diversity, we capture, for the first time, body shape, appearance and clothing styles for the South-Asian population. Each 3D mesh also comes with a pre-registered SMPL body.



The dataset was collected using the Artec3D Eva handheld structured-light scanner. The scanner has a 3D point accuracy of up to 0.1 mm and a 3D resolution of 0.5 mm, enabling the capture of high-frequency geometric details along with high-resolution texture maps. The subjects were scanned in a studio environment with controlled lighting and uniform illumination.

Technical Paper

[To be due...]


Download Video

Please click here to download

Download Sample

Please click here to download

Classroom Slide Narration System

Jobin K.V., Ajoy Mondal, and C.V. Jawahar

IIIT Hyderabad       {jobin.kv@research., ajoy.mondal@, and jawahar@}iiit.ac.in

[ Code ]   | [ Demo Video ]   | [ Dataset ]


The architecture of the proposed CSSNet for classroom slide segmentation. The network consists of three modules: (i) an attention module (upper dotted region), (ii) a multi-scale feature extraction module (lower region), and (iii) a feature concatenation module.


Slide presentations are an effective and efficient tool used by the teaching community for classroom communication. However, this teaching model can be challenging for blind and visually impaired (VI) students, who need personal human assistance to understand the presented slides. This shortcoming motivates us to design a Classroom Slide Narration System (CSNS) that generates audio descriptions corresponding to the slide content. We pose this problem as an image-to-markup language generation task. The initial step is to extract logical regions such as title, text, equation, figure, and table from the slide image. In classroom slide images, the logical regions are distributed according to their location in the image. To exploit the location of the logical regions for slide image segmentation, we propose the Classroom Slide Segmentation Network (CSSN), whose unique attributes set it apart from most other semantic segmentation networks. Publicly available benchmark datasets such as WiSe and SPaSe are used to validate the performance of our segmentation architecture. We obtain a 9.54% improvement in segmentation accuracy on the WiSe dataset. We extract content (information) from the slide using four well-established modules: optical character recognition (OCR), figure classification, equation description, and table structure recognition. With this information, we build the CSNS to help VI students understand the slide content. Users rated the quality of the output of the proposed CSNS higher than that of existing systems such as Facebook's Automatic Alt-Text (AAT) and Tesseract.
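To make the flow from segmented regions to narration concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Region` structure, the four handler functions, and the top-to-bottom reading order are all assumptions made for illustration, and each handler is a stub standing in for a real OCR, figure, equation, or table module.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """One logical region from slide segmentation (hypothetical structure)."""
    kind: str     # one of: title, text, equation, figure, table
    content: str  # placeholder for the cropped image region
    top: int      # vertical position, used only to order the narration


# Hypothetical stand-ins for the four content modules named in the text.
def run_ocr(region: Region) -> str:
    return region.content


def classify_figure(region: Region) -> str:
    return f"a figure showing {region.content}"


def describe_equation(region: Region) -> str:
    return f"an equation: {region.content}"


def recognize_table(region: Region) -> str:
    return f"a table with {region.content}"


# Each region type is routed to the module that can describe it.
HANDLERS: Dict[str, Callable[[Region], str]] = {
    "title": run_ocr,
    "text": run_ocr,
    "equation": describe_equation,
    "figure": classify_figure,
    "table": recognize_table,
}


def narrate(regions: List[Region]) -> str:
    """Build a narration string by visiting regions from top to bottom."""
    parts = []
    for region in sorted(regions, key=lambda r: r.top):
        description = HANDLERS[region.kind](region)
        prefix = "Title: " if region.kind == "title" else ""
        parts.append(prefix + description)
    return ". ".join(parts) + "."


slide = [
    Region("text", "Gradient descent updates weights iteratively", top=120),
    Region("title", "Optimization Basics", top=10),
    Region("equation", "w equals w minus eta times gradient", top=200),
]
narration = narrate(slide)
print(narration)
```

In a real system the resulting string would be passed to a text-to-speech engine; that step is omitted here.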


  • Paper
    Classroom Slide Narration System

    Jobin K.V., Ajoy Mondal, and Jawahar C.V.
    Classroom Slide Narration System , CVIP, 2021.
    [PDF]

    Updated Soon



  1. Jobin K.V.
  2. Ajoy Mondal
  3. Jawahar C.V.

Audio-Visual Speech Super-Resolution

Rudrabha Mukhopadhyay*, Sindhu B Hegde* , Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

BMVC, 2021 (Oral)

[ Code ]   | [ Demo Video ]


We present an audio-visual model for super-resolving very low-resolution speech inputs (for example, 1 kHz) at large scale factors. In contrast to existing audio-only speech super-resolution approaches, our method benefits from a visual stream: either the real visual stream (if available) or a visual stream generated by our pseudo-visual network.


In this paper, we present an audio-visual model to perform speech super-resolution at large scale factors (8x and 16x). Previous works attempted to solve this problem using only the audio modality as input and were thus limited to low scale factors of 2x and 4x. In contrast, we propose to incorporate both visual and auditory signals to super-resolve speech with sampling rates as low as 1 kHz. In such challenging situations, the visual features assist in learning the content and improve the quality of the generated speech. Further, we demonstrate the applicability of our approach to arbitrary speech signals where the visual stream is not accessible. Our "pseudo-visual network" synthesizes the visual stream solely from the low-resolution speech input. Extensive experiments and the demo video illustrate our method's results and its benefits over state-of-the-art audio-only speech super-resolution approaches.
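To see why large scale factors are hard for audio-only methods, it helps to work through the sampling arithmetic. The sketch below is not the paper's model; it is a toy illustration with a deliberately naive linear-interpolation baseline, and the 3 kHz test tone is an arbitrary choice. A 1 kHz signal can only represent frequencies below 500 Hz (the Nyquist limit), so anything above that is absent from the input and must be inferred from other cues.

```python
import math


def decimate(signal, factor):
    """Keep every `factor`-th sample (no anti-alias filter; illustration only)."""
    return signal[::factor]


def upsample_linear(signal, factor):
    """Naive baseline: linear interpolation back to `factor` times the rate."""
    out = []
    for i in range(len(signal) - 1):
        a, b = signal[i], signal[i + 1]
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(signal[-1])
    return out


low_rate_hz = 1000  # lowest input rate mentioned in the text
target_rates = {scale: low_rate_hz * scale for scale in (8, 16)}
print(target_rates)  # output rates for the 8x and 16x settings

scale = 16
high_rate_hz = target_rates[scale]

# A 3 kHz tone sampled at the high rate; it lies far above the 500 Hz
# Nyquist limit of a 1 kHz signal.
tone = [math.sin(2 * math.pi * 3000 * n / high_rate_hz) for n in range(1600)]

low = decimate(tone, scale)             # simulated low-resolution input
restored = upsample_linear(low, scale)  # naive reconstruction

# Mean squared error between the original tone and the naive reconstruction.
mse = sum((t - r) ** 2 for t, r in zip(tone, restored)) / len(restored)
print(len(low), len(restored), round(mse, 3))
```

In this example every retained sample falls on a zero crossing of the tone, so the low-rate signal is numerically silent and interpolation cannot bring the tone back. This is the kind of lost content that additional cues, such as the visual stream, are meant to help recover.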


  • Paper
    Audio-Visual Speech Super-Resolution

    Rudrabha Mukhopadhyay*, Sindhu B Hegde*, Vinay Namboodiri and C.V. Jawahar
    Audio-Visual Speech Super-Resolution, BMVC, 2021 (Oral).
    [PDF]

    Updated Soon




  1. Rudrabha Mukhopadhyay
  2. Sindhu Hegde

Handwritten Text Retrieval from Unlabeled Collections

Santhoshini Gongidi and C.V. Jawahar

[ Paper ]   | [ Demo ]


Handwritten documents from communities like cultural heritage, judiciary, and modern journals remain largely unexplored even today. To a great extent, this is due to the lack of retrieval tools for such unlabeled document collections. In this work, we consider such collections and present a simple, robust retrieval framework for easy information access. We achieve retrieval on unlabeled novel collections through invariant features learnt for handwritten text. These feature representations enable zero-shot retrieval for novel queries on unexplored collections. We improve the framework further by supporting search via text and exemplar queries. Four new collections written in English, Malayalam, and Bengali are used to evaluate our text retrieval framework. These collections comprise 2957 handwritten pages and over 300K words. We report promising results on these collections, despite the zero-shot constraint and huge collection size. Our framework allows the addition of new collections without any need for specific finetuning or labeling. Finally, we also present a demonstration of the retrieval framework.
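Feature-based retrieval of this kind is commonly implemented as a nearest-neighbour search over embedding vectors. The sketch below assumes that word images and queries (text or exemplar) have already been mapped into a common vector space by some encoder; the source does not spell out this detail, and the three-dimensional vectors and identifiers such as `page1_word3` are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_vec, index, top_k=2):
    """Rank indexed word images by similarity to the query embedding."""
    scored = [(cosine(query_vec, vec), word_id) for word_id, vec in index.items()]
    scored.sort(reverse=True)
    return [word_id for _, word_id in scored[:top_k]]


# Toy index: word-image identifier -> embedding (made-up numbers).
index = {
    "page1_word3": [0.9, 0.1, 0.0],
    "page2_word7": [0.1, 0.8, 0.3],
    "page5_word1": [0.85, 0.2, 0.05],
}

# Embedding of a query, produced by the same hypothetical encoder.
query = [1.0, 0.1, 0.0]
results = retrieve(query, index)
print(results)
```

Because ranking depends only on the embeddings, adding a new collection under this scheme means encoding its word images and appending them to the index, with no retraining step.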

Demo link: HW-Search

Teaser Video:

Related Publications

Santhoshini Gongidi, C V Jawahar, Handwritten Text Retrieval from Unlabeled Collections, CVIP 2021


For any queries about the work, please contact the authors below

  1. Santhoshini Gongidi

MeronymNet: A Hierarchical Model for Unified and Controllable Multi-Category Object Generation

Rishabh Baghel, Abhishek Trivedi, Tejas Ravichandran, and Ravi Kiran Sarvadevabhatla

[Project Page Link]   [Paper]   [ GitHub]



  • We introduce MeronymNet, a novel hierarchical approach for controllable, part-based generation of multi-category objects using a single unified model.
  • We adopt a guided coarse-to-fine strategy involving semantically conditioned generation of bounding box layouts, pixel-level part layouts and ultimately, the object depictions themselves.
  • We use Graph Convolutional Networks, Deep Recurrent Networks along with custom-designed Conditional Variational Autoencoders to enable flexible, diverse and category-aware generation of 2-D objects in a controlled manner.
  • The performance scores for generated objects reflect MeronymNet's superior performance compared to multiple strong baselines and ablative variants.
  • We also showcase MeronymNet's suitability for controllable object generation and interactive object editing at various levels of structural and semantic granularity.
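The coarse-to-fine strategy listed above can be pictured as three stages chained together: part bounding boxes, then a pixel-level part label mask, then the rendered object. The sketch below only illustrates that data flow. Every function is a hypothetical stub with a fixed toy rule, not one of MeronymNet's learned networks, and the box convention `(x0, y0, x1, y1)` is an assumption.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), hypothetical convention


def generate_boxes(category: str, parts: List[str]) -> Dict[str, Box]:
    """Stage 1 stub: one bounding box per requested part (fixed toy layout)."""
    return {p: (2 * i, 2 * i, 2 * i + 4, 2 * i + 4) for i, p in enumerate(parts)}


def boxes_to_label_mask(boxes: Dict[str, Box], size: int) -> List[List[int]]:
    """Stage 2 stub: fill each box with its part index (0 is background)."""
    mask = [[0] * size for _ in range(size)]
    for label, (x0, y0, x1, y1) in enumerate(boxes.values(), start=1):
        for y in range(y0, min(y1, size)):
            for x in range(x0, min(x1, size)):
                mask[y][x] = label  # later parts overwrite earlier ones
    return mask


def render(mask: List[List[int]]) -> List[List[int]]:
    """Stage 3 stub: map each label to a gray level standing in for RGB."""
    return [[40 * value for value in row] for row in mask]


parts = ["head", "torso", "tail"]
boxes = generate_boxes("cat", parts)
mask = boxes_to_label_mask(boxes, size=8)
image = render(mask)

for row in mask:
    print("".join(str(v) for v in row))
```

Editing an object at the part level then amounts to changing `parts` or an entry in `boxes` and re-running the later stages.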


Sample generations by MeronymNet. For each sample, the generated bounding boxes, the corresponding label mask and the RGB object are shown. Notice the diversity in the number of parts, appearance and viewpoint among the generated objects.

Application Scenario: Interactive Modification

Our model gives users part-level control, which they can exercise using either boxes or masks. Notice that the viewpoint for rendering the object has changed from the initial generation to accommodate the updated part list. This scenario demonstrates MeronymNet's holistic, part-based awareness of the rendering viewpoints best suited to various part sets.


We use the large-scale part-segmented object dataset, PASCAL Parts. The plot shows the density distribution of part counts in object instances for each category. The varying range and frequency of part occurrences across categories, combined with the requirement of generating objects from a single unified model, pose significant challenges.


  1. If you have any questions, please contact Dr. Ravi Kiran Sarvadevabhatla.