IIIT-INDIC-HW-WORDS: A Dataset for Indic Handwritten Text Recognition


Handwritten text recognition (HTR) for Indian languages is not yet a well-studied problem, primarily due to the unavailability of large annotated datasets in the associated scripts. Existing datasets are small in size and use small lexicons, and are therefore insufficient for building robust HTR solutions with modern machine learning techniques. In this work, we introduce a large-scale handwritten dataset for Indic scripts, referred to as the IIIT-INDIC-HW-WORDS dataset. The dataset consists of 872K handwritten word instances written by 135 writers in 8 Indic scripts. Together with our earlier datasets IIIT-HW-DEV and IIIT-HW-TELUGU, in Devanagari and Telugu respectively, IIIT-INDIC-HW-WORDS provides annotated handwritten word instances in all 10 prominent Indic scripts.


We further establish a strong baseline for text recognition in the eight Indic scripts. Our recognition scheme follows contemporary design principles from the recognition literature and yields competitive results on English. We further (i) study the reasons for variations in HTR performance across scripts, and (ii) explore the utility of pre-training for Indic HTR. We hope our efforts will catalyze research and fuel applications related to handwritten document understanding in Indic scripts.

Dataset samples




 Language  Download link 
Bengali Overview Link
Gujarati Overview Link
Gurumukhi Overview Link
Kannada Overview Link
Odia Overview Link
Malayalam Overview Link
Tamil Overview Link
Urdu Overview Link


Related Publications

Santhoshini Gongidi, C V Jawahar, IIIT-INDIC-HW-WORDS: A Dataset for Indic Handwritten Text Recognition, International Conference on Document Analysis and Recognition (ICDAR), 2021


For any queries about the dataset, please contact the authors below:

Santhoshini Gongidi - (email protected)



Scene Text Recognition in Indian Scripts


This work addresses the problem of scene text recognition in Indian scripts. As a first step, we benchmark scene text recognition for three Indian scripts - Devanagari, Telugu and Malayalam - using a CRNN model. To this end, we release a new dataset for the three scripts, comprising nearly 1000 images in the wild, suitable for scene text detection and recognition tasks. We train the recognition model using synthetic word images. This data is also made publicly available.


Natural scene images with text in Indic scripts: Telugu, Malayalam and Devanagari, in clockwise order



Sample word images from the dataset of synthetic word images

Related Publications

If you use the dataset of real scene text images for the three Indic scripts, or the synthetic dataset comprising font-rendered word images, please consider citing our paper titled Benchmarking Scene Text Recognition in Devanagari, Telugu and Malayalam.

title={Benchmarking Scene Text Recognition in Devanagari, Telugu and Malayalam},
author={M. {Mathew} and M. {Jain} and C. V. {Jawahar}},
booktitle={ICDAR MOCR Workshop},
year={2017},

Dataset Download


IIIT-ILST contains nearly 1000 real images per script, annotated with scene text bounding boxes and transcriptions.

Terms and Conditions: All images provided as part of the IIIT-ILST dataset have been collected from freely accessible internet sites. As such, they are copyrighted by their respective authors. The images are hosted on this site, physically located in India. The images are provided with the sole purpose of being used for research in developing new models for scene text recognition. You are not allowed to redistribute any images in this dataset. By downloading any of the files below, you explicitly state that your final purpose is to perform research under the conditions mentioned above, and you agree to comply with these terms and conditions.

Onedrive Link

Synthetic Word Images Dataset

We rendered millions of word images synthetically to train scene text recognition models for the three scripts.


The synthetic images dataset provided below is different from the one originally used in our ICDAR 2017 MOCR workshop paper. The synthetic dataset used in the original work had some images with junk characters, resulting from incorrect rendering of certain Unicode points with some fonts. Hence we re-generated the synthetic word images; the set provided below is this new set.


OneDrive Link


CC BY 4.0

Synthetic Scene Text Word Images Rendering - Code

The code we used to render synthetic word images for Indian languages is available here. The repo includes a collection of Unicode fonts for many Indian scripts, which you might find useful for generating synthetic word images in Indian scripts and Arabic.

Visual Speech Enhancement Without a Real Visual Stream

Sindhu Hegde*,  Prajwal Renukanand*,  Rudrabha Mukhopadhyay*,  Vinay Namboodiri,  C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

[Code]   [Paper]   [Demo Video]   [Test Sets]

WACV, 2021

We propose a novel approach to enhance speech by hallucinating the visual stream for any given noisy audio. In contrast to existing audio-visual methods, our approach works even in the absence of a reliable visual stream, while also performing better than audio-only methods in unconstrained conditions due to the assistance of generated lip movements.


In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods. But these methods cannot be used for several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach comes close (<3% difference) to the case of using real lips. This implies that we can exploit the advantages of using lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as qualitative human evaluations. Additional ablation studies and a demo video in the supplementary material, containing qualitative comparisons and results, clearly illustrate the effectiveness of our approach.


  • Paper
    Sindhu Hegde*, Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    Visual Speech Enhancement Without A Real Visual Stream, WACV, 2021.
    [PDF]

Demo video: https://www.youtube.com/watch?v=y_oP9t7WEn4&feature=youtu.be


  1. Sindhu Hegde - (email protected)
  2. Prajwal K R - (email protected)
  3. Rudrabha Mukhopadhyay - (email protected)

PeeledHuman: Robust Shape Representation for Textured 3D Human Body Reconstruction

Sai Sagar Jinka,  Rohan Chacko,  Avinash Sharma,  P.J. Narayanan

IIIT Hyderabad

    [Demo Video]   

International Conference on 3D Vision, 2020


We introduce PeeledHuman - a novel shape representation of the human body that is robust to self-occlusions. PeeledHuman encodes the human body as a set of Peeled Depth and RGB maps in 2D, obtained by performing ray tracing on the 3D body model and extending each ray beyond its first intersection. This formulation allows us to handle self-occlusions efficiently compared to other representations. Given a monocular RGB image, we learn these Peeled maps in an end-to-end generative adversarial fashion using our novel framework - PeelGAN. We train PeelGAN using a 3D Chamfer loss and other 2D losses to generate multiple depth values per-pixel and a corresponding RGB field per-vertex in a dual-branch setup. In our simple non-parametric solution, the generated Peeled Depth maps are back-projected to 3D space to obtain a complete textured 3D shape. The corresponding RGB maps provide vertex-level texture details. We compare our method with current parametric and non-parametric methods in 3D reconstruction and find that we achieve state-of-the-art results. We demonstrate the effectiveness of our representation on the publicly available BUFF and MonoPerfCap datasets as well as loose clothing data collected by our calibrated multi-Kinect setup.


Our proposed representation encodes a human body as a set of Peeled Depth & RGB maps from a given view. These maps are backprojected to 3D space in the camera coordinate frame to recover the 3D human body.


In this paper, we tackle the problem of textured 3D human reconstruction from a single RGB image by introducing a novel shape representation, called PeeledHuman. Our proposed solution derives inspiration from the classical ray tracing approach in computer graphics. We estimate a fixed number of ray intersection points with the human body surface in the canonical view volume for every pixel in an image, yielding a multi-layered shape representation called PeeledHuman. PeeledHuman encodes a 3D shape as a set of depth maps called Peeled Depth maps. We further extend this layered representation to recover texture by capturing a discrete sampling of the continuous surface texture called Peeled RGB maps. Such a layered representation of the body shape addresses severe self-occlusions caused by complex body poses and viewpoint variations. Our representation is similar to depth peeling used in computer graphics for order-independent transparency. The proposed shape representation allows us to recover multiple 3D points that project to the same pixel in the 2D image plane. Thus, we reformulate the solution to the monocular textured 3D body reconstruction task as predicting a set of Peeled Depth & RGB maps. To achieve this dual-prediction task, we propose PeelGAN, a dual-task generative adversarial network that generates a set of depth and RGB maps in two different branches of the network. These predicted peeled maps are then back-projected to 3D space to obtain a point cloud. Our proposed representation enables an end-to-end, non-parametric and differentiable solution for textured 3D body reconstruction. It is important to note that our representation is not restricted only to human body models but can generalize well to any 3D shapes/scenes, given specific training data prior.
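The back-projection step described above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: a standard pinhole camera model is assumed, the map layout (a list of depth layers, one per ray intersection, with 0 marking misses) and the intrinsics are placeholders chosen for the example:

```python
def backproject_peeled_depth(depth_maps, fx, fy, cx, cy):
    """Back-project a set of peeled depth maps into one point cloud.

    depth_maps: list of H x W depth layers (nested lists); a value of 0
    marks pixels where the i-th peel level hit no surface.
    Returns a list of (X, Y, Z) points in the camera coordinate frame.
    (fx, fy, cx, cy) are standard pinhole intrinsics.
    """
    points = []
    for layer in depth_maps:              # one layer per ray intersection
        for v, row in enumerate(layer):
            for u, z in enumerate(row):
                if z <= 0:                # no surface at this peel level
                    continue
                # Invert the pinhole projection:
                # u = fx * X / Z + cx,  v = fy * Y / Z + cy
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points

# Two 2x2 peel layers: a front surface at z=1 and an occluded one at z=2.
cloud = backproject_peeled_depth(
    [[[1.0, 1.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 0.0]]],
    fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Because every nonzero depth across all layers contributes a point, occluded surfaces recovered at deeper peel levels appear in the same cloud as the visible front surface.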


PeelGAN overview: the dual-branch network generates Peeled Depth (D) and RGB (R) maps from an input image. The generated maps are each fed to a discriminator: one for RGB maps and one for Depth maps. The generated depth maps are back-projected to obtain the 3D human body, represented as a point cloud (p̂), in the camera coordinate frame. We employ a Chamfer loss between the reconstructed point cloud (p̂) and the ground-truth point cloud (p), along with several other 2D losses on the Peeled maps.
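The 3D Chamfer loss between the two point clouds can be sketched as below. This is a minimal illustration of the standard symmetric Chamfer distance; the paper's exact formulation (e.g. squared vs. root distances, weighting) may differ, and training code would use a batched, accelerated nearest-neighbour search rather than this O(N*M) loop:

```python
def chamfer_distance(p_hat, p):
    """Symmetric Chamfer distance between two point clouds: the average
    squared distance from each point to its nearest neighbour in the
    other cloud, summed over both directions."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    def one_sided(src, dst):
        # average squared distance from each src point to its nearest dst point
        return sum(min(sq_dist(s, d) for d in dst) for s in src) / len(src)

    return one_sided(p_hat, p) + one_sided(p, p_hat)

# Two single-point clouds one unit apart: each direction contributes 1.0.
d = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```

The loss is zero exactly when every point has a coincident neighbour in the other cloud, which is why it is a natural supervision signal for the back-projected Peeled Depth maps.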


  • We introduce PeeledHuman - a novel shape representation of the human body encoded as a set of Peeled Depth and RGB maps, that is robust to severe self-occlusions.
  • Our proposed representation is efficient in terms of both encoding 3D shapes as well as feed-forward time yielding superior quality of reconstructions with faster inference rates.
  • We propose PeelGAN - a complete end-to-end pipeline to reconstruct a textured 3D human body from a single RGB image using an adversarial approach. 
  • We introduce a challenging 3D dataset consisting of multiple human action sequences with variations in shape and pose, draped in loose clothing. We intend to release this data along with our code for academic use.

Related Publication

  • Rohan Chacko, Sai Sagar Jinka, Avinash Sharma, P.J. Narayanan - PeeledHuman: Robust Shape Representation for Textured 3D Human Body Reconstruction, International Conference on 3D Vision (3DV), 2020

Dear Commissioner, please fix these: A scalable system for inspecting road infrastructure


Inspecting and assessing the quality of traffic infrastructure (such as the state of signboards or road markings) is challenging for humans due to (i) the massive length of roads that countries have and (ii) the regular frequency at which this needs to be done. In this paper, we demonstrate a scalable system that uses computer vision for automatic inspection of road infrastructure from a simple video captured from a moving vehicle. We validated our method on 1500 km of roads captured in and around the city of Hyderabad, India. Qualitative and quantitative results demonstrate the feasibility, scalability and effectiveness of our solution.


Related Publications

  • Raghava Modhugu, Ranjith Reddy and C. V. Jawahar - Dear Commissioner, please fix these: A scalable system for inspecting road infrastructure, National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2019 [pdf]


We release a traffic signboard dataset of Indian roads with 51 different classes. [Dataset]