IIIT-INDIC-HW-WORDS: A Dataset for Indic Handwritten Text Recognition


Abstract:

Handwritten text recognition (HTR) for Indian languages is not yet a well-studied problem. This is primarily due to the unavailability of large annotated datasets in the associated scripts. Existing datasets are small and use small lexicons. Such datasets are not sufficient to build robust solutions to HTR using modern machine learning techniques. In this work, we introduce a large-scale handwritten dataset for Indic scripts, referred to as the IIIT-INDIC-HW-WORDS dataset. The dataset consists of 872K handwritten instances written by 135 writers in 8 Indic scripts. With the newly introduced dataset and our earlier datasets IIIT-HW-DEV and IIIT-HW-TELUGU in Devanagari and Telugu respectively, the IIIT-INDIC-HW-WORDS dataset contains annotated handwritten word instances in all 10 prominent Indic scripts.

 

We further establish a high baseline for text recognition in eight Indic scripts. Our recognition scheme follows the contemporary design principles from other recognition literature, and yields competitive results on English. We also (i) study the reasons for changes in HTR performance across scripts, and (ii) explore the utility of pre-training for Indic HTRs. We hope our efforts will catalyze research and fuel applications related to handwritten document understanding in Indic scripts.

Dataset samples

[Figure: IIIT-INDIC-HW-WORDS dataset samples]

Dataset


 Language  Download link 
Bengali Overview Link
Gujarati Overview Link
Gurumukhi Overview Link
Kannada Overview Link
Odiya Overview Link
Malayalam Overview Link
Tamil Overview Link
Urdu Overview Link

 


Related Publications

Santhoshini Gongidi, C V Jawahar, INDIC-HW-WORDS: A Dataset for Indic Handwritten Text Recognition, International Conference on Document Analysis and Recognition (ICDAR), 2021.

Contact

For any queries about the dataset, please contact the authors below:

Santhoshini Gongidi