IIIT-INDIC-HW-WORDS: A Dataset for Indic Handwritten Text Recognition
Handwritten text recognition (HTR) for Indian languages is not yet a well-studied problem, primarily because large annotated datasets in the associated scripts are unavailable. Existing datasets are small and use small lexicons, which makes them insufficient for building robust HTR solutions with modern machine learning techniques. In this work, we introduce a large-scale handwritten dataset for Indic scripts, referred to as the IIIT-INDIC-HW-WORDS dataset. The dataset consists of 872K handwritten instances written by 135 writers in 8 Indic scripts. Together with our earlier datasets IIIT-HW-DEV (Devanagari) and IIIT-HW-TELUGU (Telugu), the IIIT-INDIC-HW-WORDS dataset provides annotated handwritten word instances in all 10 prominent Indic scripts.
We further establish a high baseline for text recognition in eight Indic scripts. Our recognition scheme follows contemporary design principles from other recognition literature and yields competitive results on English. We also (i) study the reasons for changes in HTR performance across scripts and (ii) explore the utility of pre-training for Indic HTRs. We hope our efforts will catalyze research and fuel applications related to handwritten document understanding in Indic scripts.
Santhoshini Gongidi, C V Jawahar, INDIC-HW-WORDS: A Dataset for Indic Handwritten Text Recognition, International Conference on Document Analysis and Recognition (ICDAR), 2021.
For any queries about the dataset, please contact the authors below: