ICFHR2022 Competition on Indic Handwriting Text Recognition

Dataset

We use our benchmark dataset, IIIT-INDIC-HW, for this competition, for training and validation purposes only. The dataset contains word images in ten Indic scripts: Bengali, Devanagari, Gujarati, Gurumukhi, Kannada, Odia, Malayalam, Tamil, Telugu, and Urdu. Participants may use additional data for training, but they must provide proper information about that data.

Script / Language	Training Set	Validation Set	Test Set
Bengali	82554	12947	18574
Devanagari	69583	12708	13869
Gujarati	82563	17643	17313
Gurumukhi	81042	13627	18947
Kannada	73517	13752	16730
Malayalam	85270	11878	20635
Odia	73400	11217	17851
Tamil	75736	11598	17184
Telugu	80637	19980	18898
Urdu	71207	13906	15957

Statistics of the Training, Validation, and Test Sets

Input and Output Format Specifications

The training and validation sets contain word images (in ‘.jpg’ format). The corresponding ground truth transcriptions are available in ‘train.txt’ and ‘val.txt’, respectively. These files list the path of each training or validation image along with its ground truth transcription, separated by a white space. The output should be saved as ‘script name_result.txt’ (e.g., ‘bengali_result.txt’). Each line of this file contains the name of a test word image and the corresponding prediction, separated by a white space.
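The format described above could be handled with a short Python sketch such as the one below. It is only an illustration, not an official competition tool. The function names (parse_ground_truth, format_results, write_results) are hypothetical. The sketch also assumes that each line holds exactly one image path and one transcription, that image paths contain no white space, and that the files are UTF-8 encoded; none of these points are stated in the specification.

```python
from pathlib import Path


def parse_ground_truth(lines):
    """Turn lines of 'image_path transcription' into (path, text) pairs.

    Assumption: the image path contains no white space, so the first run
    of white space on a line separates the path from the transcription.
    """
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)
        image_path = parts[0]
        text = parts[1] if len(parts) > 1 else ""
        samples.append((image_path, text))
    return samples


def format_results(predictions):
    """Build one 'image_name prediction' line per test word image."""
    return [f"{name} {text}" for name, text in predictions]


def write_results(script_name, predictions, out_dir="."):
    """Save predictions as '<script name>_result.txt', e.g. bengali_result.txt.

    Assumption: UTF-8 output encoding.
    """
    out_path = Path(out_dir) / f"{script_name}_result.txt"
    body = "\n".join(format_results(predictions)) + "\n"
    out_path.write_text(body, encoding="utf-8")
    return out_path
```

In this sketch, the ground truth would be read with something like parse_ground_truth(open("train.txt", encoding="utf-8")), and write_results("bengali", predictions) would produce bengali_result.txt in the current directory. The location of train.txt is likewise an assumption.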

Training Dataset

Training Dataset can be downloaded

Validation Dataset

Validation Dataset can be downloaded

Test Dataset

Test Dataset can be downloaded