Dataset

For this competition, our benchmark dataset, IIIT-INDIC-HW, is used for training and validation only. The dataset contains word images in ten Indic scripts: Bengali, Devanagari, Gujarati, Gurumukhi, Kannada, Odia, Malayalam, Tamil, Telugu, and Urdu. Participants may use additional data for training, but they must provide proper information about any additional data used.

Script / Language   Training Set   Validation Set   Test Set
Bengali             82554          12947            18574
Devanagari          69583          12708            13869
Gujarati            82563          17643            17313
Gurumukhi           81042          13627            18947
Kannada             73517          13752            16730
Malayalam           85270          11878            20635
Odia                73400          11217            17851
Tamil               75736          11598            17184
Telugu              80637          19980            18898
Urdu                71207          13906            15957

Statistics of the Training, Validation, and Test Datasets

Input and Output Format Specifications

The training and validation sets contain word images (in ‘.jpg’ format); the corresponding ground truth transcriptions are provided in ‘train.txt’ and ‘val.txt’, respectively. Each line of ‘train.txt’ and ‘val.txt’ contains the path of a word image and its ground truth transcription, separated by a white space. The output should be saved as ‘<script name>_result.txt’ (e.g., ‘bengali_result.txt’); each line of this file contains the name of a test word image and the corresponding prediction, separated by a white space.
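
The format described above can be read and written with a few lines of Python. The following is only a minimal sketch, not an official tool: the function names (parse_annotation_line, read_annotations, write_results) are hypothetical, and it assumes that the files are UTF-8 encoded, that image paths contain no white space, and that the first run of white space on each line separates the path from the transcription.

```python
from pathlib import Path


def parse_annotation_line(line):
    """Split one annotation line into (image_path, transcription).

    Assumption: the image path contains no white space, so the first
    run of white space separates the path from the transcription.
    """
    parts = line.strip().split(maxsplit=1)
    if len(parts) != 2:
        raise ValueError(f"Malformed annotation line: {line!r}")
    return parts[0], parts[1]


def read_annotations(path):
    """Read a file such as train.txt or val.txt into a list of pairs.

    Assumption: the file is UTF-8 encoded; blank lines are skipped.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                pairs.append(parse_annotation_line(line))
    return pairs


def write_results(script_name, predictions, out_dir="."):
    """Write (image_name, prediction) pairs to '<script name>_result.txt'.

    Each output line holds the image name and the predicted text,
    separated by a single space. Returns the path of the written file.
    """
    out_path = Path(out_dir) / f"{script_name}_result.txt"
    with open(out_path, "w", encoding="utf-8") as f:
        for image_name, text in predictions:
            f.write(f"{image_name} {text}\n")
    return out_path
```

For example, calling read_annotations("train.txt") would return the list of (image path, transcription) pairs, and write_results("bengali", predictions) would create ‘bengali_result.txt’ in the current directory. Whether the test image entries should be bare file names or full paths is not specified here, so the sketch writes whatever names it is given.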


Training Dataset

Training Dataset can be downloaded


Validation Dataset

Validation Dataset can be downloaded


Test Dataset

Test Dataset can be downloaded