Dataset
We use our benchmark dataset, IIIT-INDIC-HW, in this competition for training and validation purposes only. The dataset contains word images in ten Indic scripts: Bengali, Devanagari, Gujarati, Gurumukhi, Kannada, Odia, Malayalam, Tamil, Telugu, and Urdu. Participants may use additional data for training, but they must provide proper information about that data.
Script / Language | Training Set | Validation Set | Test Set |
---|---|---|---|
Bengali | 82554 | 12947 | 18574 |
Devanagari | 69583 | 12708 | 13869 |
Gujarati | 82563 | 17643 | 17313 |
Gurumukhi | 81042 | 13627 | 18947 |
Kannada | 73517 | 13752 | 16730 |
Malayalam | 85270 | 11878 | 20635 |
Odia | 73400 | 11217 | 17851 |
Tamil | 75736 | 11598 | 17184 |
Telugu | 80637 | 19980 | 18898 |
Urdu | 71207 | 13906 | 15957 |
Input and Output Format Specifications
The training and validation sets contain word images (in ‘.jpg’ format); the corresponding ground truth transcriptions are provided in ‘train.txt’ and ‘val.txt’, respectively. These files list the path of each training or validation image together with its ground truth transcription, separated by a white space. The output should be saved as ‘script name_result.txt’, where ‘script name’ is the name of the script (e.g., bengali_result.txt). Each line of this file contains the name of a test word image and the corresponding prediction, separated by a white space.
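The file formats described above can be handled with a few lines of Python. The following is a minimal sketch, not an official tool. It assumes one image per line, UTF-8 encoded files, and image paths that contain no white space; none of these is stated in the specification. The function names `read_annotations` and `write_results` are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def read_annotations(path):
    """Read a file such as train.txt or val.txt.

    Assumed layout (not confirmed by the specification): one sample per
    line, '<image path> <transcription>', UTF-8 encoded.
    Returns a list of (image_path, transcription) tuples.
    """
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split only at the first run of white space, so the rest of
            # the line is kept intact as the transcription.
            parts = line.split(maxsplit=1)
            image_path = parts[0]
            text = parts[1] if len(parts) > 1 else ""
            samples.append((image_path, text))
    return samples


def write_results(script_name, predictions, out_dir="."):
    """Write '<script_name>_result.txt', one 'image_name prediction' per line.

    `predictions` is an iterable of (image_name, predicted_text) pairs.
    Returns the path of the written file.
    """
    out_path = Path(out_dir) / f"{script_name}_result.txt"
    with open(out_path, "w", encoding="utf-8") as f:
        for image_name, text in predictions:
            f.write(f"{image_name} {text}\n")
    return out_path
```

For example, `write_results("bengali", predictions)` would produce `bengali_result.txt` in the current directory, matching the naming pattern given above.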
Training Dataset
Training Dataset can be downloaded
Validation Dataset
Validation Dataset can be downloaded
Test Dataset
Test Dataset can be downloaded