IndicSceneText2017 Dataset

- This dataset comprises natural images containing 'scene text'.
- There are around 1000 images for each of the three Indic scripts under consideration: Devanagari, Telugu and Malayalam.
- Images are annotated at word level, with a bounding box and the corresponding 'text' ground truth for each word.
- For the task of scene text recognition, researchers might want to use the cropped word images available inside the 'cropped' folder.

Annotation Details
------------------------------
- Scene images are inside the respective script folders.
- Word bounding boxes and the corresponding 'text' annotations are available in an XML file.
	- For example, the annotations for image 200.jpeg are in 200.xml.
- For cropped word images, see the folder 'cropped'.
	- The word images and their corresponding ground truth are listed in WordImagesList.txt.
	- A few word images containing special characters were removed from this list. If you want to use all the word images, you may parse the XML annotation files directly.
	- In our work on benchmarking text recognition, we use only the images listed in WordImagesList.txt.
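
The layout described above can be explored with a few lines of Python. This is only a minimal sketch under stated assumptions, not code shipped with the dataset. The helper names (annotation_path_for, parse_word_list, count_xml_tags) are hypothetical. The only facts taken from this README are that 200.jpeg is annotated in 200.xml and that WordImagesList.txt pairs word images with ground truth. The exact line layout of WordImagesList.txt and the XML tag names are not documented here, so the list parser assumes one '<image path><whitespace><text>' entry per line, and the XML helper only counts tag names so that you can inspect the real schema before writing a full parser.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def annotation_path_for(image_path):
    """Map a scene image path to its XML annotation path (200.jpeg -> 200.xml)."""
    return Path(image_path).with_suffix(".xml")


def parse_word_list(lines):
    """Parse (image_path, ground_truth) pairs from the lines of the list file.

    ASSUMPTION: each non-empty line is '<image path><whitespace><text>'.
    Inspect WordImagesList.txt and adjust the split if the layout differs.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed layout
        pairs.append((parts[0], parts[1]))
    return pairs


def count_xml_tags(xml_text):
    """Count element tag names in an annotation file to discover its schema."""
    root = ET.fromstring(xml_text)
    return Counter(elem.tag for elem in root.iter())
```

When reading the real files, open them with an explicit encoding, for example open(path, encoding="utf-8"), since the ground truth contains Devanagari, Telugu and Malayalam text. That the files are UTF-8 encoded is also an assumption.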


Disclaimer
------------
The images were downloaded from the internet. Some of them are copyrighted.
The dataset is intended only for research purposes concerning text detection and recognition in scene images.

Related Publication
--------------------
If you use this dataset in your research work, please cite the following publication:
@inproceedings{IndicSceneText2017,
    title={Benchmarking Scene Text Recognition in Devanagari, Telugu and Malayalam},
    author={Mathew, Minesh and Jain, Mohit and Jawahar, C.~V.},
    booktitle={ICDAR-MOCR Workshop},
    year={2017}
}