IndicSceneText2017 Dataset

- This dataset comprises natural images containing 'scene text'.
- There are around 1000 images for each of the three Indic scripts under consideration: Devanagari, Telugu and Malayalam.
- Images are annotated at word level, with each word's bounding box and the corresponding 'text' ground truth.
- For the task of scene text recognition, researchers might want to use the cropped word images available inside the 'cropped' folder.

Annotation Details
------------------------------
- Scene images are inside the respective script folders.
- Word bounding boxes and the corresponding 'text' annotations are available in an xml file; for example, for image 200.jpeg the annotations are in 200.xml.
- For cropped word images, see the folder 'cropped'.
- The word images and their corresponding ground truth are listed in WordImagesList.txt.
- A few word images containing special characters are removed from the above list. If you want to use all the word images, you may parse the xml annotation files directly.
- In our work benchmarking text recognition, we use only the images listed in the WordImagesList.txt file.

Disclaimer
------------
The images were downloaded from the internet. Some of them are copyrighted images. The dataset is intended only for research purposes concerning text detection and recognition in scene images.

Related Publication
--------------------
If you use this dataset in your research work, please cite the publication below:

@inproceedings{IndicSceneText2017,
  title={Benchmarking Scene Text Recognition in Devanagari, Telugu and Malayalam},
  author={Mathew, Minesh and Jain, Mohit and Jawahar, C.~V.},
  booktitle={ICDAR-MOCR Workshop},
  year={2017}
}
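
Example: pairing images with their xml annotations
---------------------------------------------------
For readers who want to parse the xml annotation files directly, the naming convention described under Annotation Details (200.jpeg is annotated in 200.xml) can be sketched in Python as below. This is only an illustrative sketch, not part of the dataset tooling. The function names and the accepted image extensions are assumptions, and because this README does not document the xml schema, the second helper only counts whatever element tags it finds rather than assuming tag names.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET

# Assumption: only the .jpeg extension is confirmed by the README;
# the other extensions are accepted here as a precaution.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def pair_images_with_annotations(script_dir):
    """Return (image_path, xml_path) pairs for images with a same-named .xml file."""
    pairs = []
    for image_path in sorted(Path(script_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # 200.jpeg -> 200.xml, following the convention stated in the README.
        xml_path = image_path.with_suffix(".xml")
        if xml_path.is_file():
            pairs.append((image_path, xml_path))
    return pairs


def summarize_xml_tags(xml_path):
    """Count the element tags in one annotation file without assuming a schema."""
    root = ET.parse(xml_path).getroot()
    return Counter(element.tag for element in root.iter())
```

Calling pair_images_with_annotations on one of the script folders and then summarize_xml_tags on the first returned xml path is a quick way to discover the actual tag names before writing a full parser for bounding boxes and text.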