Scene Text Recognition for Indian Languages

Sanjana Gunna

Abstract

Text recognition has been an active field in computer vision since before the deep learning era. Because recognition models have varied applications, the research area is divided into categories based on the domain of the data used. Optical character recognition (OCR) focuses on scanned documents, whereas images of natural scenes with much more complex backgrounds fall into the category of scene text recognition. Scene text recognition has become an exciting area of research because of difficulties such as complex backgrounds, improper illumination, distorted and noisy images, inconsistent fonts and font sizes, and text that is often not horizontally aligned. Such cases make the task of scene text recognition more challenging. In recent years, we have observed the rise of deep learning. Subsequently, there has been steady growth in the recognition algorithms and datasets available for training and testing. This surge has raised the performance of recognizing text in natural scenes above that of the baseline models previously trained using hand-crafted features. Most of these works focused on Latin text and did not investigate scene text recognition for non-Latin languages in depth. Upon scrutiny, we observe that the performance of the current best recognition models has reached above 90% on scene text benchmark datasets. However, these models do not perform as well on non-Latin languages as they do on Latin (or English) datasets. This striking difference in performance across languages is a rising concern among researchers focusing on low-resource languages, and it is the motivation behind our work. Scene text recognition in low-resource non-Latin languages is challenging due to inherently complex scripts, multiple writing systems, and varied fonts and orientations.
Despite these challenges, performance comparable to that on Latin (English) text can be achieved for low-resource non-Latin languages. In this thesis, we look at the parameters involved in the process of text recognition and determine their importance through thorough experiments. We use synthetic data for controlled experiments in which we test these parameters in isolation to identify the factors that drive text recognition performance. We analyse the complexity of the scripts via these synthetic data experiments. We present the results of our experiments on two baseline models, CRNN and STAR-Net, on available datasets to ensure generalisability. In addition, we propose an error correction module that corrects the labels by utilizing the training data of the real test datasets. To further improve the results on real test datasets, we explore transfer learning from English to exploit the abundant data available for learning. We show that transfer from English is not feasible and that it actually lowers the performance of the individual language models. Given the failure of the English transfer experiments, we shift our focus to the Indian languages alone and examine the characteristics of each language via character n-gram plots, visual features such as vowel signs and conjunct characters, and other word statistics. These languages also resemble one another in certain other respects. We then apply transfer learning across Indian languages to enhance the performance of the language models. We show that, on real datasets, transfers among visually similar Indian languages perform close to, and sometimes better than, the individual models. Transfers among these languages prove much more profitable than transfers from English. We also study the significance of the variety and number of fonts used during data generation via synthetic data experiments on English test datasets.
Synthetic data incorporates a variety of fonts to ensure the diversity needed for robust recognition systems. To strengthen data diversity, we incorporate over 500 Hindi fonts (including Unicode and non-Unicode fonts) into the synthetic data for improved performance on Hindi real test datasets. We also describe a process for incorporating non-Unicode fonts of Indian languages into training without errors. In addition to these fonts, we add an augmentation pipeline that further increases the diversity of the data. We utilize more than nine augmentation techniques to boost the performance of Hindi STR systems. Our evaluations in natural settings show significant improvements over previous works. Through our experiments, we set new benchmark accuracies for STR on the Hindi, Telugu, and Malayalam languages from the IIIT-ILST dataset, with gains of 6%, 5%, and 2% in Word Recognition Rate (WRR) over previous works. Similarly, we achieve a 23% improvement in WRR for the Bangla language from the MLT-17 dataset. We further improve this result by incorporating the error correction module mentioned above into the training pipeline. In addition, we release two STR datasets, for Gujarati and Tamil, comprising 440 scene images that are further divided into 500 Gujarati and 2535 Tamil cropped word images. We report gains of 5% and 3% in WRR over our baseline models for Gujarati and Tamil, respectively. We also establish benchmark results for the MLT-19 and Bangla datasets with 8% and 4% improvements in WRR over baselines. Further enriching the synthetic dataset with non-Unicode fonts and multiple augmentations helps us achieve a WRR gain of over 33% on the IIIT-ILST Hindi dataset. Additionally, we implement a lexicon-based transcription approach that uses a dynamic lexicon for each image during testing, and we present the results for the languages mentioned above.
Keywords – Scene text recognition · transfer learning · photo OCR · multilingual OCR · Indian Languages · Indic OCR · Synthetic Data · Data Diversity
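For readers unfamiliar with the evaluation terms used in the abstract, the following is a minimal illustrative sketch, not the thesis implementation. It assumes that Word Recognition Rate is the fraction of word images whose predicted transcription exactly matches the ground truth, and that lexicon-based transcription replaces a raw prediction with the nearest word in a per-image lexicon under Levenshtein edit distance. The abstract does not specify the matching rule, so the edit-distance choice and the function names (edit_distance, lexicon_correct, word_recognition_rate) are assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed over code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def lexicon_correct(prediction: str, lexicon: Sequence[str]) -> str:
    """Return the lexicon word closest to the prediction (assumed rule).

    With an empty lexicon the raw prediction is returned unchanged.
    """
    if not lexicon:
        return prediction
    return min(lexicon, key=lambda word: edit_distance(prediction, word))


def word_recognition_rate(predictions: Sequence[str],
                          ground_truths: Sequence[str]) -> float:
    """Fraction of words whose prediction exactly matches the ground truth."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not ground_truths:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)


if __name__ == "__main__":
    # Hypothetical predictions and per-image lexicons, for illustration only.
    truths = ["hello", "world"]
    raw = ["helo", "world"]
    lexicons = [["hello", "help"], ["word", "world"]]
    fixed = [lexicon_correct(p, lex) for p, lex in zip(raw, lexicons)]
    print(word_recognition_rate(raw, truths))
    print(word_recognition_rate(fixed, truths))
```

Because Indic scripts combine base consonants with vowel signs and conjuncts, an edit distance over code points, as used here, can differ from one over visible characters; which unit is appropriate is a design choice this sketch does not settle.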

Year of completion:	June 2022
Advisor :	C V Jawahar