Towards Handwriting Recognition and Search in Indic & Latin Scripts

Santhoshini Gongidi

Abstract

ML-powered document image analysis can enable intelligent solutions for bringing handwritten information into the digital world. Two major components of handwriting understanding are handwritten text recognition (HTR) and handwritten search. The former converts handwritten text to a digital format, whereas the latter provides easy access to handwritten information scattered across books, archives, manuscripts and so on. This thesis addresses both problems. Handwritten document image analysis for Indic scripts is still in its nascent stage compared to Latin scripts. For example, many commercial applications and open-source demonstrations are available for Latin scripts, whereas hardly any such applications are known for Indic scripts. Developing solutions for Indic scripts is challenging due to (i) the variety of scripts within the Indian subcontinent, (ii) the lack of large annotated datasets and the difficulty of collecting data for multiple scripts, and (iii) inherent challenges in Indic scripts, such as inflections and the joining of multiple glyphs to form an akshara (equivalent to a character in Latin scripts). While challenging, it is also crucial to develop HTR and handwritten search approaches for Indic scripts. Therefore, this thesis focuses mainly on approaches for Indic HTR and Indic handwritten search. In the last two decades, large digitization projects have converted paper documents and ancient historical manuscripts into digital form. However, these often remain inaccessible due to the unavailability of robust HTR solutions. Recognizing handwritten text is fundamental to any modern document analysis system. In recent years, efforts toward developing text recognition systems have advanced due to the success of deep neural networks and the availability of annotated datasets. This is especially true for Latin scripts.
In this thesis, we discuss the standard text recognition pipeline, which comprises various neural network modules. Then, we present a simple and effective way to improve the text recognition pipeline and its training approach. We report the improvement from our approach on four benchmark datasets in Latin and Indic scripts. The existing state-of-the-art approaches for Latin HTR and Latin handwritten search are highly data-driven. Due to the lack of large-scale data, developing Indic HTR and Indic handwritten search is challenging. Therefore, we release a collective Indic handwritten dataset with text images from 10 widely used Indic scripts. We establish a high baseline for text recognition in prominent Indic scripts. Our recognition scheme follows contemporary design principles from other recognition literature and yields competitive results on English. We also explore the utility of pre-training for Indic HTR. We hope our efforts will catalyze research and fuel applications related to handwritten document understanding in Indic scripts. Finally, we investigate the problem of handwritten search and retrieval for unlabeled collections. Handwritten search pipelines are needed in online platforms such as e-libraries and digital archives. Such pipelines can efficiently search through handwritten collections and present relevant results, much like Google Search. With its ease of access and time-saving capability, a handwritten search application can prove valuable to the many communities that study such historical documents. In this thesis, we present one such pipeline for handwritten search that performs retrieval on new and unseen collections. The proposed retrieval approach is not fine-tuned for the specific writing styles or unknown vocabulary of the new collection. Therefore, it can be applied to new unlabeled collections.

Year of completion:	November 2022
Advisor:	C V Jawahar