Development of Annotation Guidelines, Datasets and Deep Networks for Palm Leaf Manuscript Layout Understanding

Sowmya Aitha

Abstract

Ancient paper documents and palm leaf manuscripts from the Indian subcontinent have made a significant contribution to world literature and culture. These documents often have complex, uneven, and irregular layouts. Digitizing and deciphering their content without human intervention poses difficulties across a broad range of factors, including language, script, layout, elements, position, and number of manuscripts per image. Large-scale annotated Indic manuscript image datasets are needed for this kind of research. To meet this need, we present Indiscapes, the first dataset containing multi-regional layout annotations for ancient Indian manuscripts. We also adapt a fully convolutional deep neural network architecture for fully automatic, instance-level spatial layout parsing of manuscript images, in order to deal with challenges such as the presence of dense, irregular layout elements, pictures, multiple documents per image, and the wide variety of scripts. We demonstrate the effectiveness of the proposed architecture on images from the Indiscapes dataset. Despite these advancements, semantic layout segmentation using typical deep network methods is not robust to the complex deformations observed across semantic regions. This problem is particularly evident in the low-resource domain of Indian palm-leaf manuscripts. To help address this issue, we present Indiscapes2, a new expansive dataset of various Indic manuscripts with semantic layout annotations. Indiscapes2 is 150% larger than Indiscapes and contains materials from four different historical collections. In addition, we propose a novel deep network called Palmira for reliable, deformation-aware region segmentation in handwritten manuscripts. As a performance metric, we additionally report a boundary-centric measure called Hausdorff distance and its variations.
Our experiments show that Palmira produces reliable layouts and outperforms both strong baseline methods and ablative variants. We also highlight results on Arabic, South-East Asian and Hebrew historical manuscripts to showcase the generalization capability of Palmira. Although reliable deep-network based approaches for manuscript layout understanding now exist, these models implicitly assume one or two manuscripts per image. In real-world scenarios, however, multiple manuscripts are often scanned together into a single image to maximise the use of scanner surface area and reduce manual labour. Isolating (segmenting) each individual manuscript within a scanned image on a per-instance basis is therefore the first essential step in understanding the content of a manuscript. Hence, there is a need for a precursor system that extracts individual manuscripts before downstream processing. The highly curved and deformed boundaries of manuscripts, which frequently cause them to overlap with each other, add further complexity to this problem. To address these issues, we introduce a new document image dataset named IMMI (Indic Multi Manuscript Images). We also present a method that generates synthetic images to augment the sourced non-synthetic images, in order to increase the effectiveness of the dataset and facilitate deep network training. Our experiments use adapted versions of current document instance segmentation frameworks. The results demonstrate the efficacy of the adapted frameworks for the task. Overall, our contributions enable robust extraction of individual historical manuscript pages. This, in turn, could enable better performance on downstream tasks such as region-level instance segmentation, optical character recognition and word-spotting in historical Indic manuscripts at scale.
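
The abstract names Hausdorff distance as a boundary-centric measure but does not define it. As a rough illustration only, the sketch below computes the symmetric Hausdorff distance between two sets of boundary points, together with two commonly used variations (an averaged form and a 95th-percentile form). The function names, the choice of variations, and the toy polygons are assumptions made for this example; the thesis may define its variations differently.

```python
import numpy as np


def directed_distances(a, b):
    """For every point in `a`, return the distance to its nearest point in `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    diff = a[:, None, :] - b[None, :, :]
    pairwise = np.sqrt((diff ** 2).sum(axis=-1))
    return pairwise.min(axis=1)


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the worst-case boundary mismatch."""
    return float(max(directed_distances(a, b).max(),
                     directed_distances(b, a).max()))


def average_hausdorff(a, b):
    """Averaged variant (one of several conventions; assumed here):
    mean of the two directed mean distances."""
    return float(0.5 * (directed_distances(a, b).mean()
                        + directed_distances(b, a).mean()))


def hausdorff_95(a, b):
    """95th-percentile variant, less sensitive to a few outlier points."""
    return float(max(np.percentile(directed_distances(a, b), 95),
                     np.percentile(directed_distances(b, a), 95)))


# Hypothetical predicted and ground-truth region boundaries (corner points only).
predicted = [[0, 0], [4, 0], [4, 2], [0, 2]]
ground_truth = [[0, 0], [4, 0], [4, 3], [0, 3]]

print(hausdorff(predicted, ground_truth))
print(average_hausdorff(predicted, ground_truth))
print(hausdorff_95(predicted, ground_truth))
```

Unlike area-overlap scores, these measures respond directly to how far a predicted region boundary strays from the annotated one, which is why a boundary-centric metric suits manuscripts with curved or deformed region outlines. In practice the point sets would be densely sampled contour points rather than polygon corners.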

Year of completion:	May 2023
Advisor:	Ravi Kiran Sarvadevabhatla