Interactive Layout Parsing of Highly Unstructured Document Images

Abhishek Trivedi

Abstract

Ancient handwritten documents were among the earliest forms of written media and form part of the most valuable cultural and natural heritage of many countries. They hold early written knowledge about subjects such as science, medicine, Buddhist doctrines and astrology. India has the most extensive collection of manuscripts, and studies have been conducted on digitizing them to pass their wealth of wisdom on to future generations. Targeted annotation systems for automatic multi-region instance segmentation of these document images exist, but their layout predictions are of lower quality than human annotations. Precise boundary annotations of image regions in historical document images are crucial for downstream applications like OCR, which rely on region-class semantics. Some document collections contain densely laid out, highly irregular, and overlapping multi-class region instances with a wide range of aspect ratios. To address this, this thesis proposes a web-based layout annotation and analytics system. The system, called HInDoLA, features an intuitive annotation GUI and a graphical analytics dashboard, and interfaces with machine-learning-based intelligent modules on the backend. HInDoLA has helped us create Indiscapes, the first large-scale dataset for layout parsing of Indic palm-leaf manuscripts. Because domain experts are often non-technical, the tool offers an interactive and relatively fast annotation process through two modes: Fully Automatic Mode and Semi-Supervised Intelligent Mode. We then discuss the superiority of the semi-supervised approach over fully automatic approaches for historical document annotation. Fully automatic boundary estimation approaches tend to be data-intensive, cannot handle variable-sized images, and produce sub-optimal results for the dense, irregular images described above.
BoundaryNet, a novel resizing-free approach for high-precision semi-automatic layout annotation, is the other main contribution of this thesis. An attention-guided skip network first processes the variable-sized, user-selected region of interest. The network optimization is guided by Fast Marching distance maps to obtain a good-quality initial boundary estimate and an associated feature representation. These outputs are processed by a Residual Graph Convolution Network, optimized using a Hausdorff loss, to obtain the final region boundary. Experiments on a challenging manuscript image dataset demonstrate that BoundaryNet outperforms strong baselines and produces high-quality semantic region boundaries. Qualitatively, our approach generalizes across multiple document image datasets containing different script systems and layouts, all without additional fine-tuning.
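
As a rough illustration of what a Hausdorff-based objective measures, the sketch below computes the symmetric Hausdorff distance between two boundaries given as lists of 2D points. It is not the loss implementation used in BoundaryNet (which would need to be differentiable and operate on network outputs); the function names and the example polygons are hypothetical and chosen only for illustration.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical predicted and ground-truth boundary vertices (x, y).
predicted = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
target = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 6.0)]

# Three vertices coincide; the fourth is off by 3 units, so the
# worst-case boundary deviation dominates the result.
print(hausdorff(predicted, target))  # 3.0
```

Because the measure is driven by the single worst-matched point, it penalizes local boundary errors that area-overlap measures can average away, which is one plausible reason such a measure suits precise boundary annotation.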

Year of completion:	May 2022
Advisor:	Ravi Kiran Sarvadevabhatla