Towards Machine-understanding of Document Images
Minesh Mathew
Abstract
Imparting to machines the capability to understand documents as humans do is an AI-complete problem, since it involves multiple sub-tasks such as reading unstructured and structured text, understanding graphics and natural images, interpreting visual elements such as tables and plots, and parsing the layout and logical structure of the whole document. Except for a small percentage of documents in structured electronic formats, the majority of documents used today, such as documents in physical media, born-digital documents in image formats, and electronic documents like PDFs, are not readily machine readable. A paper document can easily be converted into a bitmap image using a flatbed scanner or a digital camera. Consequently, machine understanding of documents in practice requires algorithms and systems that can process document images, i.e., digital images of documents. The successful application of deep learning-based methods and the use of large-scale datasets have significantly improved the performance of the various sub-tasks that constitute the larger problem of machine understanding of document images. Deep learning-based techniques have been applied successfully to the detection and recognition of text, and to the detection and recognition of document sub-structures such as forms and tables. However, owing to the diversity of documents in terms of language, modality of the text present (typewritten, printed, handwritten, or born-digital), images and graphics (photographs, computer graphics, tables, visualizations, and pictograms), layout, and other visual cues, building generic solutions to the problem of machine understanding of document images is a challenging task.

In this thesis, we address some of the challenges in this space, such as text recognition in low-resource languages, information extraction from historical and handwritten collections, and multimodal modeling of complex document images. Additionally, we introduce new tasks that call for a top-down perspective of document image understanding, in which a document image is understood as a whole rather than in parts, different from the mainstream trend where the focus has been on solving various bottom-up tasks. Most existing tasks in Document Image Analysis (DIA) are independent bottom-up tasks that aim to obtain a machine-readable description of certain pre-defined document elements at various levels of abstraction, such as text tokens or tables. This thesis motivates a purpose-driven DIA, wherein a document image is analyzed dynamically, subject to a specific requirement set by a human user or an intelligent agent.

We first consider the problem of making document images printed in low-resource languages machine readable using an OCR, thereby making these documents AI-ready. To this end, we propose an end-to-end neural network model that directly transcribes a word or line image from a document into the corresponding Unicode text. We analyze how the proposed setup overcomes many of the challenges in text recognition for Indic languages. Results of our synthetic-to-real transfer learning experiments for text recognition demonstrate that models pre-trained on synthetic data and further fine-tuned on a portion of the real data perform as well as models trained purely on real data. For more than ten languages for which no public datasets for printed text recognition existed, we introduce a new dataset with more than one million word images in total.
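As a concrete illustration of what such an end-to-end recognizer might look like, the sketch below shows a CNN followed by a bidirectional LSTM trained with CTC loss, so that no character-level segmentation of the word or line image is required. The architecture, layer sizes, and names here are illustrative assumptions, not the exact configuration used in the thesis.

```python
# Minimal sketch of an end-to-end word/line recognizer (CNN + BiLSTM + CTC).
# Hyperparameters and class names are illustrative only.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        # Convolutional backbone: reduces height, keeps width as the time axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),      # halve height only
        )
        feat_h = img_height // 8
        self.rnn = nn.LSTM(256 * feat_h, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes + 1)   # +1 for the CTC blank symbol

    def forward(self, x):                  # x: (B, 1, H, W) grayscale images
        f = self.cnn(x)                    # (B, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)   # width as a sequence of frames
        seq, _ = self.rnn(f)
        return self.fc(seq)                # (B, W', num_classes + 1) per-frame logits

# Training with CTC, so the model maps the image directly to a Unicode label sequence:
# logits = model(images).log_softmax(2).permute(1, 0, 2)      # (T, B, C) for CTCLoss
# loss = nn.CTCLoss(blank=num_classes)(logits, targets, input_lens, target_lens)
```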
We further conduct an empirical study comparing different end-to-end neural network architectures for word and line recognition of printed text.

Another significant contribution of this thesis is the introduction of new tasks that require a holistic understanding of document images. Different from existing DIA tasks that solve independent bottom-up problems, we motivate a top-down perspective of DIA that requires a holistic understanding of the image and purpose-driven information extraction. To this end, we propose two tasks, DocVQA and InfographicVQA, fashioned along the lines of Visual Question Answering (VQA) in computer vision. For DocVQA, we report results using multiple strong baselines adapted from existing models for VQA and QA problems. For InfographicVQA, we propose a transformer-based, BERT-like model that jointly models vision, language, and layout. We conduct open challenges for both tasks, which have attracted hundreds of submissions so far.

Next, we work on the problem of information extraction from a document image collection. Recognizing text in historical and/or handwritten manuscripts is a major obstacle to information extraction from such collections. Similar to open-domain QA in NLP, we propose a new task in the context of document images that seeks to answer natural language questions asked over collections of manuscripts. We propose a two-stage, retrieval-based approach to the problem that uses deep features of word images and textual words. Our approach is recognition-free and returns image snippets as answers to the questions. Although our approach is recognition-free, and consequently oblivious to the semantics of the text in the documents, it can look for documents or document snippets that are lexically similar to the question. We show that our approach is a reasonable alternative when using text-based QA models is infeasible due to the difficulty of recognizing text in the document images.
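The snippet below sketches one way such a two-stage, recognition-free retrieval could be realized, assuming word-image features and query-word features have already been projected into a shared embedding space (for example, by a word-spotting style encoder). The function names, scoring scheme, and window-based snippet selection are illustrative assumptions rather than the thesis implementation.

```python
# Hedged sketch: recognition-free retrieval over word-image embeddings.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def rank_documents(query_embs, doc_word_embs, top_k=10):
    """Stage 1: rank documents by how well their word images match the query words.

    query_embs    : (Q, D) embeddings of the words in the question
    doc_word_embs : list of (N_i, D) arrays, word-image embeddings per document
    """
    q = l2_normalize(np.asarray(query_embs))
    scores = []
    for words in doc_word_embs:
        w = l2_normalize(np.asarray(words))
        sim = q @ w.T                          # (Q, N_i) cosine similarities
        scores.append(sim.max(axis=1).mean())  # best match per query word, averaged
    order = np.argsort(scores)[::-1][:top_k]
    return order, [scores[i] for i in order]

def best_snippet(query_embs, word_embs, window=5):
    """Stage 2: within a retrieved page, pick the word window most similar to the query."""
    q = l2_normalize(np.asarray(query_embs))
    w = l2_normalize(np.asarray(word_embs))
    window = min(window, len(w))
    sim = (q @ w.T).max(axis=0)                # (N,) best query match per word image
    best = max(range(len(sim) - window + 1),
               key=lambda i: sim[i:i + window].sum())
    return best, best + window                 # word-image indices of the answer snippet
```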
Year of completion: January 2024
Advisor: C V Jawahar