Towards Machine-understanding of Document Images


Minesh Mathew

Abstract

Imparting to machines the capability to understand documents as humans do is an AI-complete problem, since it involves multiple sub-tasks such as reading unstructured and structured text, understanding graphics and natural images, interpreting visual elements such as tables and plots, and parsing the layout and logical structure of the whole document. Except for a small percentage of documents in structured electronic formats, a majority of the documents used today, such as documents in physical mediums, born-digital documents in image formats, and electronic documents like PDFs, are not readily machine readable. A paper-based document can easily be converted into a bitmap image using a flatbed scanner or a digital camera. Consequently, machine understanding of documents in practice requires algorithms and systems that can process document images, that is, digital images of documents. The successful application of deep learning-based methods and the use of large-scale datasets have significantly improved performance on the various sub-tasks that constitute the larger problem of machine understanding of document images. Deep learning-based techniques have been applied successfully to the detection and recognition of text and of various document sub-structures such as forms and tables. However, owing to the diversity of documents in terms of language, modality of text present (typewritten, printed, handwritten or born-digital), images and graphics (photographs, computer graphics, tables, visualizations, and pictograms), layout and other visual cues, building generic solutions to the problem of machine understanding of document images is a challenging task. In this thesis, we address some of the challenges in this space, such as text recognition in low-resource languages, information extraction from historic/handwritten collections and multimodal modeling of complex document images.
Additionally, we introduce new tasks that call for a top-down perspective of document image understanding (understanding a document image as a whole, not in parts), different from the mainstream trend where the focus has been on solving various bottom-up tasks. Most of the existing tasks in Document Image Analysis (DIA) are independent bottom-up tasks that aim to get a machine-readable description of certain pre-defined document elements at various abstractions, such as text tokens or tables. This thesis motivates a purpose-driven DIA wherein a document image is analyzed dynamically, subject to a specific requirement set by a human user or an intelligent agent.
We first consider the problem of making document images printed in low-resource languages machine-readable using OCR, thereby making these documents AI-ready. To this end, we propose to use an end-to-end neural network model that can directly transcribe a word or line image from a document to the corresponding Unicode transcription. We analyze how the proposed setup overcomes many challenges to text recognition of Indic languages. Results of our synthetic-to-real transfer learning experiments for text recognition demonstrate that models pre-trained on synthetic data and further fine-tuned on a portion of the real data perform as well as models trained purely on real data. For 10+ languages for which there have not been public datasets for printed text recognition, we introduce a new dataset that has more than one million word images in total. We further conduct an empirical study to compare different end-to-end neural network architectures for word and line recognition of printed text.
Another significant contribution of this thesis is the introduction of new tasks that require a holistic understanding of document images. Different from existing tasks in DIA that attempt to solve independent bottom-up tasks, we motivate a top-down perspective of DIA that requires a holistic understanding of the image and purpose-driven information extraction. To this end, we propose two tasks, DocVQA and InfographicVQA, fashioned along Visual Question Answering (VQA) in computer vision. For DocVQA, we show results using multiple strong baselines that are adapted from existing models for existing VQA and QA problems. For InfographicVQA, we propose a transformer-based, BERT-like model that jointly models multimodal (vision, language, and layout) input. We conduct open challenges for both tasks, attracting hundreds of submissions so far.
Next, we work on the problem of information extraction from a document image collection. Recognizing text from historical and/or handwritten manuscripts is a major challenge to information extraction from such collections. Similar to open-domain QA in NLP, we propose a new task in the context of document images that seeks to get answers for natural language questions asked on collections of manuscripts. We propose a two-stage retrieval-based approach for the problem that uses deep features of word images and textual words. Our approach is recognition-free and returns image snippets as answers to the questions. Although our approach is recognition-free and consequently oblivious to the semantics of the text in the documents, it can look for documents or document snippets that are lexically similar to the question. We show that our approach is a reasonable alternative when using text-based QA models is infeasible due to the difficulty in recognizing text in the document images.

Year of completion:  January 2024
 Advisor : C V Jawahar

Related Publications


    Downloads

    thesis

    Surrogate Approximations for Similarity Measures


    Nagender G

    Abstract

    This thesis targets the problem of surrogate approximations for similarity measures to improve their performance in various applications. We present surrogate approximations for the popular dynamic time warping (DTW) distance, canonical correlation analysis (CCA), and the Intersection-over-Union (IoU), PCP, and PCKh measures. For DTW and CCA, our surrogate approximations are based on their corresponding definitions. For the IoU, PCP, and PCKh measures, we present a surrogate approximation using neural networks.

    First, we propose a linear approximation for the naïve DTW distance. We aim to speed up the DTW distance computation by learning the optimal alignment from the training data. In our next contribution, we propose a surrogate kernel approximation over CCA, which enables us to use CCA in the kernel framework, further improving its performance. In our final contribution, we propose a surrogate approximation technique using neural networks to learn a surrogate loss function over the IoU, PCP, and PCKh measures. For the IoU loss, we validated our method on semantic segmentation models. For the PCP and PCKh losses, we validated it on human pose estimation models.
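
For readers unfamiliar with the baseline being approximated, the following is a minimal sketch of the standard naïve DTW distance, computed by dynamic programming in time proportional to the product of the two sequence lengths. It is not the thesis's linear approximation; the function name `dtw_distance` and the choice of absolute difference as the local cost are assumptions made for illustration only.

```python
import math


def dtw_distance(a, b):
    """Naive dynamic time warping distance between two 1-D numeric sequences.

    Illustrative baseline only: the local cost |x - y| is an assumption,
    and this is not the learned approximation described in the abstract.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] matched again to an earlier b element
                cost[i][j - 1],      # b[j-1] matched again to an earlier a element
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]
```

Because every cell of the table is filled, the cost grows quadratically with sequence length, which is the expense a faster approximation would aim to avoid.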

     

    Year of completion:  March 2023
     Advisor : C V Jawahar

    Related Publications


      Downloads

      thesis

      Epsilon Focus Photography: A Study of Focus, Defocus and Depth-of-field


      Parikshit Sakurikar

      Abstract

      Focus, defocus and depth-of-field are integral aspects of a photograph captured using a wide-aperture camera. Focus and defocus blur provide critical cues for estimation of scene depth and structure, which helps in scene understanding or post-capture image manipulation. Focus and defocus blur are also used creatively by photographers to produce remarkable compositional effects such as emphasis on the foreground subject with aesthetic bokeh in the background. Epsilon Focus Photography is a branch of computational photography that deals with the capture and processing of multi-focus imagery, where multiple wide-aperture images are captured with a small change in focus position. In this thesis, we provide a comprehensive study of various problems in epsilon focus photography along with a detailed analysis of the related work in the area. We provide useful constructs for the understanding and manipulation of focus, defocus blur and the depth-of-field of an image. The work in this thesis can be divided into four broad categories: measurement, representation, manipulation and applications of focus. Measuring focus is a long-studied and challenging problem in computer vision. We study various methods to measure focus and propose a composite measure of focus that combines the strengths of well-known focus measures. We then study the task of post-capture focus manipulation at each pixel in an image and formulate a novel representation of focus that can find much use in image editing toolkits. Our representation can faithfully encode the fine characteristics of a wide-aperture image even at complex interaction locations such as depth-edges and over-saturated background regions, while optimizing the memory footprint of multi-focus imagery. Apart from precise geometric constructs for scene refocusing, we also propose a data-driven approach for post-capture scene refocusing using deep adversarial learning.
      We show how the tasks of deblurring an image, magnification of the defocused content and overall comprehensive focus manipulation can be efficiently modeled using conditional adversarial networks. We study several applications of focus in computer vision such as view interpolation and depth-from-focus. We provide a tool that can interpolate different views of a scene based on focus texture segmentation and propose a novel solution for depth-from-focus using the proposed composite focus measure. In summary, this thesis consists of a comprehensive study of epsilon focus photography and its applications in the context of computer vision and computational photography.

      Year of completion:  August 2021
       Advisor : P J Narayanan

      Related Publications


        Downloads

        thesis

        Optimization for and by Machine Learning


        Pritish Mohapatra

        Abstract

        In machine learning, tasks like making predictions using a model and learning model parameters can often be formulated as optimization problems. The feasibility of using a machine learning model depends on the efficiency with which the corresponding optimization problems can be solved. As such, the area of machine learning throws up many challenges and interesting problems for research in the field of optimization. While in some cases, it is possible to directly apply off-the-shelf optimization methods for problems in machine learning, in many other cases, it becomes necessary to develop optimization algorithms that are tailor-made for specific problems. On the other hand, developing optimization algorithms for specific problem domains can itself be helped by machine learning techniques. Learning optimization algorithms from data can help relieve tedious effort required to develop optimization methods for new problem domains. The challenge here is to appropriately parameterize the space of algorithms for different optimization problems. In this context, we explore the interplay between the areas of optimization and machine learning and make contributions in specific problems of interest that lie in the overlap of these fields.

         

        Year of completion:  December 2021
         Advisor : C. V. Jawahar

        Related Publications

        • Pritish Mohapatra, C. V. Jawahar and M. Pawan Kumar - Learning to Round for Discrete Labeling Problems, Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 09 - 11 April 2018, Playa Blanca, Lanzarote. [PDF]

        • Pritish Mohapatra, Michal Rolínek, C. V. Jawahar, Vladimir Kolmogorov and M. Pawan Kumar - Efficient Optimization for Rank-based Loss Functions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 18 - 22 June 2018, Salt Lake City, Utah. [PDF]

        • Pritish Mohapatra, Puneet Kumar Dokania, C. V. Jawahar and M. Pawan Kumar - Partial Linearization based Optimization for Multi-class SVM, Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 2016. [PDF]

        • Aseem Behl, Pritish Mohapatra, C. V. Jawahar and M. Pawan Kumar - Optimizing Average Precision using Weakly Supervised Data, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI 2015). [PDF]

        • Mohak Sukhwani, Suriya Singh, Anirudh Goyal, Aseem Behl, Pritish Mohapatra, Brijendra Kumar Bharti and C. V. Jawahar - Monocular Vision based Road Marking Recognition for Driver Assistance and Safety, Proceedings of the IEEE Conference on Vehicular Electronics and Safety, 16-17 Dec 2014, Hyderabad, India. [PDF]

        • Pritish Mohapatra, C. V. Jawahar and M. Pawan Kumar - Efficient Optimization for Average Precision SVM, Proceedings of the Neural Information Processing Systems Foundation, 08-13 Dec 2014, Quebec, Canada. [PDF]


        Downloads

        thesis

         ppt

        Learning Representations for Word Images


        Praveen Krishnan

        Abstract

        Reading and writing documents are among the primary skills with which we gather and communicate information. With the emergence of artificial intelligence (AI), researchers are in constant pursuit of intelligent algorithms that can bring our physical and digital worlds closer to each other. One such important domain is document image analysis, where we delve into the problem of understanding content from scanned document image collections. Considering "words" as the basic unit in understanding a document, in this thesis, we address the problem of finding the best possible representation for word images. Representation learning has been a key line of investigation for AI problems. The primary goal of this thesis is to learn efficient representations for word images that encode their content. An ideal representation should be invariant to multiple fonts and handwriting styles, and less sensitive to noise and distortions. In the past, representations have been handcrafted, specific to modalities (printed, handwritten), and sensitive to the complexities in handwriting in multi-writer scenarios. In this work, we choose the paradigm of learning from data using deep neural networks. We take our inspiration from the fact that given large amounts of annotated data, modern deep neural networks can inherently learn better representations. In this thesis, we also relax the need for large annotated datasets by heavily capitalizing on synthetically generated images. We also introduce a novel problem of learning semantic representations for word images, which encode the semantics of the word and reduce the vocabulary gap that exists between the query and the retrieved results. The first contribution of this thesis is a simple technique to generate large amounts of synthetic data, useful for pre-training deep neural networks. This led to the creation of the IIIT-HWS dataset, which is now widely used in the document community.
        The other major contributions of this thesis are: (a) the design of a deep convolutional architecture (named HWNet) for learning an efficient holistic representation for word images, (b) a joint embedding scheme to project words and textual strings onto a common subspace, and (c) a novel form of word image representation which respects the word form along with its semantic meaning. The learned representations are evaluated on the tasks of word spotting and word recognition. We report state-of-the-art performance on popular datasets under both modern/historical and handwritten/printed document images while keeping the representation size compact. Finally, in order to validate the proposed representations of this thesis, we present some interesting use cases such as (i) finding similarity between a pair of handwritten document images, (ii) searching for keywords from online lecture videos, and (iii) building a word retrieval system for Indic scripts.

         

        Year of completion:  November 2020
         Advisor : C V Jawahar

        Related Publications

        • Siddhant Bansal, Praveen Krishnan and C.V. Jawahar - Improving Word Recognition using Multiple Hypotheses and Deep Embeddings, The 25th International Conference on Pattern Recognition (ICPR 2021), Milano [PDF]

        • Kartik Dutta, Praveen Krishnan, Minesh Mathew and C.V. Jawahar - Improving CNN-RNN Hybrid Networks for Handwriting Recognition, The 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) 2018, Niagara Falls, USA [PDF]

        • Kartik Dutta, Praveen Krishnan, Minesh Mathew and C.V. Jawahar - Towards Spotting and Recognition of Handwritten Words in Indic Scripts, The 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) 2018, Niagara Falls, USA [PDF]

        • Kartik Dutta, Praveen Krishnan, Minesh Mathew and C.V. Jawahar - Localizing and Recognizing Text in Lecture Videos, The 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) 2018, Niagara Falls, USA [PDF]

        • Vijay Rowtula, Praveen Krishnan, C.V. Jawahar - POS Tagging and Named Entity Recognition on Handwritten Documents, ICON, 2018[PDF]

        • Praveen Krishnan, Kartik Dutta and C. V. Jawahar - Word Spotting and Recognition using Deep Embedding, Proceedings of the 13th IAPR International Workshop on Document Analysis Systems, 24-27 April 2018, Vienna, Austria. [PDF]

        • Kartik Dutta, Praveen Krishnan, Minesh Mathew and C. V. Jawahar - Offline Handwriting Recognition on Devanagari using a new Benchmark Dataset, Proceedings of the 13th IAPR International Workshop on Document Analysis Systems, 24-27 April 2018, Vienna, Austria. [PDF]

        • Kartik Dutta, Praveen Krishnan, Minesh Mathew, and C. V. Jawahar - Towards Accurate Handwritten Word Recognition for Hindi and Bangla, National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2017 [PDF]

        • Praveen Krishnan and C.V Jawahar - Matching Handwritten Document Images, The 14th European Conference on Computer Vision (ECCV) – Amsterdam, The Netherlands, 2016. [PDF]

        • Praveen Krishnan,  Kartik Dutta and C.V Jawahar - Deep Feature Embedding for Accurate Recognition and Retrieval of Handwritten Text, 15th International Conference on Frontiers in Handwriting Recognition, Shenzhen, China (ICFHR), 2016. [PDF]

        • Anshuman Majumdar, Praveen Krishnan and C.V. Jawahar - Visual Aesthetic Analysis for Handwritten Document Images,15th International Conference on Frontiers in Handwriting Recognition, Shenzhen, China (ICFHR), 2016. [PDF]

        • Praveen Krishnan, Naveen Sankaran, Ajeet Kumar Singh and C. V. Jawahar - Towards a Robust OCR System for Indic Scripts Proceedings of the 11th IAPR International Workshop on Document Analysis Systems, 7-10 April 2014, Tours-Loire Valley, France. [PDF]

        • Praveen Krishnan and C V Jawahar - Bringing Semantics in Word Image Retrieval Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), 25-28 Aug. 2013, Washington DC, USA. [PDF]

        • Praveen Krishnan, Ravi Sekhar, C V Jawahar - Content Level Access to Digital Library of India Pages Proceedings of the 8th Indian Conference on Vision, Graphics and Image Processing (ICVGIP), 16-19 Dec. 2012, Bombay, India. [PDF]


        Downloads

        thesis

        Center for Visual Information Technology (CVIT)