
Bridging Perception and Reasoning in Table Understanding: The Path from Recognition to Trustworthy and Explainable AI


Sachin Raja

Abstract

This doctoral research presents a comprehensive and multi-faceted investigation into automated table understanding, arguing that robust and reliable solutions demand an approach that evolves from foundational structural parsing to address the pressing real-world requirements of data privacy and auditable reasoning. Tables are information-rich, structured objects that serve as a cornerstone for conveying complex data, yet their automated parsing is a formidable, long-standing challenge in document intelligence. The core of this challenge lies in Table Structure Recognition (TSR), the process of transforming a table image into a structured, machine-readable format. The difficulty is rooted in the immense visual diversity of tables, with complexities such as spanning cells, multi-line text, and the absence of ruling lines often causing traditional and early deep learning methods to fail. This body of work charts a clear research trajectory that begins with the development of a novel framework for TSR, TabStruct-Net, and progressively refines this methodology to handle increasing visual complexity. The research then pivots to address critical non-functional requirements, pioneering TabGuard, a novel framework for privacy-preserving TSR, and finally extends its scope from structure to trusted reasoning by introducing EviFiVQA, a benchmark for financial Visual Question Answering (VQA) that establishes evidence localization as a core tenet of auditable AI. This journey from pixels to privacy and proof marks a significant contribution to the field.

The core methodology advanced throughout this research is anchored in a powerful two-step paradigm that mirrors human cognitive processes: a top-down decomposition followed by a bottom-up reconstruction. In the top-down phase, the table image is decomposed into its fundamental constituent parts, the individual table cells, through an object detection model. In the bottom-up phase, the global table structure is reconstructed by learning the spatial and logical associations between the detected cells. A cornerstone of this research is the novel insight that TSR performance can be dramatically improved by encoding human intuition about table structure directly into the learning objective. This was achieved through a series of innovative, cognitive-inspired loss functions that act as structural regularizers, including an Alignment Loss to enforce a grid-like structure, a Continuity Loss to ensure adjacent cell boundaries are contiguous, and an Overlapping Loss to penalize spatial conflicts. This approach is marked by a clear architectural evolution, beginning with TabStruct-Net, which combined a modified Mask R-CNN with a Dynamic Graph Convolutional Neural Network, and culminating in TabStruct-Net V2, which introduced a Hierarchical Local-Attention Vision Transformer (HLVIT) backbone and a highly efficient self-attention layer to achieve state-of-the-art performance and scalability.
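The abstract names the three structural regularizers but does not give their formulas. The following is a minimal sketch of how such penalties could be computed on predicted cell boxes, under stated assumptions: the box layout `(x1, y1, x2, y2, row, col)`, the function names, and the specific penalty definitions are all hypothetical illustrations, not the formulation used in TabStruct-Net.

```python
from itertools import combinations

# Hypothetical cell layout for illustration: (x1, y1, x2, y2, row, col).

def alignment_loss(cells):
    """Mean absolute deviation of left edges from their column's mean left edge."""
    by_col = {}
    for x1, _, _, _, _, col in cells:
        by_col.setdefault(col, []).append(x1)
    total, count = 0.0, 0
    for xs in by_col.values():
        mean = sum(xs) / len(xs)
        total += sum(abs(x - mean) for x in xs)
        count += len(xs)
    return total / count if count else 0.0

def continuity_loss(cells):
    """Mean horizontal mismatch between a cell's right edge and the left
    edge of its right-hand neighbour in the same row."""
    edges = {(r, c): (x1, x2) for x1, _, x2, _, r, c in cells}
    gaps = [abs(edges[(r, c + 1)][0] - x2)
            for (r, c), (_, x2) in edges.items() if (r, c + 1) in edges]
    return sum(gaps) / len(gaps) if gaps else 0.0

def overlap_loss(cells):
    """Total pairwise intersection area between cell boxes."""
    total = 0.0
    for a, b in combinations(cells, 2):
        w = min(a[2], b[2]) - max(a[0], b[0])
        h = min(a[3], b[3]) - max(a[1], b[1])
        if w > 0 and h > 0:
            total += w * h
    return total

# A clean 2x2 grid, and the same grid with one mispredicted cell.
clean = [(0, 0, 10, 5, 0, 0), (10, 0, 20, 5, 0, 1),
         (0, 5, 10, 10, 1, 0), (10, 5, 20, 10, 1, 1)]
noisy = clean[:3] + [(8, 5, 20, 10, 1, 1)]

for name, grid in (("clean", clean), ("noisy", noisy)):
    print(name, alignment_loss(grid), continuity_loss(grid), overlap_loss(grid))
```

All three penalties are zero on the clean grid and positive once a cell drifts out of place, which is the sense in which such terms can act as structural regularizers during training.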

 

Year of completion:  April 2026
 Advisor : Prof. C.V. Jawahar



     

    Lip-to-Speech Synthesis


    Rudrabha Mukhopadhyay

    Abstract

    This thesis explores the development and advancement of lip-to-speech synthesis techniques, addressing the challenge of generating speech directly from visual lip movements. Unlike text-to-speech systems, which rely on explicit linguistic information in the form of text tokens, lip-to-speech synthesis must interpret ambiguous visual cues, since similar lip shapes can produce different sounds. Inspired by the chronological advancements in text-to-speech synthesis, the research goals are divided into single-speaker lip-to-speech, where a separate model is trained for each speaker on a large amount of speaker-specific data, followed by multi-speaker approaches, which aim to train a single model that works for any speaker in the wild.

    The first work presented in this thesis addresses lip-to-speech generation with a large vocabulary in unconstrained settings, albeit with a model trained for a particular speaker. This work introduced a novel sequence-to-sequence model that leveraged spatio-temporal convolutional architectures to capture the fine-grained temporal dynamics of lip movements, together with a monotonic attention mechanism that more accurately aligned the visual features with the corresponding speech parameters. Testing on the LRS2 dataset showed a 24% improvement in intelligibility metrics over baseline methods. A new dataset was also released, providing sufficient speaker-specific data with a diverse vocabulary of around 5,000 words to support the development of accurate, speaker-specific models. While this approach showed promise, it was limited to single-speaker scenarios and did not scale effectively to sentence-level multi-speaker tasks, necessitating further research.

     

    Year of completion:  June 2025
     Advisor : Dr. C.V. Jawahar



       

      Efficient Physically Based Rendering with Analytic and Neural Approximations


      Aakash KT

      Abstract

      Path tracing is ubiquitous for photorealistic rendering of various real-world appearances. It follows the principles of light transport adapted from physics, which describe light propagation as a set of integral equations. These equations are stochastically evaluated by tracing light rays in virtual scenes. Such stochastic evaluations with ray-tracing form the bulk of the path tracing algorithm, which is widely used in the industry.

      The error of stochastic evaluations in path tracing decreases in inverse proportion to the square root of the number of iterations, so halving the error requires roughly four times as many iterations. This, coupled with the fact that the underlying integrals are often complex and high dimensional, results in high computational cost. Research efforts have thus largely focused on accelerating path tracing by improving the stochastic sampling processes. However, it is interesting to look at efficient analytic approximations obtained by making reasonable assumptions about the nature of these light transport integrals. Such analytic methods have the potential to achieve zero variance at the outset. In practice, they are often used in conjunction with stochastic methods, thereby achieving lower variance than their fully stochastic counterparts.
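The square-root convergence behaviour of stochastic evaluation can be observed on a toy problem. The sketch below is illustrative only: it estimates a one-dimensional integral with uniform sampling rather than a light transport integral, and the integrand, sample counts, and function names are assumptions chosen for the demonstration.

```python
import math
import random

def mc_estimate(f, n, rng):
    """Monte Carlo estimate of the integral of f over [0, 1] from n uniform samples."""
    return sum(f(rng.random()) for _ in range(n)) / n

def rms_error(f, exact, n, trials, rng):
    """Root-mean-square error of the n-sample estimator over repeated trials."""
    sq = 0.0
    for _ in range(trials):
        sq += (mc_estimate(f, n, rng) - exact) ** 2
    return math.sqrt(sq / trials)

def integrand(x):
    return math.sin(math.pi * x)

EXACT = 2.0 / math.pi  # closed-form value of the integral over [0, 1]

rng = random.Random(0)  # fixed seed so the run is repeatable
errors = {n: rms_error(integrand, EXACT, n, 200, rng) for n in (100, 400, 1600)}

for n, err in errors.items():
    print(f"n={n:5d}  rms error={err:.5f}")
```

Each fourfold increase in the sample count should roughly halve the measured error, which is why variance reduction and analytic alternatives are attractive when the integrand is expensive to sample.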

      The primary focus of this thesis is to develop new (semi-)analytic methods and improve existing ones to accelerate direct lighting computations in path tracing. We base our research on the theory of Linearly Transformed Cosines (LTC) applied to direct lighting from area lights. The LTC method produces plausible renderings by building on the principles of light transport from the ground up and has proved useful for tasks other than real-time rendering. We make the following three contributions that either build on LTCs or improve on them.

      We first explore fully-analytic direct lighting for arbitrarily shaped area lights, built on LTCs at the core. Due to its assumptions, the LTC method can only handle polygonal area lights. Furthermore, rendering shadows with LTCs requires stochastic evaluations. Our contribution relaxes these assumptions, enabling fully-analytic direct lighting with shadows from an arbitrarily shaped area light. We show that our method achieves plausible and noise-free renderings compared to semi-analytic LTCs and ground-truth ray tracing, given an equal compute budget.
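The restriction to polygonal lights follows from the closed-form step at the heart of the published LTC method: once the reflectance lobe has been mapped to a clamped cosine, its integral over a polygon can be evaluated edge by edge with no sampling. The sketch below shows only that closed-form step (often called Lambert's formula) for a shading point at the origin with normal +z. It does not include the linear transformation itself or any of the thesis's extensions, and the function names are hypothetical. It assumes the polygon lies entirely above the horizon and has no degenerate edges.

```python
import math

def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def polygon_form_factor(vertices):
    """Cosine-weighted fraction of the upper hemisphere covered by a polygon,
    seen from the origin with surface normal +z (1.0 = whole hemisphere).

    Assumes all vertices have z > 0 and consecutive vertices are not collinear
    with the origin."""
    dirs = [_normalize(v) for v in vertices]
    total = 0.0
    for i in range(len(dirs)):
        a, b = dirs[i], dirs[(i + 1) % len(dirs)]
        cos_theta = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
        theta = math.acos(cos_theta)          # arc length of the projected edge
        edge_normal = _normalize(_cross(a, b))
        total += theta * edge_normal[2]       # component along the surface normal
    return abs(total) / (2.0 * math.pi)       # abs() makes the result winding-independent

# A 2x2 square light, parallel to the surface, one unit above the shading point.
square = [(1, 1, 1), (-1, 1, 1), (-1, -1, 1), (1, -1, 1)]
print(polygon_form_factor(square))
```

Because the result is exact for a uniformly emitting, unoccluded polygon, it carries no noise. Shadows and non-polygonal shapes fall outside this formula, which is the gap the contribution above addresses.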

       

      Year of completion:  June 2025
       Advisor : Dr. P. J. Narayanan


      3D Shape Analysis: Reconstruction and Classification


      Jinka Sai Sagar

      Abstract

      The reconstruction and analysis of 3D objects by computational systems has been an intensive and long-standing research problem in the graphics and computer vision communities. Traditional acquisition systems are largely restricted to studio setups that require multiple synchronized and calibrated cameras. The advent of active depth sensors, such as time-of-flight and structured-light sensors, made 3D acquisition feasible. This advancement has paved the way for many research problems, such as 3D object localization, recognition, classification, and reconstruction, which demand sophisticated and elegant solutions to match their ever-growing applications. 3D human body reconstruction, in particular, has wide applications such as virtual mirrors and gait analysis. Lately, with the advent of deep learning, 3D reconstruction from monocular images has garnered significant interest in the research community, as it can be applied to in-the-wild settings.

      Secondly, we propose PeeledHuman - a novel shape representation of the human body that is robust to self-occlusions. PeeledHuman encodes the human body as a set of Peeled Depth and RGB maps in 2D, obtained by performing ray-tracing on the 3D body model and extending each ray beyond its first intersection. We learn these Peeled maps in an end-to-end generative adversarial fashion using our novel framework - PeelGAN. The PeelGAN enables us to predict shape and color of the 3D human in an end-to-end fashion at significantly low inference rates.
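The idea of extending each ray beyond its first intersection can be illustrated on a toy scene. The sketch below is an assumption-laden illustration, not the PeeledHuman pipeline: it uses spheres instead of a body mesh, orthographic rays along +z, depth only (no RGB), and hypothetical function names. Layer k of the output stores the depth of the k-th surface each ray crosses.

```python
import math

def ray_sphere_depths(x, y, sphere):
    """Depths at which an orthographic ray through (x, y) along +z enters
    and leaves a sphere given as (cx, cy, cz, radius)."""
    cx, cy, cz, r = sphere
    d2 = (x - cx) ** 2 + (y - cy) ** 2
    if d2 >= r * r:
        return []                     # ray misses (or only grazes) the sphere
    half = math.sqrt(r * r - d2)
    return [cz - half, cz + half]

def peeled_depth_maps(spheres, size, extent, layers):
    """Return `layers` depth maps of shape size x size over [-extent, extent]^2.
    Entry [k][row][col] is the k-th intersection depth along that pixel's ray,
    or None if the ray has fewer than k + 1 intersections."""
    step = 2.0 * extent / size
    maps = [[[None] * size for _ in range(size)] for _ in range(layers)]
    for row in range(size):
        y = -extent + (row + 0.5) * step
        for col in range(size):
            x = -extent + (col + 0.5) * step
            hits = sorted(d for s in spheres for d in ray_sphere_depths(x, y, s))
            for k, depth in enumerate(hits[:layers]):
                maps[k][row][col] = depth
    return maps

# Two unit spheres, one behind the other: the rear one is fully occluded
# in an ordinary depth map but appears in peel layers 2 and 3.
scene = [(0.0, 0.0, 2.0, 1.0), (0.0, 0.0, 5.0, 1.0)]
maps = peeled_depth_maps(scene, size=5, extent=2.0, layers=4)
print([maps[k][2][2] for k in range(4)])
```

A conventional depth map corresponds to layer 0 alone. The later layers are what make self-occluded surfaces recoverable from a fixed set of 2D maps.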

       

      Year of completion:  May 2023
       Advisor : Dr. Avinash Sharma


        Surrogate Approximations for Similarity Measures


        Nagendar G

        Abstract

        This thesis targets the problem of surrogate approximations for similarity measures, with the aim of improving their performance in various applications. We present surrogate approximations for the popular dynamic time warping (DTW) distance, canonical correlation analysis (CCA), and the Intersection-over-Union (IoU), PCP, and PCKh measures. For DTW and CCA, our surrogate approximations are based on their corresponding definitions. For the IoU, PCP, and PCKh measures, we present a surrogate approximation using neural networks.

        First, we propose a linear approximation for the naïve DTW distance, speeding up its computation by learning the optimal alignment from the training data. In our next contribution, we propose a surrogate kernel approximation over CCA, which enables us to use CCA in the kernel framework and further improves its performance. In our final contribution, we propose a technique that uses neural networks to learn a surrogate loss function over the IoU, PCP, and PCKh measures. We validated the IoU loss on semantic segmentation models and the PCP and PCKh losses on human pose estimation models.
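For context, the naïve DTW distance is a dynamic program whose cost is quadratic in the sequence lengths, because it searches over all alignments. If an alignment is already known, the cost along it can be summed in linear time. The sketch below shows both pieces. The fixed-path function only illustrates why a known alignment removes the quadratic search; how the thesis actually learns alignments is not described in the abstract, and the function names and the absolute-difference cost are assumptions.

```python
def dtw_distance(a, b):
    """Naive DTW between two numeric sequences using absolute difference
    as the local cost. Runs in O(len(a) * len(b)) time."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] matched again
                                     cost[i][j - 1],      # b[j-1] matched again
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def aligned_distance(a, b, path):
    """Cost along a given warping path (list of (i, j) index pairs).
    Linear in the path length; an upper bound on dtw_distance for any
    valid path, and equal to it when the path is optimal."""
    return sum(abs(a[i] - b[j]) for i, j in path)

x = [1, 2, 3]
y = [1, 2, 2, 3]
path = [(0, 0), (1, 1), (1, 2), (2, 3)]  # hypothetical known alignment

print(dtw_distance(x, y), aligned_distance(x, y, path))
```

Any valid path gives a value no smaller than the true DTW distance, so the quality of a fixed-alignment surrogate depends on how close the supplied alignment is to the optimal one.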

         

        Year of completion:  November 2023
         Advisor : Dr. Avinash Sharma


          More Articles …

          1. Learning with Weak Supervision for Visual Scene Understanding
          2. Document Image Layout Segmentation and Applications
          3. Interpretation and Analysis of Deep Face Representations: Methods and Applications
          4. Image Factorization for Inverse Rendering
          Center for Visual Information Technology (CVIT)