Bridging Perception and Reasoning in Table Understanding: The Path from Recognition to Trustworthy and Explainable AI
Sachin Raja
Abstract
This doctoral research presents a comprehensive investigation into automated table understanding, arguing that robust and reliable solutions demand an approach that evolves from foundational structural parsing to address the pressing real-world requirements of data privacy and auditable reasoning. Tables are information-rich, structured objects that serve as a cornerstone for conveying complex data, yet their automated parsing remains a formidable, long-standing challenge in document intelligence. The core of this challenge lies in Table Structure Recognition (TSR), the process of transforming a table image into a structured, machine-readable format. The difficulty is rooted in the immense visual diversity of tables: complexities such as spanning cells, multi-line text, and the absence of ruling lines often cause traditional and early deep learning methods to fail. This body of work charts a clear research trajectory that begins with the development of a novel framework for TSR, TabStruct-Net, and progressively refines this methodology to handle increasing visual complexity. The research then pivots to critical non-functional requirements, pioneering TabGuard, a framework for privacy-preserving TSR, and finally extends its scope from structure to trusted reasoning by introducing EviFiVQA, a benchmark for financial Visual Question Answering (VQA) that establishes evidence localization as a core tenet of auditable AI. This journey from pixels to privacy and proof marks a significant contribution to the field.
The core methodology advanced throughout this research is anchored in a powerful two-step paradigm that mirrors human cognitive processes: a top-down decomposition followed by a bottom-up reconstruction. In the top-down phase, the table image is decomposed into its fundamental constituent parts—the individual table cells—through an object detection model. In the bottom-up phase, the global table structure is reconstructed by learning the spatial and logical associations between the detected cells. A cornerstone of this research is the novel insight that TSR performance can be dramatically improved by encoding human intuition about table structure directly into the learning objective. This was achieved through a series of innovative, cognitive-inspired loss functions that act as structural regularizers, including an Alignment Loss to enforce a grid-like structure, a Continuity Loss to ensure adjacent cell boundaries are contiguous, and an Overlapping Loss to penalize spatial conflicts. This approach is marked by a clear architectural evolution, beginning with TabStruct-Net, which combined a modified Mask R-CNN with a Dynamic Graph Convolutional Neural Network, and culminating in TabStruct-Net V2, which introduced a Hierarchical Local-Attention Vision Transformer (HLVIT) backbone and a highly efficient self-attention layer to achieve state-of-the-art performance and scalability.
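To make the idea of structural regularizers concrete, the sketch below illustrates two of the losses described above in plain NumPy: an alignment term that penalizes the spread of top and bottom edges among cells predicted to share a row, and an overlap term that penalizes pairwise intersection area between cell boxes. This is a minimal, hypothetical formulation for illustration only; the box format (`[x1, y1, x2, y2]`), the variance-based alignment measure, and the function names are assumptions, not the thesis's actual implementation (which operates inside a differentiable detection pipeline).

```python
import numpy as np

def alignment_loss(boxes, row_ids):
    """Mean within-row variance of top (y1) and bottom (y2) edges.
    boxes: (N, 4) float array of [x1, y1, x2, y2] cell boxes.
    row_ids: (N,) int array assigning each cell to a row.
    Zero when every cell in a row shares the same top and bottom edge.
    (Hypothetical formulation; the thesis's exact loss may differ.)"""
    rows = np.unique(row_ids)
    loss = 0.0
    for r in rows:
        members = boxes[row_ids == r]
        # Spread of the horizontal edges within this row.
        loss += members[:, 1].var() + members[:, 3].var()
    return loss / max(len(rows), 1)

def overlap_loss(boxes):
    """Sum of pairwise intersection areas between predicted cell boxes.
    Zero when no two cells overlap, growing with the conflicting area."""
    total = 0.0
    n = len(boxes)
    for i in range(n):
        for j in range(i + 1, n):
            w = min(boxes[i][2], boxes[j][2]) - max(boxes[i][0], boxes[j][0])
            h = min(boxes[i][3], boxes[j][3]) - max(boxes[i][1], boxes[j][1])
            if w > 0 and h > 0:  # boxes intersect
                total += w * h
    return total
```

For example, two perfectly aligned, abutting cells in one row incur zero loss under both terms, while vertically jittered or overlapping predictions are penalized in proportion to the misalignment or conflicting area; in training, such terms would be added to the detector's usual regression and classification objectives.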
| Year of completion: | April 2026 |
| Advisor: | Prof. C.V. Jawahar |