
Interpretation and Analysis of Deep Face Representations: Methods and Applications

Thrupthi Ann John

Abstract

The rapid growth of deep neural network models in the face domain has led to their adoption in safety-critical applications. However, a crucial limitation hindering their widespread deployment is the lack of comprehensive understanding of how these models work and the inability to explain their decisions. Explainability is essential for ensuring the correctness, reliability, and fairness of AI systems, and there is a growing recognition of its importance across AI applications. Despite the significance of explainability, most current methods are designed for general object recognition tasks and cannot be directly applied to the face domain. Faces are highly structured objects, and face tasks often involve fine-grained details, making them unique and distinct from general object recognition. This thesis aims to bridge the gap in explainability literature for the face domain by providing novel methods for interpreting and analyzing deep face representations.

In this thesis, we embark on a comprehensive journey of interpreting and analyzing deep face representations to uncover the underlying mechanisms behind DNN-based face-processing models. We first visualize face representations and introduce methods to identify functional concepts in face representations using 'cross-task aware filters' (CRAFTs). Our approach includes an efficient task-aware pruning method using CRAFTs. We also present state-of-the-art Canonical Saliency Maps (CSM) to pinpoint critical input features. We thoroughly analyze deep face representations to understand the learned features and their functional relevance in different face tasks. To further enhance our understanding of human attention in the context of driving behavior, we investigate driver gaze patterns and develop DashGaze, a large-scale naturalistic driver gaze dataset. Using this dataset, we propose an innovative calibration-free driver gaze estimation algorithm that provides valuable information for studying and predicting driver behavior.
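The Canonical Saliency Maps introduced in the thesis have their own formulation (saliency presented on a canonical face); purely to illustrate the underlying idea of attributing a face model's output to input pixels, the sketch below computes a plain gradient-based saliency map for a stand-in classifier. The model, input size, and architecture are placeholder assumptions, not the method from the thesis.

```python
import torch
import torch.nn as nn

# Placeholder face classifier; any differentiable face model could stand in here.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10)
)
model.eval()

face = torch.rand(1, 3, 112, 112, requires_grad=True)  # hypothetical aligned face crop
score = model(face)[0].max()   # score of the predicted identity/attribute
score.backward()               # gradient of the score w.r.t. the input pixels

# Pixel-wise saliency: gradient magnitude, collapsed over colour channels.
saliency = face.grad.abs().max(dim=1)[0]  # shape (1, 112, 112)
```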

The comprehensive overview, experimental studies, and analyses presented in this thesis contribute to the wider adoption of explainability methods in face-processing tasks, enabling safer and more trustworthy deployment of deep face algorithms in real-world applications. By shedding light on the inner workings of these models and their biases, this work paves the way for the responsible and ethical development of AI technologies in the face domain.

 

Year of completion: December 2024
Advisors: Prof. C V Jawahar, Prof. Vineeth N Balasubramanian


    Image Factorization for Inverse Rendering


    Saurabh Saini

    Abstract

Inverse Rendering is a core Computer Vision problem, as it involves the complete decomposition of an image into its constituent atomic components. These components can be analyzed on their own, or suitably modified and recombined to solve the required image analysis task or produce the required generative content. Rather than aiming for full decomposition, many applications require decomposition into only a few factors, which are themselves simple combinations of the underlying atomic components. This makes image factorization a critical first step in several computer vision and image processing applications. This factorization can be either optically motivated, as in reflectance-shading decomposition, white balancing, and illumination spectra separation, or semantically motivated, as in style-content disentanglement and foreground-background matting.

In this thesis, we focus on the former and present several image factorization solutions with the aim of using them for downstream image-based rendering applications. Initially, we assume only Lambertian reflection under the classical image formation model inspired by the Retinex theory. Our first solution in this category requires multiple images of the scene as input, a requirement we then relax in our second solution, which works on a single input image. Afterwards, we propose a novel image formation model based on the specularity of the image content and provide two solutions, using the low-light enhancement problem as the vehicle for empirical validation. Towards the end, a novel prior induction technique based on learnable concepts is also presented, and its utility is shown by improving the results of pre-existing state-of-the-art image decomposition networks. We conclude with a summary, limitations, future research directions, and possible additional applications. The thesis is organized into four units, discussing, respectively, the problem definition and significance; the Lambertian-reflection-based intrinsic image decomposition problem; specularity-respecting novel illumination factorization methods; and finally concept-based model analysis and conclusions. We hope that the problems and solutions discussed in this thesis define and highlight the importance of the image factorization step in multiple vision tasks and pique the reader's interest in this research problem for image generation and beyond.
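For context, the classical Retinex-style Lambertian formation model referenced above is usually written as a per-pixel product of reflectance and shading, which becomes additive in the log domain (a standard textbook formulation, not an equation quoted from the thesis):

    I(x) = R(x) · S(x),  so  log I(x) = log R(x) + log S(x),

where I is the observed image, R the albedo/reflectance, and S the shading at pixel x; intrinsic image decomposition aims to recover R and S from I alone.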

Year of completion: August 2024
Advisor: Jayanthi Sivaswamy


      Modelling Structural Variations in Brain Aging


      Alphin J Thottupattu

      Abstract

The aging of the brain is a complex process shaped by a combination of genetic factors and environmental influences, exhibiting variations from one population to another. This thesis investigates normative population-specific structural changes in the brain and explores variations in aging-related changes across different populations. The study gathers data from diverse groups, constructs individual models, and compares them through a thoughtfully designed framework. The thesis proposes a comprehensive pipeline covering data collection, modeling, and the creation of an analysis framework, and finally offers an illustrative cross-population analysis, shedding light on the comparative aspects of brain aging.

In our study, the Indian population is considered as the reference, and an effort is made to address gaps within this population through the creation of a population-specific database, an atlas, and an aging model to facilitate the study. Due to the challenges in data collection, we adopted a cross-sectional approach. A cross-sectional brain image database is meticulously curated for the Indian population. A sub-cortical structural atlas is created for the young population, enabling us to establish a reference structural segmentation map for the Indian population. Age-specific, gender-balanced, high-resolution scans were collected to create the first Indian brain aging model. Choosing cross-sectional data collection made sense because data from other populations were also mostly collected in a cross-sectional manner. Using the in-house database for the Indian population and publicly available datasets for other populations, our inter-population analysis compares aging trends across Indian, Caucasian, Chinese, and Japanese populations.

Developing an aging model from cross-sectional data presents challenges in distinguishing between cross-sectional variations and normative trends. In response, we propose a method specifically tailored for cross-sectional data. We present a unique metric within our comprehensive aging comparison framework to differentiate between temporal and global anatomical variations across populations. This thesis details a comprehensive process to compare aspects of healthy aging across these diverse groups, ultimately concluding with a pilot study across four different populations. The framework can be readily adapted to study various research problems, exploring changes associated with different populations while considering factors beyond ethnicity, such as lifestyle, education, and socio-economic factors. Similar analysis frameworks and studies with multiple modalities and larger sample sizes will contribute to deriving more conclusive results.
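The aging model in the thesis is built from cross-sectional image data with a dedicated method for separating inter-subject variation from temporal trends; purely as an illustration of the simpler idea of a normative cross-sectional trajectory, the sketch below fits a structure-volume-versus-age curve with ordinary least squares. The synthetic data and the quadratic fit are assumptions for the example, not the thesis method.

```python
import numpy as np

# Synthetic cross-sectional data: one volume measurement per subject (hypothetical).
rng = np.random.default_rng(0)
ages = rng.uniform(20, 80, size=200)                             # subject ages in years
volumes = 4.0 - 0.015 * (ages - 20) + rng.normal(0, 0.1, 200)    # e.g. a sub-cortical volume in ml

# Normative trajectory: quadratic trend of volume against age.
coeffs = np.polyfit(ages, volumes, deg=2)
trend = np.poly1d(coeffs)

print("expected volume at age 30:", trend(30))
print("expected volume at age 70:", trend(70))
```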

Year of completion: May 2024
Advisor: Jayanthi Sivaswamy

        Towards Machine-understanding of Document Images


        Minesh Mathew

        Abstract

Imparting machines the capability to understand documents like humans do is an AI-complete problem, since it involves multiple sub-tasks such as reading unstructured and structured text, understanding graphics and natural images, interpreting visual elements such as tables and plots, and parsing the layout and logical structure of the whole document. Except for a small percentage of documents in structured electronic formats, the majority of documents used today, such as documents in physical mediums, born-digital documents in image formats, and electronic documents like PDFs, are not readily machine readable. A paper-based document can easily be converted into a bitmap image using a flatbed scanner or a digital camera. Consequently, machine understanding of documents in practice requires algorithms and systems that can process document images, that is, digital images of documents.

The successful application of deep learning-based methods and the use of large-scale datasets have significantly improved the performance of the various sub-tasks that constitute the larger problem of machine understanding of document images. Deep learning-based techniques have been applied successfully to the detection and recognition of text, and to the detection and recognition of document sub-structures such as forms and tables. However, owing to the diversity of documents in terms of language, modality of the text present (typewritten, printed, handwritten, or born-digital), images and graphics (photographs, computer graphics, tables, visualizations, and pictograms), layout, and other visual cues, building generic solutions to the problem of machine understanding of document images is a challenging task. In this thesis, we address some of the challenges in this space, such as text recognition in low-resource languages, information extraction from historic and handwritten collections, and multimodal modeling of complex document images. Additionally, we introduce new tasks that call for a top-down perspective of document image understanding, in which a document image is understood as a whole rather than in parts, different from the mainstream trend where the focus has been on solving various bottom-up tasks. Most of the existing tasks in Document Image Analysis (DIA) deal with independent bottom-up tasks that aim to obtain a machine-readable description of certain pre-defined document elements at various abstractions, such as text tokens or tables. This thesis motivates a purpose-driven DIA wherein a document image is analyzed dynamically, subject to a specific requirement set by a human user or an intelligent agent.

We first consider the problem of making document images printed in low-resource languages machine readable using OCR, thereby making these documents AI-ready. To this end, we propose an end-to-end neural network model that can directly transcribe a word or line image from a document to the corresponding Unicode transcription. We analyze how the proposed setup overcomes many challenges in text recognition for Indic languages. Results of our synthetic-to-real transfer learning experiments for text recognition demonstrate that models pre-trained on synthetic data and further fine-tuned on a portion of the real data perform as well as models trained purely on real data. For more than ten languages that previously had no public datasets for printed text recognition, we introduce a new dataset with more than one million word images in total. We further conduct an empirical study comparing different end-to-end neural network architectures for word and line recognition of printed text.

Another significant contribution of this thesis is the introduction of new tasks that require a holistic understanding of document images. Different from existing tasks in DIA that attempt to solve independent bottom-up tasks, we motivate a top-down perspective of DIA that requires a holistic understanding of the image and purpose-driven information extraction. To this end, we propose two tasks, DocVQA and InfographicVQA, fashioned along the lines of Visual Question Answering (VQA) in computer vision. For DocVQA, we show results using multiple strong baselines adapted from existing models for VQA and QA problems. For InfographicVQA, we propose a transformer-based, BERT-like model that jointly models multimodal input spanning vision, language, and layout. We conduct open challenges for both tasks, attracting hundreds of submissions so far.

Next, we work on the problem of information extraction from a document image collection. Recognizing text from historical and/or handwritten manuscripts is a major challenge for information extraction from such collections. Similar to open-domain QA in NLP, we propose a new task in the context of document images that seeks to answer natural language questions asked over collections of manuscripts. We propose a two-stage, retrieval-based approach that uses deep features of word images and textual words. Our approach is recognition-free and returns image snippets as answers to the questions. Although our approach is recognition-free and consequently oblivious to the semantics of the text in the documents, it can look for documents or document snippets that are lexically similar to the question. We show that our approach is a reasonable alternative when using text-based QA models is infeasible due to the difficulty of recognizing text in the document images.
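As a loose illustration of one ingredient of the recognition-free retrieval idea described above (similarity-based ranking in a shared embedding space, not the two-stage pipeline from the thesis), the sketch below ranks word-image embeddings against a query-word embedding by cosine similarity; the embedding dimension and random features are placeholders.

```python
import numpy as np

def rank_word_images(query_emb: np.ndarray, word_image_embs: np.ndarray, top_k: int = 5):
    """Return indices of the word images closest to the query in the shared space."""
    q = query_emb / np.linalg.norm(query_emb)
    w = word_image_embs / np.linalg.norm(word_image_embs, axis=1, keepdims=True)
    scores = w @ q                      # cosine similarity of every word image to the query
    return np.argsort(-scores)[:top_k]  # best-matching word images first

# Hypothetical 128-d embeddings for a textual query word and 1000 word images.
rng = np.random.default_rng(0)
query = rng.normal(size=128)
gallery = rng.normal(size=(1000, 128))
print(rank_word_images(query, gallery))
```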

Year of completion: January 2024
Advisor: C V Jawahar

          Surrogate Approximations for Similarity Measures


          Nagender G

          Abstract

This thesis targets the problem of surrogate approximations for similarity measures to improve their performance in various applications. We present surrogate approximations for the popular dynamic time warping (DTW) distance, canonical correlation analysis (CCA), Intersection-over-Union (IoU), PCP, and PCKh measures. For DTW and CCA, our surrogate approximations are based on their corresponding definitions; for the IoU, PCP, and PCKh measures, we present a surrogate approximation using neural networks.

First, we propose a linear approximation of the naïve DTW distance, speeding up the DTW computation by learning the optimal alignment from training data. In our next contribution, we propose a surrogate kernel approximation of CCA, which enables us to use CCA in the kernel framework and further improves its performance. In our final contribution, we propose a surrogate approximation technique that uses neural networks to learn a surrogate loss function over the IoU, PCP, and PCKh measures. For the IoU loss, we validate our method on semantic segmentation models; for the PCP and PCKh losses, we validate it on human pose estimation models.
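The exact surrogate architectures are detailed in the thesis; as a generic illustration of learning a differentiable surrogate for a metric such as IoU, the sketch below trains a small network to regress the IoU of predicted and ground-truth masks, after which the frozen network could be used as a loss term. The architecture, mask size, and random training data are placeholder assumptions.

```python
import torch
import torch.nn as nn

def iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """True IoU of thresholded masks, per sample (non-differentiable w.r.t. the threshold)."""
    p = (pred > 0.5).float()
    inter = (p * gt).sum(dim=(1, 2))
    union = ((p + gt) > 0).float().sum(dim=(1, 2)).clamp(min=1)
    return inter / union

# Tiny surrogate network: maps a (pred, gt) mask pair to a predicted IoU value.
surrogate = nn.Sequential(
    nn.Flatten(), nn.Linear(2 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid()
)
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for _ in range(100):  # train the surrogate to match the true IoU on random mask pairs
    pred = torch.rand(16, 32, 32)
    gt = (torch.rand(16, 32, 32) > 0.5).float()
    target = iou(pred, gt)
    est = surrogate(torch.cat([pred.unsqueeze(1), gt.unsqueeze(1)], dim=1)).squeeze(1)
    loss = nn.functional.mse_loss(est, target)
    opt.zero_grad(); loss.backward(); opt.step()

# Once trained and frozen, 1 - surrogate(pred, gt) can serve as a differentiable IoU-like loss.
```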

           

Year of completion: March 2023
Advisor: C V Jawahar


            More Articles …

            1. Epsilon Focus Photography: A Study of Focus, Defocus and Depth-of-field
            2. Optimization for and by Machine Learning
            3. Learning Representations for Word Images
            4. Geometry-aware methods for efficient and accurate 3D reconstruction