
Face Sketch Generation and Recognition


Kushal Kumar Jain

Abstract

The field of sketch generation and recognition has seen significant advances through the innovative application of generative models. This thesis presents a comprehensive exploration of face stylization, artistic portrait generation, and forensic sketch synthesis, leveraging the power of state-of-the-art generative models like StyleGAN and Stable Diffusion. Our work addresses key challenges in preserving identity, accommodating various poses, and bridging the modality gap between sketches and photographs. Through three interconnected studies, we demonstrate significant advances in generating high-quality sketches and improving forensic applications.

We begin by introducing a novel approach to face cartoonization that preserves identity and accommodates various poses. Unlike conditional-GAN methods, our technique utilizes an encoder to capture pose and identity information, generating embeddings within StyleGAN’s latent space. This approach uniquely adapts a pre-trained StyleGAN model, originally designed for realistic facial images, to produce cartoonized outputs without requiring a dedicated fine-tuned model.
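To make this encoder-to-latent pipeline concrete, below is a minimal PyTorch sketch of the idea: an encoder maps a face image to a code in StyleGAN's extended W+ latent space, which a frozen pre-trained generator then renders. The `IdentityPoseEncoder` name, architecture, and dimensions are illustrative stand-ins, not the thesis's actual model.

```python
# Minimal sketch: encode identity/pose into a W+ code for a frozen StyleGAN.
# The encoder below is a toy stand-in; a real one would use a deep backbone.
import torch
import torch.nn as nn

class IdentityPoseEncoder(nn.Module):
    """Maps a face image to a W+ style code of shape (num_layers, style_dim)."""
    def __init__(self, num_layers: int = 18, style_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_layers * style_dim)
        self.num_layers, self.style_dim = num_layers, style_dim

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        w_plus = self.head(self.backbone(img))
        return w_plus.view(-1, self.num_layers, self.style_dim)

encoder = IdentityPoseEncoder()
face = torch.randn(1, 3, 256, 256)   # placeholder for an aligned face image
w_plus = encoder(face)               # (1, 18, 512) code in StyleGAN's W+ space
# w_plus would then drive a frozen, pre-trained StyleGAN generator whose
# output is steered toward the cartoon style, avoiding a fine-tuned model.
```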

Building upon this foundation, we present Portrait Sketching StyleGAN (PS-StyleGAN), a style transfer approach tailored for portrait sketch synthesis. PS-StyleGAN leverages StyleGAN’s semantic W+ latent space to generate portrait sketches while allowing meaningful edits such as pose and expression alterations. By introducing Attentive Affine transform blocks and a specialized training strategy, we achieve high-quality sketch generation without fine-tuning StyleGAN itself. This method demonstrates superior performance over current state-of-the-art techniques, requiring only a small number of paired examples and minimal training time.
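The sketch below is a hedged guess at the shape of an Attentive Affine transform block: attention over the W+ layers predicts per-layer scale and shift parameters that restyle the latent code while the generator stays frozen. The block design is our illustration, not the published architecture.

```python
# Hypothetical attentive affine block: restyle a W+ code via self-attention.
import torch
import torch.nn as nn

class AttentiveAffine(nn.Module):
    def __init__(self, style_dim: int = 512, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(style_dim, num_heads, batch_first=True)
        self.to_scale = nn.Linear(style_dim, style_dim)
        self.to_shift = nn.Linear(style_dim, style_dim)

    def forward(self, w_plus: torch.Tensor) -> torch.Tensor:
        # w_plus: (batch, num_layers, style_dim); attention mixes information
        # across layers so each layer's affine params see the whole code.
        ctx, _ = self.attn(w_plus, w_plus, w_plus)
        return w_plus * (1 + self.to_scale(ctx)) + self.to_shift(ctx)

w_sketch = AttentiveAffine()(torch.randn(1, 18, 512))  # restyled W+ code
```

Because only such small blocks are trained, an approach of this shape needs just a few paired photo-sketch examples rather than fine-tuning the full generator.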

Finally, we address the challenging task of forensic sketch-to-mugshot matching with CLIP4Sketch, a novel approach that uses diffusion models to generate diverse sketch images. By combining CLIP and AdaFace embeddings of reference mugshots with textual style descriptions, we create a comprehensive dataset of sketches corresponding to mugshots. This synthetic data significantly improves the accuracy of sketch-to-mugshot matching in face recognition systems, outperforming training on limited real face-sketch data and on datasets produced by GAN-based methods.
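A simplified sketch of the conditioning step: embeddings of the reference mugshot (CLIP image and AdaFace identity) are fused with a CLIP text embedding of the desired sketch style into a single vector that conditions the diffusion model. The fusion scheme and embedding sizes here are assumptions for illustration, not the exact CLIP4Sketch design.

```python
# Illustrative fusion of identity, appearance, and style conditioning signals.
import torch

def build_condition(clip_img_emb: torch.Tensor,
                    adaface_id_emb: torch.Tensor,
                    clip_text_emb: torch.Tensor) -> torch.Tensor:
    """Fuse mugshot appearance (CLIP), identity (AdaFace), and a textual
    style description (CLIP text) into one diffusion conditioning vector."""
    parts = [clip_img_emb, adaface_id_emb, clip_text_emb]
    parts = [p / p.norm(dim=-1, keepdim=True) for p in parts]  # unit-normalize
    return torch.cat(parts, dim=-1)

cond = build_condition(torch.randn(1, 512),   # CLIP embedding of the mugshot
                       torch.randn(1, 512),   # AdaFace identity embedding
                       torch.randn(1, 512))   # CLIP text: e.g. "pencil sketch"
assert cond.shape == (1, 1536)                # fed to the denoising network
```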

Collectively, these contributions push the boundaries of sketch generation and recognition, offering promising applications in both the artistic and forensic domains.

 

Year of completion: May 2025
Advisor: Anoop Namboodiri

Related Publications


Downloads

thesis

Coreference Without Bells and Whistles


S Kawshik Manikantan

Abstract

Coreference resolution (CR) is the task of identifying text spans that refer to the same entity. It is a fundamental component of natural language understanding with applications in various downstream NLP tasks, such as question answering, knowledge graph construction, and summarization. Despite its significance and the advancements made by neural coreference models, CR models face a major bottleneck: their limited generalization capability.

Prior work attributes this generalization gap to differences in annotations, such as what constitutes a mention (or entity) and varying preferences for span boundaries. For a model to have strong referential capabilities, it must adapt to these annotation-specific nuances. However, achieving this level of adaptability remains a significant challenge, even for state-of-the-art (SOTA) models. This challenge is further amplified when evaluating the referential capabilities of large language models (LLMs) in a few-shot setting, where replicating nuanced annotations with just a few examples is highly unrealistic. We observe that these annotation-specific nuances can be beneficial but are not essential for downstream tasks or for evaluating the core referential capabilities of an LLM. We describe these nuances as bells and whistles.

In this work, we redefine the traditional formulation of coreference resolution by shifting focus away from its bells and whistles. Instead, we propose task formulations more aligned with practical applications and demonstrate improved generalizability across domains.

Our first contribution introduces an alternative referential task, Major Entity Identification (MEI). MEI simplifies referential tasks by: (a) assuming that target entities are explicitly provided in the input, and (b) focusing exclusively on frequent entities. Assuming entities to be part of the input shifts the responsibility for domain-specific annotation adaptation—determining which entities are annotated—from the training phase to inference. Through extensive experiments, we show that MEI models generalize effectively across domains using both supervised approaches and LLM-based few-shot prompting across multiple datasets. Importantly, MEI aligns with the classification framework, enabling the use of robust, intuitive, and well-understood classification-based evaluation metrics. Beyond its theoretical appeal, MEI also has practical utility, as it allows users to efficiently search for all mentions of a specific entity or a group of entities of interest.
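The toy example below illustrates only the MEI framing: target entities arrive as part of the input, and each candidate mention is classified against them (or rejected). The alias-lookup scorer is a deliberately trivial stand-in for a learned model or a prompted LLM.

```python
# Toy MEI: classify each mention against a given set of major entities.
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    aliases: set[str]  # illustrative; a real model scores mentions in context

ENTITIES = [
    Entity("Alice", {"alice", "she", "her", "ms. carter"}),
    Entity("Bob", {"bob", "he", "him", "mr. reed"}),
]

def identify(mention: str) -> str:
    """Return the target entity a mention refers to, or 'none'."""
    m = mention.lower()
    for ent in ENTITIES:
        if m in ent.aliases:
            return ent.name
    return "none"  # mention does not refer to any tracked major entity

print(identify("Ms. Carter"))  # -> Alice
print(identify("the lawyer"))  # -> none
```

Because every mention receives exactly one label from a fixed set, standard classification metrics (accuracy, per-entity F1) apply directly.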

Our second major contribution addresses critical shortcomings identified in recent evaluations of large language models (LLMs) on coreference resolution. These studies revealed that traditional output formats and evaluation metrics fail to fully capture models’ referential understanding. Traditional evaluation methods require reproducing the entire document along with annotated cluster information, or precisely replicating the antecedent span. This introduces additional bells and whistles, such as ensuring the accurate reproduction of spans and documents. To tackle this issue, we introduce IdentifyMe, a new benchmark for mention resolution that adopts a multiple-choice question (MCQ) format—a widely used evaluation approach for LLMs. With this simplified task design, any failure can now be attributed exclusively to issues with mention resolution. IdentifyMe presents long narratives and applies heuristics to eliminate easily identifiable mentions, resulting in a more challenging and rigorous task. The benchmark incorporates a curated mix of various mention types and their corresponding entities, enabling fine-grained analysis of model performance. Notably, LLM performance remains substantially below human-level performance on IdentifyMe, highlighting considerable room for improvement even for advanced models like GPT-4. The evaluation also reveals key weaknesses in current LLMs, particularly with pronominal mentions, nested mentions, and other nuanced cases.
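A hedged sketch of the MCQ format that IdentifyMe adopts is shown below; the benchmark's exact prompt wording and option construction are not reproduced here.

```python
# Build a multiple-choice mention-resolution query for an LLM.
def build_mcq(narrative: str, mention: str, options: list[str]) -> str:
    letters = "ABCD"                       # assumes at most four options
    opts = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(options))
    return (f"{narrative}\n\n"
            f'Who or what does the mention "{mention}" refer to?\n'
            f"{opts}\nAnswer with a single letter.")

prompt = build_mcq(
    "Maya handed the violin to her sister before *she* walked on stage.",
    "she",
    ["Maya", "Maya's sister", "the violin", "the stage"],
)
print(prompt)  # the model's single-letter answer is scored as right/wrong
```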

Overall, this work moves beyond traditional coreference resolution formulations, focusing on tasks with practical applicability and providing fresh insights into the referential strengths and weaknesses of current models. We term this approach Coreference Without Bells and Whistles — a streamlined perspective that prioritizes utility and understanding of model capabilities over tailored annotation adaptation.

     

Year of completion: May 2025
Advisor: Vineet Gandhi

Related Publications


Downloads

thesis

Document Image Layout Segmentation and Applications


Jobin K.V.

Abstract

A document carries information through various physical entities or regions, such as headings, paragraphs, figures, captions, tables, and backgrounds, along with the textual content. To decipher the information in a document, a human reader uses a variety of additional cues, such as context, conventions, and information about language, script, and location, together with a complex reasoning process. Millions of documents are created and distributed daily over the Internet and in printed media. Understanding, analyzing, sorting, and comparing a massive collection of documents in limited time is an overwhelming job for humans. Automatic document image understanding systems (DIUS) help humans do this tedious task within a limited time. A DIUS typically has a document image layout segmentation module and information extraction modules. This thesis focuses on challenging problems related to document image layout segmentation in various types of documents and their applications.

In this thesis, first, we analyse various document images using deep features, i.e., features extracted using a pretrained deep neural network. To study deep texture features, we propose a deep network architecture that independently learns texture patterns, discriminative patches, and shapes to solve various document image analysis tasks. The considered tasks are document image classification, genre identification from book covers, scientific document figure classification, and script identification. The presented network learns global, texture, and discriminative features and combines them judiciously based on the nature of the problem. We compare the performance of the proposed approach with state-of-the-art techniques on multiple publicly available datasets such as Book Covers, RVL-CDIP, CVSI, and DocFigure. Experiments show that our approach obtains significantly better performance than the state of the art on all tasks.
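As a schematic example of such a multi-branch design, the PyTorch sketch below fuses a global branch with an orderless texture branch (local responses averaged over space); the branch architectures are deliberately tiny illustrations, not the thesis network.

```python
# Schematic two-branch classifier: global/shape cues + orderless texture cues.
import torch
import torch.nn as nn

class TwoBranchClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.global_branch = nn.Sequential(   # coarse layout and shape cues
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.texture_branch = nn.Sequential(  # local pattern responses
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32 + 32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.global_branch(x)
        t = self.texture_branch(x).mean(dim=(2, 3))  # pooling discards layout
        return self.fc(torch.cat([g, t], dim=1))

logits = TwoBranchClassifier(num_classes=16)(torch.randn(2, 3, 224, 224))
```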

Next, we focus on the problem of document image layout segmentation and propose solutions for several classes of document images, including historical, scientific, and classroom slide images. The historical document image segmentation problem is modeled as a pixel labeling task in which each pixel in the document image is classified into one of the predefined labels, such as text, comment, decoration, and background. The method first extracts deep features from the superpixels of the document image; we then learn an SVM classifier on these features and segment the document image. A pyramid pooling module is used to extract the logical regions of scientific document images. In classroom slide images, the logical regions are distributed based on their location within the image. To utilize the location of the logical regions for slide image segmentation, we propose the Classroom Slide Segmentation Network (CSSN), whose unique attributes distinguish it from most other semantic segmentation networks. We validate the performance of our segmentation architectures using publicly available benchmark datasets.
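The historical-document pipeline can be condensed into the sketch below: SLIC superpixels, one feature vector per superpixel, and an SVM that labels each superpixel. Deep features are replaced by mean color here purely to keep the example self-contained; the thesis uses features from a pretrained deep network.

```python
# Superpixel labeling sketch: SLIC -> per-superpixel features -> SVM.
import numpy as np
from skimage.segmentation import slic
from sklearn.svm import SVC

def superpixel_features(img: np.ndarray, n_segments: int = 200):
    segments = slic(img, n_segments=n_segments, start_label=0)
    feats = np.stack([img[segments == s].mean(axis=0)  # stand-in for deep features
                      for s in np.unique(segments)])
    return segments, feats

page = np.random.rand(256, 256, 3)            # placeholder page image
segments, feats = superpixel_features(page)
labels = np.random.randint(0, 4, len(feats))  # placeholder ground truth:
                                              # 0=text 1=comment 2=decoration 3=background
clf = SVC(kernel="rbf").fit(feats, labels)
pixel_labels = clf.predict(feats)[segments]   # map superpixel labels back to pixels
```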

Next, we analyze the output regions of document layout segmentation. Figures used in documents are complex regions from which a DIUS must decipher information; hence, document figure classification (DFC) is an important stage of a DIUS. The design of a DFC system requires well-defined figure categories and datasets, and existing datasets for classifying figures in document images are limited in size and category coverage. We therefore introduce a scientific figure classification dataset named DocFigure. The dataset consists of 33K annotated figures across 28 categories, drawn from document images of scientific articles published over the last several years. Manual annotation of such a large number (33K) of figures is time-consuming and costly, so we design a web-based annotation tool that can efficiently assign category labels to many figures with minimal effort from human annotators. To benchmark the generated dataset on the classification task, we propose three baseline classification techniques using deep features, deep texture features, and their combination. Our analysis found that the combination of deep and texture features is more effective for document figure classification than either feature alone.

Finally, we propose an application backed by the research in this thesis. Slide presentations are an effective and efficient tool for classroom communication. However, this teaching model can be challenging for blind and visually impaired (VI) students, who require personal human assistance to understand a presented slide. This shortcoming motivates us to design a Classroom Slide Narration System (CSNS) that generates audio descriptions corresponding to the slide content, posing the problem as an image-to-markup language generation task. We extract logical regions such as title, text, equation, figure, and table from the slide image using CSSN, and extract the content (information) from the slide using four well-established modules: optical character recognition (OCR), figure classification, equation description, and table structure recognition. With this information, the CSNS helps VI students understand the slide content. Users gave better feedback on the output quality of the proposed CSNS than on existing systems such as Facebook’s Automatic Alt-Text (AAT) and Tesseract.
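A high-level sketch of such a narration pipeline appears below: segment the slide into labeled regions, route each region to a matching extractor, and assemble a description for a text-to-speech engine. All extractor functions are trivial stubs standing in for the OCR, figure classification, equation description, and table recognition modules.

```python
# Slide narration sketch: region label -> extractor -> spoken description.
def run_ocr(region): return region["content"]                 # stub OCR
def classify_figure(region): return region["content"]         # stub, e.g. "bar chart"
def describe_equation(region): return f"Equation: {region['content']}."
def summarize_table(region): return f"A table about {region['content']}."

EXTRACTORS = {
    "title": lambda r: f"Title: {run_ocr(r)}.",
    "text": run_ocr,
    "figure": lambda r: f"A figure showing a {classify_figure(r)}.",
    "equation": describe_equation,
    "table": summarize_table,
}

def narrate(regions):
    """`regions`: (label, region) pairs in reading order, e.g. CSSN output."""
    return " ".join(EXTRACTORS[label](r) for label, r in regions
                    if label in EXTRACTORS)

slide = [("title", {"content": "Gradient Descent"}),
         ("text", {"content": "Update weights along the negative gradient."}),
         ("figure", {"content": "loss curve"})]
print(narrate(slide))  # this string would be sent to a TTS engine as audio
```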

       

Year of completion: April 2025
Advisor: Prof. C V Jawahar

Related Publications


Downloads

thesis

Interpretation and Analysis of Deep Face Representations: Methods and Applications


Thrupthi Ann John

Abstract

The rapid growth of deep neural network models in the face domain has led to their adoption in safety-critical applications. However, a crucial limitation hindering their widespread deployment is the lack of comprehensive understanding of how these models work and the inability to explain their decisions. Explainability is essential for ensuring the correctness, reliability, and fairness of AI systems, and there is a growing recognition of its importance across AI applications. Despite the significance of explainability, most current methods are designed for general object recognition tasks and cannot be directly applied to the face domain. Faces are highly structured objects, and face tasks often involve fine-grained details, making them unique and distinct from general object recognition. This thesis aims to bridge the gap in explainability literature for the face domain by providing novel methods for interpreting and analyzing deep face representations.

In this thesis, we embark on a comprehensive journey of interpreting and analyzing deep face representations to uncover the underlying mechanisms behind DNN-based face-processing models. We first visualize face representations and introduce methods to identify functional concepts in face representations using ‘cross-task aware filters’ (CRAFT). Our approach includes an efficient task-aware pruning method using CRAFTs. We also present state-of-the-art Canonical Saliency Maps (CSM) to pinpoint critical input features. We thoroughly analyze deep face representations to understand the learned features and their functional relevance in different face tasks. To further enhance our understanding of human attention in the context of driving behavior, we investigate driver gaze patterns and develop DashGaze, a large-scale naturalistic driver gaze dataset. Using this dataset, we propose an innovative calibration-free driver gaze estimation algorithm that provides valuable information for studying and predicting driver behavior.
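For intuition only, the snippet below computes a generic gradient-based saliency map that highlights the input pixels driving a face model's score; the thesis's Canonical Saliency Maps additionally project importance onto a canonical face coordinate frame, which is not reproduced here.

```python
# Generic gradient saliency for a toy face model (not the thesis's CSM method).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 10))  # toy model
face = torch.randn(1, 3, 112, 112, requires_grad=True)  # placeholder face input

score = model(face)[0].max()   # score of the top class / identity
score.backward()               # gradients w.r.t. input pixels
saliency = face.grad.abs().amax(dim=1).squeeze(0)  # (112, 112) importance map
```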

The comprehensive overview, experimental studies, and analyses presented in this thesis contribute to the wider adoption of explainability methods in face-processing tasks, enabling safer and more trustworthy deployment of deep-face algorithms in real-world applications. By shedding light on the inner workings of these models and their biases, this work paves the way for the responsible and ethical development of AI technologies in the face domain.

         

Year of completion: December 2024
Advisors: Prof. C V Jawahar, Prof. Vineeth N Balasubramanian

Related Publications


Downloads

thesis

Image Factorization for Inverse Rendering


Saurabh Saini

Abstract

Inverse rendering is a core computer vision problem, as it involves the complete decomposition of an image into its constituent atomic components. These components can be analyzed in isolation, or suitably modified and recombined, to solve a given image analysis task or to produce the required generative content. Rather than aiming for full decomposition, many applications require decomposition into only a few factors which are themselves simple combinations of the underlying atomic components. This makes image factorization a critical first step in several computer vision and image processing applications. The factorization can be either optically motivated, as in reflectance-shading decomposition, white balancing, or illumination spectra separation, or semantically motivated, as in style-content disentanglement or foreground-background matting.

In this thesis, we focus on the former and present several image factorization solutions, with an aim to use them for downstream image-based rendering applications. Initially, we assume Lambertian reflection only, under the classical image formation model inspired by Retinex theory. Our first solution in this category requires multiple images of the scene as input, a requirement we then relax in our second solution, which works with a single input image. Afterwards, we propose a novel image formation model based on the specularity of the image content and provide two solutions, using the low-light enhancement problem as the vehicle for empirical validation. Towards the end, a novel prior induction technique is also presented based on learnable concepts, and its utility is shown by improving the results of pre-existing state-of-the-art image decomposition networks. We conclude with a summary, limitations, future research directions, and possible additional applications. The thesis is organized into four units, respectively discussing the problem definition and significance; the Lambertian-reflection-based intrinsic image decomposition problem; specularity-respecting novel illumination factorization methods; and finally concept-based model analysis and conclusions. We hope that, with the problems and solutions discussed in this thesis, we will be able to define and highlight the importance of the image factorization step in multiple vision tasks and pique the reader’s interest in this research problem for image generation and beyond.
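For reference, the classical Retinex-inspired Lambertian formation model assumed by the first two solutions factorizes each pixel into reflectance and shading; taking logarithms turns the product into a sum, the form most intrinsic decomposition objectives work with:

```latex
% Lambertian image formation: image I factors into reflectance R (albedo)
% and shading S (illumination) at every pixel x.
I(x) = R(x)\,S(x)
\qquad\Longrightarrow\qquad
\log I(x) = \log R(x) + \log S(x)
```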

Year of completion: August 2024
Advisor: Jayanthi Sivaswamy

Related Publications


Downloads

thesis

More Articles …

1. Modelling Structural Variations in Brain Aging
2. Towards Machine-understanding of Document Images
3. 3D Shape Analysis: Reconstruction and Classification
4. Surrogate Approximations for Similarity Measures