
On the Democratization of Realistic 3D Head Avatar Generation and Reconstruction


Pranav Manu

Abstract

The need for photorealistic head avatars has grown over the past decades, driven by rising interest in AR/VR media formats. Accurate head representations will soon be essential for facilitating communication between users, effectively enabling telepresence. The need for a more lifelike form of remote communication became especially clear during the recent COVID-19 pandemic and the ensuing lockdowns, when millions of people had to stay away from their families and workplaces for extended periods. Realistic facial avatars have also proved immensely helpful in the film and gaming industries, where they are used either to modify an actor's appearance or to drive an entirely virtual yet realistic-looking digital character, depending on the demands of the narrative.

Capturing and reconstructing a realistic-looking head avatar is not trivial: it requires an expensive setup of multiple synchronized cameras and lights, together with a mathematical understanding of how light interacts with skin, hair, cornea, and other facial structures, and capturing each subject is laborious and time-consuming. Creating digital faces that are indistinguishable from real ones is a formidable challenge due to the "uncanny valley" phenomenon, where even minor deviations from realistic appearance can render a digital face unsettling to human observers. Yet for realistic head avatars to serve telepresence and AR/VR applications, their capture and creation must become accessible. There is therefore a need for methods that can reconstruct and create digital replicas that are both photorealistic and inexpensive. This thesis tackles the problem in two ways: from the perspective of digital replica generation, and from the perspective of creating a digital replica through reconstruction.

Our first approach to making digital replica creation efficient is a textured head generation method conditioned on a descriptive text prompt. The goal is to generate a realistic-looking head avatar from a text description efficiently, without manual intervention by artists or the use of highly specialised software such as Blender or Maya; the method can produce textured head assets within seconds. However, this texture-based synthesis suffers from reduced realism because of baked-in lighting effects. An approach is therefore needed that can construct a head avatar with accurate material properties, so that it can be placed in any environment.
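
As a rough illustration of the text-to-texture idea only (this is not the thesis's pipeline; the model id, prompt, and UV-mapping step are assumptions), a generic off-the-shelf latent diffusion model can be asked for a candidate UV face texture directly from a prompt:

# Illustrative sketch: text-prompted UV texture generation with a generic
# latent diffusion model via the diffusers library. The model id, prompt,
# and the idea of mapping the result onto a template head mesh are
# assumptions for illustration, not the method proposed in the thesis.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = ("flat UV texture map of a middle-aged man's face, neutral expression, "
          "even diffuse lighting, photorealistic skin")
texture = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
texture.save("head_texture.png")  # would then be applied to a template head mesh

A dedicated method additionally has to keep the generated texture consistent with the head mesh's UV layout, which a generic image model does not guarantee.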

 

Year of completion: June 2025
Advisor 1: Dr. Avinash Sharma
Advisor 2: Prof. PJ Narayanan


Seeing, Describing and Remembering: A Study on Audio Descriptions and Video Memorability


Eshika Khandelwal

Abstract


     

Year of completion: June 2025
Advisor: Makarand Tapaswi


The Anatomy of Synthesis: Simulating Changes in the Human Brain over Time through Diffeomorphic Deformations


Anirudh Kaushik

Abstract

The human brain undergoes continuous structural changes throughout the lifespan, driven by a complex interplay of aging processes, environmental influences, and disease-related mechanisms. Patterns of structural change—particularly atrophy associated with tissue loss and shrinkage—emerge gradually over time and are observable using medical imaging techniques. While these changes are shaped by common biological mechanisms, they are also highly individualized, influenced by factors such as lifestyle and neurological conditions like Alzheimer’s Disease (AD), Parkinson’s disease, tumors, and stroke. Understanding the progression of these changes—both at the individual level and across populations—is critical for advancing our knowledge of healthy aging and the dynamics of neurodegenerative disease.

To study how brain structure evolves over time, researchers rely on longitudinal neuroimaging: repeated imaging of the same individuals at multiple timepoints. Unlike cross-sectional imaging, which captures a single snapshot per subject, longitudinal scans provide a temporal sequence that enables direct observation of anatomical trajectories. These sequences allow for the measurement of rates of change, identification of early biomarkers, and modeling of disease progression in a subject-specific manner.

However, acquiring complete longitudinal datasets in practice remains challenging. Subject dropout, missed clinical visits, and protocol variability often result in missing scans, interrupting the temporal continuity required for accurate modeling. These gaps limit the effectiveness of methods that rely on temporally complete inputs and can bias downstream analyses. Imputing the missing scan to complete the subject’s imaging timeline is therefore a critical step toward enabling robust longitudinal modeling and improving our understanding of neurodegenerative processes.
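
As a simplified sketch of the imputation idea (an assumption for illustration, not the method developed in the thesis), a missing intermediate scan can be approximated by scaling a displacement field estimated between two observed scans and warping the earlier scan; a properly diffeomorphic variant would integrate a velocity field, for example via scaling-and-squaring, rather than linearly scaling displacements.

# Simplified illustration: impute a scan at a fractional time between two
# observed timepoints by scaling a displacement field and warping the earlier
# volume. The displacement field here stands in for the output of a
# registration step; real MRI volumes and a diffeomorphic parameterization
# are assumed away for brevity.
import numpy as np
from scipy.ndimage import map_coordinates

def warp(volume, displacement):
    """Warp a 3D volume by a displacement field of shape (3, D, H, W)."""
    grid = np.indices(volume.shape).astype(np.float64)  # identity sampling grid
    return map_coordinates(volume, grid + displacement, order=1, mode="nearest")

def impute(scan_t0, disp_t0_to_t1, alpha=0.5):
    """Synthesize a scan at fractional time alpha between t0 and t1."""
    return warp(scan_t0, alpha * disp_t0_to_t1)

# Toy usage with random arrays standing in for registered volumes.
scan_t0 = np.random.rand(64, 64, 64)
disp = np.random.randn(3, 64, 64, 64) * 0.5
scan_mid = impute(scan_t0, disp, alpha=0.5)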

       

Year of completion: June 2025
Advisor: Prof. Jayanthi Sivaswamy


Cinematic Video Editing: Integrating Audio-Visual Perception and Dialogue Interpretation


Rohit Girmaji

Abstract

This thesis focuses on advancing automated video editing by analyzing raw, unedited footage to extract essential information such as speaker detection, video saliency, and dialogue interpretation. At the core of this work is EditIQ, an automated video editing pipeline that leverages speaker cues, saliency predictions, and large language model (LLM)-based dialogue understanding to optimize shot selection—the critical step in the editing process.

The study begins with a comprehensive assessment of active speaker detection techniques tailored for automated editing. Using the BBC Old School Dataset, annotated with active speaker information, we propose a robust audio-based nearest-neighbor algorithm that integrates facial and audio features. This approach reliably identifies speakers even under challenging conditions such as occlusions and noise, outperforming existing methods and closely aligning with manual annotations.
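
The abstract does not spell out the algorithm, so the following is only a hypothetical sketch of nearest-neighbour speaker assignment: for each frame, the detected face whose embedding is most similar to that frame's audio embedding in an assumed shared feature space is labelled the active speaker.

# Hypothetical sketch of per-frame nearest-neighbour active-speaker assignment.
# The existence of a shared audio/face embedding space and the feature
# extraction itself are assumptions, not details from the thesis.
import numpy as np

def active_speaker_per_frame(audio_feats, face_feats):
    """audio_feats: (T, d) per-frame audio embeddings.
    face_feats: length-T list of (n_faces_t, d) per-face embeddings.
    Returns, per frame, the index of the closest face (None if no face)."""
    labels = []
    for a, faces in zip(audio_feats, face_feats):
        if len(faces) == 0:
            labels.append(None)
            continue
        # cosine similarity between the audio embedding and each face embedding
        sims = faces @ a / (np.linalg.norm(faces, axis=1) * np.linalg.norm(a) + 1e-8)
        labels.append(int(np.argmax(sims)))
    return labels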

In the domain of video saliency prediction, we present ViNet-S and ViNet-A, compact yet effective models designed to predict saliency maps and identify salient regions in video frames. These models are computationally efficient, balancing high accuracy with reduced model complexity.

Starting with a static, wide-angle camera feed, EditIQ generates multiple virtual camera feeds, mimicking a team of cinematographers. Speaker detection, saliency-based scene understanding, and LLM-driven dialogue analysis guide shot selection, which is formulated as an energy minimization problem. This optimization ensures cinematic coherence, smooth transitions, and narrative clarity in the final output.
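
The abstract does not give the exact energy terms, but the optimization skeleton can be illustrated with a standard dynamic-programming solver: a per-frame unary cost scores how well each virtual camera frames the current speaker or salient region, and a constant switch penalty discourages overly frequent cuts. The cost values below are placeholders, not EditIQ's actual speaker, saliency, or dialogue terms.

# Illustrative dynamic-programming solver for shot selection posed as energy
# minimization. unary[t, c] is the (placeholder) cost of showing virtual
# camera c at frame t; switch_cost penalizes every cut between cameras.
import numpy as np

def select_shots(unary, switch_cost):
    """Return the minimum-energy camera index for every frame."""
    T, C = unary.shape
    cost = unary[0].copy()              # best cost of ending frame 0 on each camera
    back = np.zeros((T, C), dtype=int)  # best predecessor camera per frame
    for t in range(1, T):
        stay = cost                          # keep the same camera, no penalty
        switch = cost.min() + switch_cost    # cut from the globally cheapest camera
        back[t] = np.where(stay <= switch, np.arange(C), int(cost.argmin()))
        cost = np.minimum(stay, switch) + unary[t]
    path = np.empty(T, dtype=int)            # backtrack the optimal sequence
    path[-1] = int(cost.argmin())
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Toy usage: 100 frames, 4 virtual cameras, random placeholder costs.
shots = select_shots(np.random.rand(100, 4), switch_cost=0.3)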

The efficacy of EditIQ is validated through a psychophysical study involving twenty participants using the BBC Old School dataset. Results demonstrate EditIQ’s ability to produce aesthetically compelling and narratively coherent edits, surpassing competing baselines and showcasing its potential to transform raw footage into polished cinematic narratives.

Year of completion: June 2025
Advisor: Prof. Vineet Gandhi


Towards Understanding Compositionality in Vision-Language Models


Darshana S

Abstract

Human intelligence relies on compositional generalization: the ability to interpret novel situations by flexibly combining familiar concepts and relational structures. This thesis investigates compositionality in vision-language models (VLMs), focusing on their ability to understand and generalize across visual (images, videos) and linguistic inputs.

In the first part, we introduce VELOCITI, a benchmark for evaluating compositional understanding in video-language models through a suite of entailment tasks. Unlike prior compositionality benchmarks constrained to single-agent videos, VELOCITI captures the complexity of real-world videos involving multiple agents and dynamic interactions. VELOCITI assesses how well models recognize and bind agents, actions, and temporal events using both text-inspired and in-video counterfactual negations.

In the second part, we probe the internal activations of VLMs to understand how concepts in an image are bound to their attributes and references in text. Extending the Binding ID mechanism in language models, we demonstrate that VLMs construct binding ID vectors in the activations of both image tokens and their textual references, enabling in-context concept association.

Together, these contributions advance our understanding of compositional reasoning in VLMs and offer tools for probing their capabilities.

Year of completion: June 2025
Advisor: Prof. Vineet Gandhi


More Articles …

1. Face Sketch Generation and Recognition
2. Coreference Without Bells and Whistles
3. Predictive Modeling of Accident-Prone Road Zones and Action Recognition in Unstructured Traffic Scenarios using ADAS Systems at Population Scale
4. Ads and Anomalies: Structuring the Known and Probing the Unknown