Cinematic Video Editing: Integrating Audio-Visual Perception and Dialogue Interpretation


Rohit Girmaji

Abstract

This thesis focuses on advancing automated video editing by analyzing raw, unedited footage to extract essential information through active speaker detection, video saliency prediction, and dialogue interpretation. At the core of this work is EditIQ, an automated video editing pipeline that leverages speaker cues, saliency predictions, and large language model (LLM)-based dialogue understanding to optimize shot selection, the critical step in the editing process.

The study begins with a comprehensive assessment of active speaker detection techniques tailored for automated editing. Using the BBC Old School Dataset, annotated with active speaker information, we propose a robust audio-based nearest-neighbor algorithm that integrates facial and audio features. This approach reliably identifies speakers even under challenging conditions such as occlusions and noise, outperforming existing methods and closely aligning with manual annotations.
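As a rough illustration of the idea (not the thesis's exact pipeline), the sketch below assigns the active speaker in a frame by nearest-neighbour voting over fused face and audio features; the feature extractors, the concatenation-based fusion, and the Euclidean distance are assumptions made for the example.

import numpy as np

def active_speaker_nn(face_feats, audio_feat, labelled_bank, k=5):
    """Pick the active speaker in one frame by k-nearest-neighbour voting.

    face_feats    : dict {face_id: 1-D feature vector} for faces visible in the frame
    audio_feat    : 1-D feature vector for the synchronized audio window
    labelled_bank : list of (fused feature vector, is_speaking 0/1) pairs, e.g. built
                    from annotated frames of the BBC Old School dataset
    """
    scores = {}
    for face_id, f in face_feats.items():
        fused = np.concatenate([f, audio_feat])          # naive fusion by concatenation
        dists = [np.linalg.norm(fused - b) for b, _ in labelled_bank]
        nearest = np.argsort(dists)[:k]                  # k closest annotated examples
        scores[face_id] = np.mean([labelled_bank[i][1] for i in nearest])
    return max(scores, key=scores.get)                   # face with the strongest "speaking" vote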

In the domain of video saliency prediction, we present ViNet-S and ViNet-A, compact yet effective models designed to predict saliency maps and identify salient regions in video frames. These models are computationally efficient, balancing high accuracy with reduced model complexity.

Starting with a static, wide-angle camera feed, EditIQ generates multiple virtual camera feeds, mimicking a team of cinematographers. Speaker detection, saliency-based scene understanding, and LLM-driven dialogue analysis guide shot selection, which is formulated as an energy minimization problem. This optimization ensures cinematic coherence, smooth transitions, and narrative clarity in the final output.
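To make the energy-minimization view concrete, here is a minimal Viterbi-style sketch that picks one virtual camera per time step by trading off a per-frame cost (e.g. low when the camera frames the active speaker or a salient region) against a fixed cut penalty. The cost terms and the single scalar switch penalty are illustrative stand-ins, not EditIQ's actual objective.

import numpy as np

def select_shots(unary, switch_cost=1.0):
    """Choose one virtual camera per time step by minimizing
    sum_t unary[t, c_t] + switch_cost * [c_t != c_{t-1}]  (dynamic programming).

    unary : (T, C) array; unary[t, c] is the cost of showing camera c at time t.
    """
    T, C = unary.shape
    cost = unary[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        stay = cost                                   # keep the same camera (no cut)
        cut = cost.min() + switch_cost                # cut from the cheapest previous camera
        back[t] = np.where(stay <= cut, np.arange(C), cost.argmin())
        cost = np.minimum(stay, cut) + unary[t]
    # Trace the optimal camera sequence backwards.
    seq = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        seq.append(int(back[t][seq[-1]]))
    return seq[::-1]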

The efficacy of EditIQ is validated through a psychophysical study involving twenty participants using the BBC Old School dataset. Results demonstrate EditIQ’s ability to produce aesthetically compelling and narratively coherent edits, surpassing competing baselines and showcasing its potential to transform raw footage into polished cinematic narratives.

Year of completion: June 2025
Advisor: Prof. Vineet Gandhi

Related Publications


    Downloads

    thesis

    Towards understanding Compositionality in Vision-Language Models


    Darshana S

    Abstract

Human intelligence relies on compositional generalization: the ability to interpret novel situations by flexibly combining familiar concepts and relational structures. This thesis investigates compositionality in vision-language models (VLMs), focusing on their ability to understand and generalize across visual (images, videos) and linguistic inputs.

    In the first part, we introduce VELOCITI, a benchmark for evaluating compositional understanding in video-language models through a suite of entailment tasks. Unlike prior compositionality benchmarks constrained to single-agent videos, VELOCITI captures the complexity of real-world videos involving multiple agents and dynamic interactions. VELOCITI assesses how well models recognize and bind agents, actions, and temporal events using both text-inspired and in-video counterfactual negations.
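As a hedged illustration of how such an entailment suite can be scored (the actual VELOCITI protocol and metrics are defined in the thesis, not here), a model passes an item when it rates the true video-caption pair above its counterfactual negative.

def entailment_accuracy(model, examples):
    """Fraction of items where the video-language model scores the correct caption
    above its counterfactual (e.g. agent- or action-swapped) caption.

    model.score(video, text) -> float is an assumed interface, not a real library call.
    examples : list of (video, positive_caption, negative_caption) tuples
    """
    correct = sum(model.score(v, pos) > model.score(v, neg) for v, pos, neg in examples)
    return correct / len(examples)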

    In the second part, we probe the internal activations of VLMs to understand how concepts in an image are bound to their attributes and references in text. Extending the Binding ID mechanism in language models, we demonstrate that VLMs construct binding ID vectors in the activations of both image tokens and their textual references, enabling in-context concept association.
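A minimal probing sketch in this spirit, assuming activations have already been extracted; the layer choice and the cosine-similarity probe are illustrative simplifications, not the thesis's full analysis. If binding ID vectors are present, image tokens and the text tokens referring to the same concept should show elevated similarity.

import numpy as np

def binding_similarity(img_acts, txt_acts):
    """Cosine-similarity matrix between image-token and text-token activations at one layer.

    img_acts : (num_image_tokens, d) array of residual-stream activations
    txt_acts : (num_text_tokens, d) array
    Returns a matrix whose entry (i, j) is the similarity of image token i and text token j.
    """
    a = img_acts / np.linalg.norm(img_acts, axis=1, keepdims=True)
    b = txt_acts / np.linalg.norm(txt_acts, axis=1, keepdims=True)
    return a @ b.T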

    Together, these contributions advance our understanding of compositional reasoning in VLMs and offer tools for probing their capabilities.

Year of completion: June 2025
Advisor: Prof. Vineet Gandhi

    Related Publications


      Downloads

      thesis

      Face Sketch Generation and Recognition


      Kushal Kumar Jain

      Abstract

The field of sketch generation and recognition has seen significant advancements through the innovative application of generative models. This thesis presents a comprehensive exploration of face stylization, artistic portrait generation, and forensic sketch synthesis, leveraging the power of state-of-the-art generative models like StyleGAN and Stable Diffusion. Our work addresses key challenges in preserving identity, accommodating various poses, and bridging the modality gap between sketches and photographs. Through three interconnected studies, we demonstrate substantial progress in generating high-quality sketches and improving forensic applications.

      We begin by introducing a novel approach to face cartoonization that preserves identity and accommodates various poses. Unlike conditional-GAN methods, our technique utilizes an encoder to capture pose and identity information, generating embeddings within StyleGAN’s latent space. This approach uniquely adapts a pre-trained StyleGAN model, originally designed for realistic facial images, to produce cartoonized outputs without requiring a dedicated fine-tuned model.

      Building upon this foundation, we present Portrait Sketching StyleGAN (PS-StyleGAN), a style transfer approach tailored for portrait sketch synthesis. PS-StyleGAN leverages StyleGAN’s semantic W+ latent space to generate portrait sketches while allowing meaningful edits such as pose and expression alterations. By introducing Attentive Affine transform blocks and a specialized training strategy, we achieve high-quality sketch generation without fine-tuning StyleGAN itself. This method demonstrates superior performance over current state-of-the-art techniques, requiring only a small number of paired examples and minimal training time.
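For intuition only, the block below sketches one plausible form of an attentive affine transform: attention weights over the style code modulate the per-channel scale and shift applied to a StyleGAN layer's features. The exact architecture, dimensions, and training losses of PS-StyleGAN differ; every name and shape here is an assumption made for illustration.

import torch
import torch.nn as nn

class AttentiveAffine(nn.Module):
    """Illustrative attentive affine block (a guess at the flavour, not PS-StyleGAN's definition)."""
    def __init__(self, style_dim, num_channels):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(style_dim, style_dim), nn.Softmax(dim=-1))
        self.to_scale = nn.Linear(style_dim, num_channels)
        self.to_shift = nn.Linear(style_dim, num_channels)

    def forward(self, feat, style):
        # feat: (B, C, H, W) layer features; style: (B, style_dim) W+ code for this layer
        s = style * self.attn(style)                        # re-weight style dimensions by attention
        scale = self.to_scale(s).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(s).unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + scale) + shift                   # per-channel affine modulation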

Finally, we address the challenging task of forensic sketch-to-mugshot matching with CLIP4Sketch, a novel approach that uses diffusion models to generate diverse sketch images. By combining CLIP and AdaFace embeddings of reference mugshots with textual style descriptions, we create a comprehensive dataset of sketches corresponding to mugshots. This synthetic data significantly improves the accuracy of sketch-to-mugshot matching in face recognition systems, outperforming training on limited real face sketch data and on datasets produced by GAN-based methods.
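The fusion step can be pictured roughly as follows; this is a simplified guess at the conditioning, with the normalization and concatenation chosen for the example rather than taken from CLIP4Sketch itself.

import numpy as np

def sketch_condition(clip_image_emb, adaface_emb, clip_text_emb):
    """Build a conditioning vector for a sketch generator by concatenating an identity
    embedding (AdaFace) and an appearance embedding (CLIP image) of the mugshot with a
    CLIP text embedding of the desired sketch style. Purely illustrative fusion.
    """
    parts = [clip_image_emb, adaface_emb, clip_text_emb]
    parts = [p / np.linalg.norm(p) for p in parts]   # normalise each modality separately
    return np.concatenate(parts)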

      Collectively, these contributions push the boundaries of sketch generation and recognition, offering promising applications in both the artistic and forensic domains.

       

Year of completion: May 2025
Advisor: Anoop Namboodiri

      Related Publications


        Downloads

        thesis

         

        Coreference Without Bells and Whistles


        S Kawshik Manikantan

        Abstract

        Coreference resolution (CR) is the task of identifying text spans that refer to the same entity. It is a fundamental component of natural language understanding with applications in various downstream NLP tasks, such as question answering, knowledge graph construction, and summarization. Despite its significance and the advancements made by neural coreference models, CR models face a major bottleneck: their limited generalization capability.

Prior work attributes this generalization gap to differences in annotations, such as what constitutes a mention (or entity) and varying preferences for span boundaries. For a model to have strong referential capabilities, it must adapt to these annotation-specific nuances. However, achieving this level of adaptability remains a significant challenge, even for state-of-the-art (SOTA) models. This challenge is further amplified when evaluating the referential capabilities of large language models (LLMs) in a few-shot setting, where replicating nuanced annotations with just a few examples is highly unrealistic. We observe that these annotation-specific nuances can be beneficial but are not essential for downstream tasks or for evaluating the core referential capabilities of an LLM. We describe these nuances as bells and whistles.

        In this work, we redefine the traditional formulation of coreference resolution by shifting focus away from its bells and whistles. Instead, we propose task formulations more aligned with practical applications and demonstrate improved generalizability across domains.

Our first contribution introduces an alternative referential task, Major Entity Identification (MEI). MEI simplifies referential tasks by (a) assuming that target entities are explicitly provided in the input, and (b) focusing exclusively on frequent entities. Assuming entities to be part of the input shifts the responsibility for domain-specific annotation adaptation (determining which entities are annotated) from the training phase to inference. Through extensive experiments, we show that MEI models generalize effectively across domains using both supervised approaches and LLM-based few-shot prompting across multiple datasets. Importantly, MEI aligns with the classification framework, enabling the use of robust, intuitive, and well-understood classification-based evaluation metrics. Beyond its theoretical appeal, MEI also has practical utility, as it allows users to efficiently search for all mentions of a specific entity or a group of entities of interest.
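For illustration, an MEI-style query in the few-shot prompting setting might look like the sketch below; the wording is hypothetical and not the prompt used in the thesis.

def mei_prompt(document, target_entities):
    """Build a Major Entity Identification query: the target entities are part of the
    input, and the model is asked to list each entity's mentions in the document.
    """
    entities = ", ".join(target_entities)
    return (
        f"Document:\n{document}\n\n"
        f"Target entities: {entities}\n"
        "For each target entity, list every phrase in the document that refers to it."
    )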

Our second major contribution addresses critical shortcomings identified in recent evaluations of large language models (LLMs) on coreference resolution. These studies revealed that traditional output formats and evaluation metrics fail to fully capture models' referential understanding. Traditional evaluation methods require reproducing the entire document along with annotated cluster information or precisely replicating the antecedent span. This introduces additional bells and whistles, such as ensuring the accurate reproduction of spans and documents. To tackle this issue, we introduce IdentifyMe, a new benchmark for mention resolution that adopts a multiple-choice question (MCQ) format, a widely used evaluation approach for LLMs. With this simplified task design, any failure can now be attributed exclusively to issues with mention resolution. IdentifyMe presents long narratives and applies heuristics to eliminate easily identifiable mentions, resulting in a more challenging and rigorous task. The benchmark incorporates a curated mix of various mention types and their corresponding entities, enabling fine-grained analysis of model performance. Notably, LLM performance remains substantially below human-level performance on IdentifyMe, highlighting considerable room for improvement even for advanced models like GPT-4. The evaluation also reveals key weaknesses in current LLMs, particularly with pronominal mentions, nested mentions, and other nuanced cases.
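A hypothetical IdentifyMe-style item, shown only to illustrate the MCQ framing; the benchmark's actual templates, distractor construction, and filtering heuristics are specified in the thesis.

def identifyme_question(narrative, mention, candidates):
    """Format a multiple-choice mention-resolution item: the model must pick which
    candidate entity a highlighted mention in the narrative refers to.
    """
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
    return (
        f"{narrative}\n\n"
        f"Who or what does the mention \"{mention}\" refer to?\n{options}\n"
        "Answer with a single option letter."
    )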

        Overall, this work moves beyond traditional coreference resolution formulations, focusing on tasks with practical applicability and providing fresh insights into the referential strengths and weaknesses of current models. We term this approach Coreference Without Bells and Whistles — a streamlined perspective that prioritizes utility and understanding of model capabilities over tailored annotation adaptation.

         

Year of completion: May 2025
Advisor: Prof. Vineet Gandhi

        Related Publications


          Downloads

          thesis

           

          Predictive Modeling of Accident-Prone Road Zones and Action Recognition in Unstructured Traffic Scenarios using ADAS Systems at Population Scale


          Ravi Shankar Mishra

          Abstract

This thesis addresses the critical challenge of improving road safety by introducing novel approaches to predictive modeling of accident-prone zones and action recognition in critical traffic scenarios. It makes two key contributions: the early identification of accident-prone zones using Advanced Driver Assistance System (ADAS) data and the development of IDD-CRS, a comprehensive dataset for action recognition in unstructured road environments.

In the first study, geo-tagged collision alert data from a fleet of 200 ADAS-equipped city buses in Nagpur, India, is leveraged to proactively identify high-risk zones across urban road networks. Using Kernel Density Estimation (KDE), this study captures the spatiotemporal distribution of collision alerts, enabling the detection of emerging blackspots before accidents occur. A novel recall-based metric evaluates the alignment of these predicted zones with historical blackspots, while Earth Mover's Distance (EMD)-based analysis identifies previously unreported accident-prone areas. This predictive framework provides civic authorities with actionable insights for targeted interventions, such as traffic-calming measures and infrastructure improvements, thereby enhancing public safety.
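A minimal sketch of the KDE step, assuming alerts and candidate locations are given as latitude/longitude arrays; the Gaussian kernel, degree-based bandwidth, and scikit-learn estimator are illustrative choices rather than the tuned configuration reported in the thesis.

import numpy as np
from sklearn.neighbors import KernelDensity

def alert_density(alert_latlon, grid_latlon, bandwidth=0.002):
    """Estimate the spatial density of ADAS collision alerts with a Gaussian KDE and
    score each candidate grid cell; the highest-density cells are candidate blackspots.

    alert_latlon : (N, 2) array of [lat, lon] alert locations
    grid_latlon  : (M, 2) array of candidate locations to score
    """
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(alert_latlon)
    return np.exp(kde.score_samples(grid_latlon))   # density score per candidate location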

The second part of the thesis introduces the IDD-CRS dataset, a large-scale collection of traffic scenarios recorded using ADAS and dash cameras. IDD-CRS fills a critical gap in existing datasets by focusing on complex interactions between vehicles and pedestrians, with scenarios such as high-speed lane changes, unsafe vehicle approaches, and near-miss incidents. With precise temporal annotations powered by ADAS technology, the dataset ensures accurate event boundaries, providing a robust benchmark for action recognition and long-tail action recognition tasks. It includes 90 hours of footage spanning 5,400 one-minute videos and 135,000 frames, with hard negative examples to challenge existing models. Initial benchmarks highlight the limitations of current video backbones in recognizing rare events, emphasizing the need for further advancements.

Together, these contributions provide a holistic framework for improving road safety through proactive accident prevention and robust action recognition in traffic scenarios. By addressing both spatial accident prediction and temporal event recognition, this work offers foundational resources and actionable insights to advance research and practical solutions for safer road environments.

           

Year of completion: April 2025
Advisor: Ravi Kiran Sarvadevabhatla

          Related Publications


            Downloads

            thesis

             

            More Articles …

            1. Ads and Anomalies: Structuring the Known and Probing the Unknown
            2. Advancing Motion with LLMs: Leveraging Large Language Models for Enhanced Text-Conditioned Motion Generation and Retrieval
            3. Towards Understanding Small Objects in Indian Driving Situations
            4. Multi-Object Multi-Part Scene Parsing