Text-based Video Question Answering


Soumya Shamarao Jahagirdar

Abstract

Think of a situation where you put yourself in the shoes of a visually impaired person who wants to buy an item from a store, or of a person sitting at home watching the news on television and wanting to know about the content being broadcast. Motivated by such situations, which call for systems capable of understanding and reasoning over the textual content in videos, this thesis tackles the novel problem of text-based video question answering. Vision and language are broadly regarded as cornerstones of intelligence. Each has a different aim: language serves communication and the transmission of information, while vision constructs mental representations of the scene around us so that we can navigate and interact with objects. Studying the two fields jointly gives rise to applications, tasks, and methods that go beyond what either achieves individually. This inter-dependency is studied in an emerging area of research called multi-modal understanding, which covers tasks such as image captioning, visual question answering, video question answering, and text-video retrieval.

To build a system that can reason over both textual and temporal information, we propose a new task. The first portion of this thesis formulates text-based VideoQA: we analyze existing datasets and methods and thereby arrive at the need for text-based VideoQA. To this end, we propose the NewsVideoQA dataset, in which question-answer pairs are framed on the text present in news videos. As this is a newly proposed task, we experiment with existing methods such as text-only models, single-image scene-text-based models, and video question-answering models. Because these baselines were not originally designed to answer questions using the text in videos, a video question-answering model that takes this text into account became necessary. We therefore repurpose an existing VideoQA model to incorporate OCR tokens, namely OCR-aware SINGULARITY, a video question-answering framework that learns joint representations of videos and OCR tokens at the pretraining stage and also uses OCR tokens at the finetuning stage.

In the second portion of the thesis, we look into the M4-ViteVQA dataset, which targets the same task of text-based video question answering but whose videos span multiple categories such as shopping, traveling, vlogging, and gaming. We perform an exploratory data analysis of both NewsVideoQA and M4-ViteVQA, examining several aspects of these datasets to identify their limitations. Through this analysis, we show that most questions in both datasets can be answered just by reading the text present in the videos, and that most questions can be answered using a single frame or only a few frames. We perform an exhaustive analysis of a text-only model, BERT-QA, which obtains results comparable to the multimodal methods. We also perform cross-domain experiments to check whether training on one category of videos followed by finetuning on another helps performance on the target dataset.

In the end, we provide insights into dataset creation and how certain types of annotations can help the community build better datasets in the future. We hope this work motivates future research on text-based video question answering across multiple video categories. Furthermore, pretraining strategies and joint representation learning from videos and the multiple modalities they provide will help create scalable systems and drive future research towards better datasets and creative solutions.
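
As a rough illustration of the idea behind an OCR-aware VideoQA model, the sketch below fuses projected frame features, embedded OCR tokens, and question tokens in a single transformer encoder. It is not the thesis' actual OCR-aware SINGULARITY implementation; all module names, dimensions, and the answer head are assumptions made for the example.

```python
# Illustrative only: a toy OCR-aware VideoQA encoder, not the thesis model.
import torch
import torch.nn as nn

class OCRAwareVideoQA(nn.Module):
    def __init__(self, frame_dim=2048, hidden_dim=768, num_heads=12,
                 num_layers=4, vocab_size=30522):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)       # per-frame visual features
        self.ocr_embed = nn.Embedding(vocab_size, hidden_dim)    # OCR token ids read from the frames
        self.text_embed = nn.Embedding(vocab_size, hidden_dim)   # question token ids
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.answer_head = nn.Linear(hidden_dim, vocab_size)     # predicts an answer token

    def forward(self, frame_feats, ocr_ids, question_ids):
        # One multimodal sequence: visual tokens, OCR tokens, question tokens
        tokens = torch.cat([
            self.frame_proj(frame_feats),    # (B, T, D)
            self.ocr_embed(ocr_ids),         # (B, N_ocr, D)
            self.text_embed(question_ids),   # (B, N_q, D)
        ], dim=1)
        fused = self.encoder(tokens)
        # Pool the first position and predict the answer from it
        return self.answer_head(fused[:, 0])
```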

Year of completion: March 2024
Advisor: C V Jawahar

Related Publications


Downloads

thesis

Analytic and Neural Approaches for Complex Light Transport


Ishaan Shah

Abstract

The goal of rendering is to produce a photorealistic image of a given 3D scene description. Physically based rendering simulates the physics of light as it travels and interacts with objects in the scene before finally reaching the camera sensor. Monte Carlo methods have been the go-to approach for physically based rendering. They are general and robust but introduce noise and are computationally expensive. Recent advancements in hardware, algorithms, and denoising techniques have enabled real-time applications of Monte Carlo methods. However, complex scenes still demand high sample counts. In this thesis, we explore and present the utilization of analytic and neural approaches for physically based rendering. Analytic methods offer noise-free renderings but are less general and may introduce bias. In recent years, neural approaches have gained traction, offering a balance between generality and computational efficiency. We compare and contrast traditional Monte Carlo-based methods and emerging analytic and neural network-based methods. We then propose analytic and neural solutions to two challenging cases: direct lighting with many area lights and efficient rendering of glinty appearances on specular normal-mapped surfaces.

Direct lighting from many area light sources is challenging due to variance from both choosing an important light and then a point on it. Existing methods weigh the contribution of all lights by estimating their effect on the shading point. We propose to extend one such method by using analytic methods to improve the estimation of the light's contribution. This enhancement accelerates the convergence of the algorithm, making it more efficient for scenes with many dynamic lights.

The second case deals with the challenge of efficiently rendering glinty appearances on normal-mapped specular surfaces. Traditional Monte Carlo methods struggle with this task due to the rapidly changing spatial characteristics of microstructures. Our solution introduces a novel method supporting spatially varying roughness based on a neural histogram, offering both memory and compute efficiency. Additionally, full direct illumination integration is computed analytically for all light directions with minimal computational effort, resulting in improved quality compared to previous approaches. Through comprehensive analysis and experimentation, this thesis contributes to the advancement of rendering techniques, shedding light on the trade-offs between different methods and providing insights into their practical applications for achieving photorealistic rendering.
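
The light-selection idea can be illustrated with a short sketch: each light is weighted by an estimate of its contribution at the shading point, and one light is drawn in proportion to those weights. This is only a conceptual sketch; estimate_contribution stands in for the analytic contribution estimate developed in the thesis and is assumed to be supplied by the caller.

```python
# Conceptual sketch: choose one of many area lights with probability proportional
# to an estimated contribution at the shading point.
import random

def sample_light(lights, shading_point, estimate_contribution):
    # Weight every light by its estimated (unoccluded) effect on the shading point
    weights = [max(estimate_contribution(light, shading_point), 1e-8) for light in lights]
    total = sum(weights)

    # Draw a light index in proportion to the weights (inverse-CDF sampling)
    u = random.random() * total
    running = 0.0
    for i, w in enumerate(weights):
        running += w
        if u <= running:
            return i, w / total   # chosen light and its selection probability
    return len(lights) - 1, weights[-1] / total

# The Monte Carlo estimator then divides the sampled radiance by the product of the
# light-selection probability and the pdf of the point sampled on that light,
# which keeps the estimate unbiased.
```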

Year of completion: March 2024
Advisor: P J Narayanan

Related Publications


Downloads

thesis

Learning Emotions and Mental States in Movie Scenes


Dhruv Srivastava

Abstract

In this thesis, we delve into the analysis of movie narratives, with a specific focus on understanding the emotions and mental states of characters within a scene. Our approach involves predicting a diverse range of emotions for individual movie scenes and each character within those scenes. To achieve this, we introduce EmoTx, a novel multimodal Transformer-based architecture that integrates video data, multiple characters, and dialogues for making comprehensive predictions. Leveraging annotations from the MovieGraphs dataset, our model is tailored to predict both classic emotions (e.g., happiness, anger) and nuanced mental states (e.g., honesty, helpfulness). Our experiments concentrate on evaluating performance across the ten most common and twenty-five most common emotional labels, along with a mapping that clusters 181 labels into 26 categories. Through systematic ablation studies and a comparative analysis against established emotion recognition methods, we demonstrate the effectiveness of EmoTx in capturing the intricacies of emotional and mental states in movie contexts. Additionally, our investigation into EmoTx's attention mechanisms provides valuable insights. We observe that when characters express strong emotions, EmoTx focuses on character-related elements, while for other mental states, it relies more on video and dialogue cues. This nuanced understanding enhances the interpretability and contextual relevance of EmoTx in the domain of movie story analysis. The findings presented in this thesis contribute to advancing our comprehension of character emotions and mental states in cinematic narratives.
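
To make the setup concrete, the following is a hedged sketch of a multimodal transformer that pools video, character, and dialogue tokens through a classification token and emits multi-label emotion probabilities. It follows the spirit of EmoTx but is not its actual architecture; the feature dimensions, layer counts, and pooling scheme are assumptions.

```python
# Hedged sketch in the spirit of EmoTx; the real architecture and token layout differ.
import torch
import torch.nn as nn

class SceneEmotionModel(nn.Module):
    def __init__(self, video_dim=1024, char_dim=512, dialog_dim=768,
                 dim=512, num_labels=26):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)      # scene-level video features
        self.char_proj = nn.Linear(char_dim, dim)        # per-character track features
        self.dialog_proj = nn.Linear(dialog_dim, dim)    # dialogue (subtitle) embeddings
        self.cls = nn.Parameter(torch.randn(1, 1, dim))  # learned classification token
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, 3)
        self.head = nn.Linear(dim, num_labels)           # one logit per emotion / mental state

    def forward(self, video, chars, dialog):
        b = video.size(0)
        tokens = torch.cat([
            self.cls.expand(b, -1, -1),
            self.video_proj(video),
            self.char_proj(chars),
            self.dialog_proj(dialog),
        ], dim=1)
        out = self.encoder(tokens)
        # Multi-label probabilities read off the classification token
        return torch.sigmoid(self.head(out[:, 0]))
```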

Year of completion: April 2024
Advisor: Makarand Tapaswi

Related Publications


Downloads

thesis

Revolutionizing TV Show Experience: Using Recaps for Multimodal Story Summarization


Aditya Kumar Singh

Abstract

We introduce a novel approach for multimodal story summarization, aimed at leveraging TV episode recaps to create concise summaries of complex storylines. These recaps, which consist of short video sequences combining key visual moments and dialogues from previous episodes, serve as a valuable source of weak supervision for labeling the summarization task. To facilitate this approach, we introduce the PlotSnap dataset, which focuses on two crime thriller TV shows. Each episode in this dataset is over 40 minutes long and is accompanied by rich recaps. These recaps are mapped to corresponding sub-stories, providing labels for the story summarization task. Our proposed model, TaleSumm, operates hierarchically. (i) First, it processes entire episodes by generating compact representations of shots and dialogues. (ii) Then, it predicts the importance scores for each video shot and dialog utterance, taking into account interactions between local story groups. Unlike traditional summarization tasks, our method extracts multiple plot points from long-form videos. We conducted a comprehensive evaluation of our approach, including assessing its performance in cross-series generalization. TaleSumm demonstrates promising results, not only on video summarization benchmarks but also in effectively summarizing the intricate storylines of the TV shows in the PlotSnap dataset. Our project implementation, as well as dataset features and a demo, can be found at https://github.com/katha-ai/RecapStorySumm-CVPR2024.
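
A rough sketch of the two-level idea is given below: shots and dialogue utterances are compressed into compact tokens, an episode-level encoder scores their importance, and the top-scoring items form the summary. Dimensions, layer choices, and the top-k selection are illustrative assumptions, not TaleSumm's actual design.

```python
# Illustrative two-level scorer; not the TaleSumm implementation.
import torch
import torch.nn as nn

class EpisodeScorer(nn.Module):
    def __init__(self, shot_dim=2048, dialog_dim=768, dim=512):
        super().__init__()
        self.shot_proj = nn.Linear(shot_dim, dim)       # compact per-shot token
        self.dialog_proj = nn.Linear(dialog_dim, dim)   # compact per-utterance token
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, 2)  # models interactions across the episode
        self.score = nn.Linear(dim, 1)                  # importance score per token

    def forward(self, shot_feats, dialog_feats):
        tokens = torch.cat([self.shot_proj(shot_feats),
                            self.dialog_proj(dialog_feats)], dim=1)
        return torch.sigmoid(self.score(self.encoder(tokens))).squeeze(-1)

def select_summary(scores, k=20):
    # Keep the k highest-scoring shots/utterances, in their original temporal order
    top = torch.topk(scores, k).indices
    return torch.sort(top).values
```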

Year of completion: April 2024
Advisor: C V Jawahar

Related Publications


Downloads

thesis

Revisiting Synthetic Face Generation for Multimedia Applications


Aditya Agarwal

Abstract

Videos have become an integral part of our daily digital consumption. With the widespread adoption of mobile devices, internet connectivity, and social media platforms, the number of online users and consumers has risen exponentially in recent years. This has led to an unprecedented surge in video content consumption and creation, ranging from short-form content on TikTok to educational material on Coursera and entertainment videos on YouTube. Consequently, there is an urgent need to study videos as a modality in Computer Vision, as doing so can enable a multitude of applications across various domains, including virtual reality, education, and entertainment. By understanding the intricacies of video content, we can unlock its potential and leverage its benefits to enhance user experiences and create innovative solutions.

Producing video content at scale can be challenging due to various practical issues. The recording process can take several hours of practice, and setting up the right studio and camera equipment can be time-consuming and expensive. Moreover, recording requires manual effort, and any mistakes made during the shoot can be difficult to rectify or modify, often requiring the entire video to be re-shot. In this thesis, we ask the question: can synthetically generated videos take the place of real videos? Automatic content creation can significantly scale digital media production and ease the process of content creation, which can aid several applications.

A form of human-centric representation that is becoming increasingly popular in the research community is the automatic generation of talking-head videos. Talking-head generation refers to the ability to generate realistic videos of a person speaking, where the generated video can be of a person who may not exist in reality or may exhibit significantly different characteristics from the original person. Recent deep learning approaches can synthesize talking-head videos at tremendous scale and quality, with diverse content and styles, that are visually indistinguishable from real videos. It is therefore imperative to study the process of generating talking-head videos, as they can be used for a variety of applications such as video conferencing, moviemaking, news broadcasting, vlogging, and language learning, among others. Consider, for instance, a digital avatar reading the news from a text transcript on a broadcast channel. In this vein, this thesis explores two prominent use cases of generating synthetic talking-heads automatically: the first towards generating large-scale synthetic content to aid people in lipreading at scale, and the second towards automating the task of actor-double face-swapping in the moviemaking industry. We study and elucidate the challenges and limitations of existing approaches, propose solutions based on synthetic talking-head generation, and show the superiority of our methods through extensive experimental evaluation and user studies.

In the first task, we address the challenges associated with learning to lipread. Lipreading is a primary mode of communication for people suffering from some form of hearing loss, so learning to lipread is an important skill for hard-of-hearing people. However, learning to lipread is not an easy task, and finding resources to improve one's lipreading skills can be challenging. Existing lipreading training (LRT) websites, which provide basic online resources to improve lipreading skills, are unfortunately limited by real-world variations in the talking faces, cover only a limited vocabulary, and are available in a few select languages and accents. This leaves the vast majority of users without access to adequate lipreading training resources. To address this challenge, we propose an end-to-end pipeline for an online lipreading training platform that uses state-of-the-art talking-head video generator networks, text-to-speech models, and computer vision techniques to increase the amount of online content on LRT platforms in an automated and cost-effective manner. We show that incorporating existing talking-head generator networks for the task of lipreading is not trivial and requires careful adaptation. For instance, we develop an audio-video alignment module that aligns the speech utterance with the region of mouth movements and adds silence around the aligned utterance. Such modifications are necessary to generate realistic-looking videos that do not cause distress to lipreaders. We also design carefully thought-out lipreading training exercises, conduct extensive user studies, and perform statistical analysis to show the effectiveness of the generated content in replacing manually recorded lipreading training videos.

In the second problem, we address challenges in the entertainment industry. Body doubles play an indispensable role in the moviemaking industry: they take the place of actors in dangerous stunt scenes and in scenes where the same actor plays multiple characters. In all these scenes, the double's face is later replaced with the actor's face and expressions using CGI technology, which requires hundreds of hours of manual multimedia edits on heavy graphical units, costs millions of dollars, and takes months to complete. As we show in this thesis, automated face-swapping approaches based on deep learning models are not suitable for the task of actor-double face-swapping, as they fail to preserve the actor's expressions. To address this, we introduce "video-to-video (V2V) face-swapping", a novel face-swapping task that aims to (1) swap the identity and expressions of a source face video and (2) retain the pose and background of the target face video. Our key technical contributions are: (i) a self-supervised training strategy that uses a single video as both source and target, introduces pseudo motion errors on the source video, and trains the network to fix these pseudo errors and regenerate the source video; and (ii) temporal autoencoding models, inspired by VQVAE-2, that take two different motions as input and produce a third, coherent output motion.

In summary, this thesis unravels several tasks enabled by synthetic talking-head generation and provides solutions for the lipreading community and the moviemaking industry. Our findings concretely point toward the notion of replacing real human talking-head videos with synthetically generated videos, thereby scaling digital content creation to new heights, saving precious time and resources, and easing the life of humans.
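
The self-supervised strategy for V2V face-swapping can be pictured as a simple corrupt-and-reconstruct loop: pseudo motion errors are injected into a source clip and the network is trained to regenerate the original. The sketch below is purely conceptual; model, perturb_motion, and reconstruction_loss are placeholders, not the thesis' implementation.

```python
# Conceptual training step only; model, perturb_motion, and reconstruction_loss
# are placeholders and not part of the thesis codebase.
def training_step(model, optimizer, source_clip, perturb_motion, reconstruction_loss):
    # Inject pseudo motion errors (e.g., jittered or reordered motion cues)
    corrupted = perturb_motion(source_clip)

    # The network must undo the pseudo errors and regenerate the original frames
    reconstructed = model(corrupted)
    loss = reconstruction_loss(reconstructed, source_clip)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```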

Year of completion: October 2023
Advisor: C V Jawahar, Vinay P Namboodiri

Related Publications


Downloads

thesis

More Articles …

1. Advancing Domain Generalization through Cross-Domain Class-Contrastive Learning and Addressing Data Imbalances
2. Real-Time Video Processing for Dynamic Content Creation
3. Nerve Block Target Localization and Needle Guidance for Autonomous Robotic Ultrasound Guided Regional Anesthesia
4. Data exploration, Playing styles, and Gameplay for Cooperative Partially Observable games: Pictionary as a case study