
Revolutionizing TV Show Experience: Using Recaps for Multimodal Story Summarization


Aditya Kumar Singh

Abstract

We introduce a novel approach for multimodal story summarization, aimed at leveraging TV episode recaps to create concise summaries of complex storylines. These recaps, which consist of short video sequences combining key visual moments and dialogues from previous episodes, serve as a valuable source of weak supervision for labeling the summarization task. To facilitate this approach, we introduce the PlotSnap dataset, which focuses on two crime thriller TV shows. Each episode in this dataset is over 40 minutes long and is accompanied by rich recaps. These recaps are mapped to corresponding sub-stories, providing labels for the story summarization task. Our proposed model, TaleSumm, operates hierarchically. (i) First, it processes entire episodes by generating compact representations of shots and dialogues. (ii) Then, it predicts the importance scores for each video shot and dialog utterance, taking into account interactions between local story groups. Unlike traditional summarization tasks, our method extracts multiple plot points from long-form videos. We conducted a comprehensive evaluation of our approach, including assessing its performance in cross-series generalization. TaleSumm demonstrates promising results, not only on the video summarization benchmarks but also in effectively summarizing the intricate storylines of the TV shows in the PlotSnap dataset. Our project implementation as well as dataset features and demo can be found at https://github.com/katha-ai/RecapStorySumm-CVPR2024.
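The final step described above, turning per-shot and per-utterance importance scores into an extractive summary with several plot points, can be sketched roughly as follows. This is only an illustration: the scores, the 0.5 threshold, and the helpers `select_summary` and `group_contiguous` are hypothetical and are not taken from TaleSumm or its repository.

```python
def select_summary(scores, threshold=0.5):
    """Return indices of items whose importance score meets the threshold.

    Indices stay in temporal order, so the summary preserves story order.
    The threshold value is an assumption made for this sketch.
    """
    return [i for i, s in enumerate(scores) if s >= threshold]


def group_contiguous(indices):
    """Group consecutive indices; each group stands in for one plot point."""
    groups = []
    for i in indices:
        if groups and i == groups[-1][-1] + 1:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


# Hypothetical importance scores for six shots and four dialog utterances.
shot_scores = [0.1, 0.8, 0.7, 0.2, 0.05, 0.9]
utterance_scores = [0.6, 0.3, 0.95, 0.4]

summary_shots = select_summary(shot_scores)
summary_utterances = select_summary(utterance_scores)
plot_points = group_contiguous(summary_shots)
```

With these toy scores, shots 1, 2, and 5 are kept and fall into two separate groups, which mirrors the idea of extracting multiple plot points rather than one contiguous highlight.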

Year of completion:  April 2024
 Advisor : C V Jawahar

    Revisiting Synthetic Face Generation for Multimedia Applications


    Aditya Agarwal

    Abstract

    Videos have become an integral part of our daily digital consumption. With the widespread adoption of mobile devices, internet connectivity, and social media platforms, the number of online users and consumers has risen exponentially in recent years. This has led to an unprecedented surge in video content consumption and creation, ranging from short-form content on TikTok to educational material on Coursera and entertainment videos on YouTube. Consequently, there is an urgent need to study videos as a modality in Computer Vision, as it can enable a multitude of applications across various domains, including virtual reality, education, and entertainment. By understanding the intricacies of video content, we can unlock its potential and leverage its benefits to enhance user experiences and create innovative solutions. Producing video content at scale can be challenging due to various practical issues. The recording process can take several hours of practice, and setting up the right studio and camera equipment can be time-consuming and expensive. Moreover, recording requires manual effort, and any mistakes made during the shoot can be difficult to rectify or modify, often requiring the entire video to be re-shot. In this thesis, we aim to ask the question “Can synthetically generated videos take the place of real videos?” as automatic content creation can significantly scale digital media production and ease the process of content creation that can aid several applications. A form of human-centric representation that is becoming increasingly popular in the research community is the ability to generate talking-head videos automatically. Talking-head generation refers to the ability to generate realistic videos of a person speaking, where the generated video can be of a person that may not exist in reality or may exhibit significantly different characteristics than the original person. 
Recent deep learning approaches can synthesize talking-head videos at tremendous scale and quality, with diverse content and styles, that are visually indistinguishable from real videos. Therefore, it is imperative to study the process of generating talking-head videos, as these videos can be used for a variety of applications, such as video conferencing, movie-making, broadcasting news, vlogging, and language learning, among others. Consider, for example, a digital avatar reading the news from a text transcript on a news broadcast. In this vein, this thesis explores two prominent use cases of generating synthetic talking heads automatically: the first generates large-scale synthetic content to help people learn lipreading, and the second automates actor-double face-swapping in the moviemaking industry. We study and elucidate the challenges and limitations of the existing approaches, propose solutions based on synthetic talking-head generation, and show the superiority of our methods through extensive experimental evaluation and user studies. In the first task, we address the challenges associated with learning to lipread. Lipreading is a primary mode of communication for people suffering from some form of hearing loss. Therefore, learning to lipread is an important skill for hard-of-hearing people. However, learning to lipread is not an easy task, and finding resources to improve one’s lipreading skills can be challenging. Existing lipreading training websites, which provide basic online resources to improve lipreading skills, are unfortunately limited by real-world variations in the talking faces, cover only a limited vocabulary, and are available in a few select languages and accents. This leaves the vast majority of users without access to adequate lipreading training resources. 
To address this challenge, we propose an end-to-end pipeline to develop an online lipreading training (LRT) platform using state-of-the-art talking-head video generator networks, text-to-speech models, and computer vision techniques, to increase the amount of online content on LRT platforms in an automated and cost-effective manner. We show that incorporating existing talking-head generator networks for the task of lipreading is not trivial and requires careful adaptation. For instance, we develop an audio-video alignment module that aligns the speech utterance on the region with the mouth movements and adds silence around the aligned utterance. Such modifications are necessary to generate realistic-looking videos that don’t cause distress to the lipreaders. We also design carefully thought-out lipreading training exercises, conduct extensive user studies, and perform statistical analysis to show the effectiveness of the generated content in replacing the manually recorded lipreading training videos. In the second problem, we address challenges in the entertainment industry. Body doubles play an indispensable role in the moviemaking industry. They take the place of actors in dangerous stunt scenes and in scenes where the same actor plays multiple characters. In all these scenes, the double’s face is later replaced by the actor’s face and expressions using CGI technology, requiring hundreds of hours of manual multimedia edits on heavy graphical units, costing millions of dollars, and taking months to complete. As we show in this thesis, automated face-swapping approaches based on deep learning models are not suitable for the task of actor-double face-swapping, as they fail to preserve the actor’s expressions. To address this, we introduce “video-to-video (V2V) face-swapping”, a novel task of face-swapping that aims to (1) swap the identity and expressions of a source face video, and (2) retain the pose and background of the target face video. 
Our key technical contribution lies in i) devising a self-supervised training strategy, which uses a single video as both the source and the target, introduces pseudo motion errors on the source video, and trains the network to fix these pseudo errors and regenerate the source video; and ii) building temporal autoencoding models inspired by VQVAE-2 that take two different motions as input and produce a third, coherent output motion. In summary, this thesis unravels several tasks enabled by synthetic talking-head generation and provides solutions for the lipreading community and the moviemaking industry. Our findings concretely point toward the notion of replacing real human talking-head videos with synthetically generated videos, thereby scaling digital content creation to new heights, saving precious time and resources, and easing the life of humans.
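The abstract mentions an audio-video alignment module that places the speech utterance where the mouth moves and pads it with silence. A minimal sketch of the padding part is given below, assuming the aligned start time is already known; the helper `place_utterance`, the toy waveform, and the 4 Hz sample rate are hypothetical and chosen only to keep the numbers easy to follow. It does not reproduce the thesis module.

```python
def place_utterance(samples, sample_rate, start_s, total_s):
    """Place an utterance at start_s inside a silent track of total_s seconds.

    Silence is represented by 0.0 samples. The start time is assumed to come
    from an upstream alignment step that is not shown here.
    """
    total = int(round(total_s * sample_rate))
    start = int(round(start_s * sample_rate))
    if start < 0 or start + len(samples) > total:
        raise ValueError("utterance does not fit inside the clip")
    lead = [0.0] * start                           # silence before speech
    tail = [0.0] * (total - start - len(samples))  # silence after speech
    return lead + list(samples) + tail


# Toy waveform of four samples, placed 0.5 s into a 2 s clip at 4 samples/s.
utterance = [0.2, -0.1, 0.3, 0.05]
track = place_utterance(utterance, sample_rate=4, start_s=0.5, total_s=2.0)
```

The resulting track has two silent samples, then the utterance, then two more silent samples, so the speech only sounds while the generated mouth is expected to move.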

    Year of completion:  October 2023
     Advisor : C V Jawahar, Vinay P Namboodiri

      Advancing Domain Generalization through Cross-Domain Class-Contrastive Learning and Addressing Data Imbalances


      Saransh Dave

      Abstract

      This thesis delves into the critical field of Domain Generalization (DG) in machine learning, where models are trained on multiple source distributions with the objective of generalizing to unseen target distributions. We begin by dissecting various facets of DG, including distribution shifts, shortcut learning, representation learning, and data imbalances. This foundational investigation sets the stage for understanding the challenges associated with DG and the complexities that arise. A comprehensive literature review is conducted, highlighting existing challenges and contextualizing our contributions to the field. The review encompasses learning invariant features, parameter sharing techniques, meta-learning techniques, and data augmentation approaches. One of the key contributions of this thesis is the examination of the role low-dimensional representations play in enhancing DG performance. We introduce a method to compute the implicit dimensionality of latent representations, exploring its correlation with performance in a domain generalization context. This essential finding motivated us to further investigate the effects of low-dimensional representations. Building on these insights, we present Cross-Domain Class-Contrastive Learning (CDCC), a technique that learns sparse representations in the latent space, resulting in lower-dimensional representations and improved domain generalization performance. CDCC establishes competitive results on various DG benchmarks, comparing favorably with numerous existing approaches in DomainBed. Venturing beyond traditional DG, we discuss a series of experiments conducted for domain generalization in long-tailed settings, which are common in real-world applications. Additionally, we present supplementary experiments yielding intriguing findings. 
Our analysis reveals that the CDCC approach exhibits greater robustness in long-tailed distributions and that the order of performances across test domains remains unaffected by the order of training domains in the long-tailed setting. This section aims to inspire researchers to further probe the outcomes of these experiments and advance the understanding of domain generalization. In conclusion, this thesis offers a well-rounded exploration of DG by combining a comprehensive literature review, the discovery of the importance of low-dimensional representations in DG, the development of the CDCC method, and the meticulous analysis of long-tailed settings and other experimental findings.
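The abstract does not say how the implicit dimensionality of latent representations is computed. One common proxy, used here purely as an assumption, is the participation ratio of the feature covariance, (tr C)^2 / tr(C^2), which equals 1 when all variance lies along one direction and equals d when variance is spread evenly over d directions. The helpers `covariance` and `participation_ratio` and the toy features are hypothetical.

```python
def covariance(rows):
    """Sample covariance matrix of a list of equal-length feature vectors."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def participation_ratio(rows):
    """Effective dimensionality proxy: (tr C)^2 / tr(C^2).

    For a symmetric matrix C, tr(C^2) is the sum of its squared entries,
    so no eigendecomposition is needed.
    """
    cov = covariance(rows)
    trace = sum(cov[i][i] for i in range(len(cov)))
    trace_sq = sum(v * v for row in cov for v in row)
    return trace * trace / trace_sq


# Toy 2-D "latent" features: one set varies along a single axis,
# the other varies equally along both axes.
line_features = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
spread_features = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

dim_line = participation_ratio(line_features)
dim_spread = participation_ratio(spread_features)
```

Under this proxy the first set has an effective dimensionality of about 1 and the second about 2, which is the kind of quantity one could correlate with generalization performance.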

      Year of completion:  October 2023
       Advisor : Vineet Gandhi

        Real-Time Video Processing for Dynamic Content Creation


        Sudheer Achary

        Abstract

        Autonomous camera systems are vital in capturing dynamic events and creating engaging videos. However, existing filtering techniques used to stabilize and smooth camera trajectories often fail to replicate the natural behavior of human camera operators. To address these challenges, our work proposes novel approaches for real-time camera trajectory optimization and gaze-guided video editing. We introduce two online filtering methods: CineConvex and CineCNN. CineConvex utilizes a sliding window-based convex optimization formulation, while CineCNN employs a convolutional neural network as an encoder-decoder model. Both methods are motivated by cinematographic principles, producing smooth and natural camera trajectories. Evaluation on basketball and stage performance datasets demonstrates superior performance over previous methods and baselines, both quantitatively and qualitatively. With a minor latency of half a second, CineConvex operates at approximately 250 frames per second (fps), while CineCNN achieves an impressive speed of 1000 fps, making them highly suitable for real-time applications. In the realm of video editing, we present Real Time GAZED, a real-time adaptation of the GAZED framework. It enables users to create professionally edited videos in real-time. Comparative evaluations against baseline methods, including the non-real-time GAZED, demonstrate that Real Time GAZED achieves similar editing results, ensuring high-quality video output. Furthermore, a user study confirms the aesthetic quality of the video edits produced by Real Time GAZED. With the advancements in real-time camera trajectory optimization and video editing presented, the demand for immediate and dynamic content creation in industries such as live broadcasting, sports coverage, news reporting, and social media content creation can be met more efficiently. 
The elimination of time-consuming post-production processes and the ability to deliver high-quality videos in today’s fast-paced digital landscape are the key advantages offered by these real-time approaches.
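To make the latency and throughput figures concrete, the sketch below shows how a fixed look-ahead turns into a frame delay and what per-frame time budget each reported speed implies. The smoothing function is a deliberately simple stand-in: a centered window mean, which is the minimizer of a small least-squares (convex) problem, and not the CineConvex formulation. The 30 fps video rate, the helper names, and the toy track are assumptions.

```python
def lookahead_frames(latency_s, video_fps):
    """Number of future frames an online filter may wait for."""
    return int(round(latency_s * video_fps))


def per_frame_budget_ms(throughput_fps):
    """Processing time available per frame at a given throughput."""
    return 1000.0 / throughput_fps


def online_smooth(positions, lookahead):
    """Smooth a 1-D camera track with a centered sliding window.

    Output for frame t can only be emitted once frame t + lookahead has
    arrived, which is the source of the fixed latency. The window mean
    minimizes sum((x_i - c) ** 2) over the window; it is a stand-in for a
    real convex objective, not the thesis method.
    """
    smoothed = []
    for t in range(len(positions)):
        lo = max(0, t - lookahead)
        hi = min(len(positions), t + lookahead + 1)
        window = positions[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


frames_delay = lookahead_frames(0.5, 30)      # assumed 30 fps video
budget_convex = per_frame_budget_ms(250)      # ms per frame at 250 fps
budget_cnn = per_frame_budget_ms(1000)        # ms per frame at 1000 fps
track = online_smooth([0.0, 0.0, 10.0, 0.0, 0.0], lookahead=1)
```

At an assumed 30 fps, half a second of latency corresponds to 15 frames of look-ahead, and throughputs of 250 fps and 1000 fps leave 4 ms and 1 ms per frame respectively.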

        Year of completion:  November 2023
         Advisor : Vineet Gandhi

          Nerve Block Target Localization and Needle Guidance for Autonomous Robotic Ultrasound Guided Regional Anesthesia


          Abhishek Tyagi

          Abstract

          Ultrasound guided regional anesthesia (UGRA) involves approaching target nerves through a needle in real-time, enabling precise deposition of drug with increased success rates and fewer complications. Development of autonomous robotic systems capable of administering UGRA is desirable for remote settings and localities where anesthesiologists are unavailable. Real-time segmentation of nerves, needle tip localization, and needle trajectory extrapolation are required for developing such a system. In the first part of this thesis, we developed models to localize nerves in the ultrasound domain using a large dataset. Our prospective study enrolled 227 subjects who were systematically scanned for brachial plexus nerves in various settings using three different ultrasound machines to create a dataset of 227 unique videos. In total, 41,000 video frames were annotated by experienced anesthesiologists using partial automation with object tracking and active contour algorithms. Four baseline neural network models were trained on the dataset and their performance was evaluated for object detection and segmentation tasks. Generalizability of the best suited model was then tested on the datasets constructed from separate ultrasound scanners with and without fine-tuning. The results demonstrate that deep learning models can be leveraged for real-time segmentation of brachial plexus in neck ultrasonography videos with high accuracy and reliability. Using these nerve segmentation predictions, we define automated anesthesia needle targets by fitting an ellipse to the nerve contours. The second part of this thesis focuses on localization of the needles and development of a framework to guide the needles toward their targets. For the segmentation of the needle, a neural network pre-trained on natural RGB images is first fine-tuned on a large ultrasound dataset for domain transfer and then adapted for the needle using a small dataset. 
The segmented needle’s trajectory angle is calculated using the Radon transform, and the trajectory is extrapolated from the needle tip. The intersection of the extrapolated trajectory with the needle target guides the needle navigation for drug delivery. The needle trajectory’s average angle error was 2°, the average error in the trajectory’s distance from the center of the image was 10 pixels (2 mm), and the average error in needle tip position was 19 pixels (3.8 mm), which is within the acceptable range of 5 mm as per experienced anesthesiologists. The entire dataset has been released publicly for further study by the research community.
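The guidance step, extrapolating a trajectory from the needle tip and intersecting it with the elliptical target, can be sketched with a ray-ellipse intersection. The sketch assumes an axis-aligned ellipse, a needle angle measured from the image x-axis, and made-up coordinates; `first_hit` and `pixels_to_mm` are hypothetical helpers. The 0.2 mm per pixel scale is only what the reported figures imply (10 pixels = 2 mm, 19 pixels = 3.8 mm) and may not hold for other scanner settings.

```python
import math

MM_PER_PIXEL = 0.2  # implied by the abstract: 10 px = 2 mm, 19 px = 3.8 mm


def pixels_to_mm(pixels):
    """Convert a pixel distance to millimetres at the implied scale."""
    return pixels * MM_PER_PIXEL


def first_hit(tip, angle_deg, center, semi_axes):
    """First point where a ray from the needle tip meets the target ellipse.

    The ray is tip + t * (cos(angle), sin(angle)) for t >= 0. The ellipse is
    assumed axis-aligned with the given center and semi-axes (a, b).
    Returns None when the extrapolated trajectory misses the target.
    """
    dx = math.cos(math.radians(angle_deg))
    dy = math.sin(math.radians(angle_deg))
    a, b = semi_axes
    # Scale coordinates so the ellipse becomes a unit circle at the origin.
    u = (tip[0] - center[0]) / a
    v = (tip[1] - center[1]) / b
    du, dv = dx / a, dy / b
    qa = du * du + dv * dv
    qb = 2.0 * (u * du + v * dv)
    qc = u * u + v * v - 1.0
    disc = qb * qb - 4.0 * qa * qc
    if disc < 0:
        return None
    root = math.sqrt(disc)
    for t in sorted(((-qb - root) / (2 * qa), (-qb + root) / (2 * qa))):
        if t >= 0:
            return (tip[0] + t * dx, tip[1] + t * dy)
    return None


# Made-up geometry in pixel coordinates.
hit = first_hit(tip=(0.0, 0.0), angle_deg=0.0,
                center=(100.0, 0.0), semi_axes=(20.0, 10.0))
miss = first_hit(tip=(0.0, 0.0), angle_deg=90.0,
                 center=(100.0, 0.0), semi_axes=(20.0, 10.0))
tip_error_mm = pixels_to_mm(19)
```

In this toy setup the horizontal trajectory reaches the target boundary 80 pixels from the tip, the vertical one misses it, and the reported 19-pixel tip error converts to about 3.8 mm, inside the 5 mm tolerance quoted in the abstract.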

          Year of completion:  November 2023
           Advisor : Jayanthi Sivaswamy

            More Articles …

            1. Data exploration, Playing styles, and Gameplay for Cooperative Partially Observable games: Pictionary as a case study
            2. Security from uncertainty: Designing privacy-preserving verification methods using Noise
            3. Towards Enhancing Semantic Segmentation in Resource Constrained Settings
            4. High-Quality 3D Fingerprint Generation: Merging Skin Optics, Machine Learning and 3D Reconstruction Techniques