Efficient Multimodal Video Representation Learning Through Language


Darshan Singh S

Abstract

This work presents several contributions to video representation learning and related multimodal tasks, addressing key challenges in datasets, efficient model adaptation using less data, and compositional and fine-grained visual understanding. Despite the rapid growth of online lecture videos in the past several years, video-language research has primarily focused on instructional videos and movies, resulting in a scarcity of specialized datasets for educational lecture videos. To address this, we first introduce AV Lectures, a large-scale dataset of STEM lecture videos. It consists of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. We then propose a novel unsupervised temporal segmentation task to segment lecture videos into bite-sized topics. We show that multimodal cues can be effectively utilized to learn lecture-aware representations for this task, facilitating a richer analysis of educational content. Next, we address the inefficiency of adapting pre-trained models like CLIP to videos. Existing methods typically rely on large-scale, sparsely annotated video caption datasets, resulting in slow and data-intensive adaptation. We propose SRL-CLIP, a novel approach that leverages the rich, structured semantic information within Semantic Role Labels (SRLs) for highly efficient adaptation. We use VidSitu for adaptation as it provides dense SRL annotations that holistically represent the entire video. SRL-CLIP achieves comparable or superior performance on various video understanding benchmarks (zero-shot retrieval, situation recognition, dense video captioning, and localization) compared to state-of-the-art models that possess 4–8× more parameters and are post-pretrained on up to 4000× more data. 
To further explore the models’ understanding of visual content, we introduce three novel benchmarks. First, VELOCITI evaluates the compositional reasoning abilities of video-language models, focusing on their ability to bind semantic concepts through time. Second, we introduce NDLB, a framework aimed at improving fine-grained image captioning, which uses self-retrieval as a key component along with a new benchmark to check if the model can capture subtle visual distinctions. Finally, we introduce D3, a benchmark specifically designed to evaluate the fine-grained visual discrimination capabilities of MLLMs using self-retrieval, further pushing the boundaries of fine-grained visual understanding. These contributions, which include novel datasets, efficient training recipes, and insightful benchmarks, collectively advance the state of the art in multimodal and video representation learning.
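The self-retrieval idea mentioned for NDLB and D3 can be illustrated with a minimal sketch. It assumes that each generated caption and each image has already been embedded into a shared space by some encoder, and that success is measured as recall@1 under cosine similarity; the abstract does not specify the metric, so both choices and all names below (`cosine`, `self_retrieval_recall_at_1`, the toy embeddings) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def self_retrieval_recall_at_1(caption_embs, image_embs):
    """Fraction of captions whose best-matching image is the one they describe.

    caption_embs[i] is assumed to be the embedding of the caption generated
    for image i, so a hit means the caption retrieves its own source image.
    """
    hits = 0
    for i, cap in enumerate(caption_embs):
        scores = [cosine(cap, img) for img in image_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(caption_embs)


# Toy embeddings (assumed, not from the thesis): the third caption is too
# generic and ends up closer to the first image than to its own.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
caption_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.1, 0.5]]
recall = self_retrieval_recall_at_1(caption_embs, image_embs)
print(f"self-retrieval recall@1: {recall:.2f}")
```

A caption that misses subtle distinctions between similar images fails to retrieve its own image, which is why this kind of score can serve as a proxy for fine-grained description quality.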

 

Year of completion:  December 2024
 Advisor : Jawahar C V


    Weakly Supervised and Deep Learning Methods for Histopathological Image Classification in Neurological and Renal Disorders


    R Anirudh Reddy

    Abstract

    The analysis of digital histopathology slides or Whole Slide Images (WSIs) is critical for several diagnoses. Recent advancements in computational techniques, particularly in the field of digital pathology, have shown promise in automating the classification process. Whole Slide Imaging (WSI), combined with deep learning and modern computer vision techniques, has emerged as a powerful tool in this domain. This thesis addresses two major medical challenges using deep learning and computer vision techniques: the classification of Lupus Nephritis (LN) and low-grade gliomas into their respective subtypes. Systemic lupus erythematosus (SLE) is an autoimmune disease wherein the patient’s immune system attacks healthy tissues, leading to Lupus Nephritis (LN), a severe condition causing renal failure. Traditional methods for diagnosing LN require meticulous pathological assessment of renal biopsies, which is time-consuming. In the first architecture (chapter 3), we propose a novel pipeline that automates this process by: 1) detecting various glomerular patterns in WSIs using Periodic Acid-Schiff (PAS) stained images, and 2) classifying each image based on these extracted glomerular features. This approach leverages deep learning to improve the accuracy and efficiency of LN classification. Low-grade glioma, a type of brain tumor originating from glial cells, also presents significant diagnostic challenges due to the large size and complexity of WSIs. In the second architecture (chapter 4), our work involves the classification of low-grade gliomas into Astrocytoma and Oligodendroglioma. Given the computational infeasibility of training deep learning models on gigapixel images, we adopt a weakly supervised method to extract discriminative patches from WSIs, which represent the tumor regions. A Convolutional Neural Network (CNN) is then trained on these discriminative patches, and the results are aggregated to determine the WSI label. 
Evaluated on a dataset of 581,616 patches from 286 WSIs obtained from The Cancer Genome Atlas (TCGA) portal, our method achieved a slide-wise accuracy of 79.31%, which increased to 89.65% when trained only on discriminative patches. The methodologies presented in this thesis not only demonstrate significant improvements in classification accuracy but also offer scalable and efficient solutions for enhancing the diagnostic processes in pathology, ultimately contributing to better patient outcomes and more efficient healthcare delivery.
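The step of aggregating patch-level predictions into a slide label, and the slide-wise accuracy computed from it, can be sketched as follows. The abstract does not say which aggregation rule is used, so majority voting is an assumption here, and the function names and toy predictions are hypothetical.

```python
from collections import Counter


def aggregate_slide_label(patch_labels):
    """Majority vote over the predicted labels of one slide's patches.

    Majority voting is an assumed aggregation rule, not necessarily the
    one used in the thesis.
    """
    if not patch_labels:
        raise ValueError("a slide needs at least one patch prediction")
    return Counter(patch_labels).most_common(1)[0][0]


def slide_wise_accuracy(patch_labels_per_slide, slide_labels):
    """Fraction of slides whose aggregated label matches the ground truth."""
    correct = sum(
        aggregate_slide_label(patches) == truth
        for patches, truth in zip(patch_labels_per_slide, slide_labels)
    )
    return correct / len(slide_labels)


# Toy CNN outputs for three slides (made-up values for illustration only).
patch_predictions = [
    ["astrocytoma", "astrocytoma", "oligodendroglioma"],
    ["oligodendroglioma", "oligodendroglioma", "oligodendroglioma", "astrocytoma"],
    ["astrocytoma", "oligodendroglioma", "oligodendroglioma"],
]
slide_labels = ["astrocytoma", "oligodendroglioma", "astrocytoma"]
accuracy = slide_wise_accuracy(patch_predictions, slide_labels)
print(f"slide-wise accuracy: {accuracy:.2%}")
```

Under a rule like this, non-discriminative patches add noisy votes to each slide, which is one plausible reading of why restricting training to discriminative patches helps.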

    Year of completion:  December 2024
     Advisor : Jawahar C V


      Editing Neural Radiance Fields


      Rahul Goel

      Abstract

      Neural Radiance Fields (NeRFs) have emerged as a pivotal advancement in computer graphics and vision. They provide a framework for rendering highly detailed novel view images from sparse multi-view input data. NeRFs use a continuous function to represent scenes that can be estimated using neural networks. This approach enables the generation of photorealistic images for static scenes. Outside the domain of image synthesis, NeRFs have been widely adopted as a representation for several downstream tasks, including but not limited to scene understanding, augmented reality, scene navigation, segmentation, and 3D asset generation. In this thesis, we explore the segmentation and editing capabilities of radiance fields. We propose a fast style transfer method that leverages multi-view consistent generation of stylized priors to change the appearance vectors in a Tensorial Radiance Field. Our method promises a speed-up of several orders of magnitude in applying style transfer and adheres to the color scheme of the style image better than previous works. Next, we tackle the task of segmentation in radiance fields. Our method uses a grid-based feature field which allows extremely fast feature querying and searching. Combined with our stroke-based segmentation, this allows the user to interactively segment objects in a captured radiance field. We improve the state-of-the-art in terms of segmentation quality by a huge margin and in terms of segmentation time by orders of magnitude. Our method enables basic editing capabilities like translation, appearance editing, removal, and composition for which we show preliminary results. We further explore the problem of composition of radiance fields. Composition of two radiance fields using ray marching requires twice the amount of memory and compute. We use distillation to fuse multiple radiance fields into one to circumvent this problem. 
Our distillation process is roughly thrice as fast as re-training and produces a unified representation for radiance fields.
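The cost of composing fields at render time can be seen in a minimal ray-marching sketch. It assumes the common convention of summing densities and blending colors in proportion to density at each sample; the abstract does not state the exact compositing rule, and `field_a`, `field_b`, and `render_composed_ray` are hypothetical toy stand-ins rather than the thesis implementation.

```python
import math


def field_a(t):
    """Toy field: a dense red slab between depths 1 and 2 along the ray."""
    return (5.0, (1.0, 0.0, 0.0)) if 1.0 <= t < 2.0 else (0.0, (0.0, 0.0, 0.0))


def field_b(t):
    """Toy field: a dense blue slab between depths 3 and 4 along the ray."""
    return (5.0, (0.0, 0.0, 1.0)) if 3.0 <= t < 4.0 else (0.0, (0.0, 0.0, 0.0))


def render_composed_ray(fields, t_near, t_far, n_samples):
    """March one ray through several fields and alpha-composite the result.

    Returns the accumulated RGB color and the number of field queries made.
    """
    delta = (t_far - t_near) / n_samples
    transmittance = 1.0
    color = [0.0, 0.0, 0.0]
    queries = 0
    for i in range(n_samples):
        t = t_near + (i + 0.5) * delta  # midpoint of the i-th interval
        sigma_total = 0.0
        weighted_rgb = [0.0, 0.0, 0.0]
        for field in fields:  # every field is queried at every sample
            sigma, rgb = field(t)
            queries += 1
            sigma_total += sigma
            for k in range(3):
                weighted_rgb[k] += sigma * rgb[k]
        if sigma_total > 0.0:
            alpha = 1.0 - math.exp(-sigma_total * delta)
            weight = transmittance * alpha
            for k in range(3):
                color[k] += weight * weighted_rgb[k] / sigma_total
            transmittance *= 1.0 - alpha
    return tuple(color), queries


composed_color, composed_queries = render_composed_ray([field_a, field_b], 0.0, 5.0, 64)
_, single_queries = render_composed_ray([field_a], 0.0, 5.0, 64)
print(composed_color, composed_queries, single_queries)
```

Both fields must stay in memory and be evaluated at every sample, so the query count doubles for two fields; a single distilled field would bring it back to one query per sample.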

      Year of completion:  April 2024
       Advisor : P J Narayanan


        Neural Fields for Hand-object Interactions


        Chandradeep Pokhariya

        Abstract

        The hand is the most commonly used body part for interacting with our three-dimensional world. While it may seem ordinary, replicating hand movements with robots or in virtual/augmented reality is highly complex. Research on how hands interact with objects is crucial for advancing robotics, virtual reality, and human-computer interaction. Understanding hand movements and manipulation is critical to creating more intuitive and responsive technologies, which can significantly improve accuracy, efficiency, and scalability in various industries. Despite extensive research, programming robots to mimic human-hand interactions remains a challenging goal. One of the biggest challenges is collecting accurate 3D data for hand-object grasping. This process is complicated by the hand’s flexibility and by the way hands and objects occlude each other in grasping poses. Collecting such data often requires expensive and sophisticated setups. Recently, however, neural fields [1] have emerged, which can model 3D scenes using only multi-view images or videos. Neural fields use a continuous neural function to represent 3D scenes without needing 3D ground truth data, relying instead on differentiable rendering and multi-view photometric loss. With growing interest, these methods are becoming faster, more efficient, and better at modeling complex scenes. This thesis explores how neural fields can address two specific subproblems in hand-object interaction research. The first problem is generating novel grasps, which means predicting the final grasp pose of a hand based on its initial position and the object’s shape and location. The challenge is creating a generative model that can predict accurate grasp poses using only multi-view videos without 3D ground truth data. To solve this, we developed RealGrasper, a generative model that learns to predict grasp poses from multi-view data using photometric loss and other regularizations. 
The second problem is accurately capturing grasp poses and extracting contact points from multi-view videos. Current methods use the MANO model [2], which approximates hand shapes but lacks the details for precise contacts. Additionally, there is no easy way to get ground truth data for evaluating contact quality. To address this, we propose MANUS, a method for markerless grasp capture using articulated 3D Gaussians that reconstructs high-fidelity hand models from multi-view videos. We also created a large dataset, MANUS-Grasps, which includes multi-view videos of three subjects grasping over 30 objects. Furthermore, we developed a new way to capture and evaluate contacts, providing a contact metric for better assessment. We thoroughly evaluated our methods through detailed experiments, ablations, and comparisons, demonstrating that our approach outperforms existing state-of-the-art methods. We also summarize our contributions and discuss potential future directions in this field. We believe this thesis will help advance the research community further.

        Year of completion:  June 2024
         Advisors : Avinash Sharma, Srinath Sridhar


          Vulnerability of Neural Network based Speaker Recognition Systems


          Ritu Srivastava

          Abstract

          Speaker recognition (SR) involves automatic identification of individual speakers based on their voices, often representing acoustic traits as fixed-dimensional vectors through speaker embedding. A standard speaker recognition system (SRS) consists of three key phases: training, enrollment, and recognition. In each stage, an acoustic feature extraction module extracts the essential acoustic characteristics from raw speech signals. Commonly used acoustic features include speech spectrograms, filter banks, and Mel-frequency cepstral coefficients. During the training stage, a background model is trained to establish a mapping from training voices to embeddings. The traditional background model employs a Gaussian Mixture Model (GMM) to generate identity-vector (i-vector) embeddings. In contrast, more recent and promising background models leverage deep neural networks (DNNs) to generate deep embeddings, such as x-vectors. In the enrollment stage, a voice spoken by an individual undergoing enrollment is mapped to an enrollment embedding using the previously trained background model. In the recognition stage, the process begins by retrieving the testing embedding of a given voice from the background model. Subsequently, the scoring module measures the similarity between the enrollment and testing embeddings. Following the assessment, the decision module makes a decision based on the similarity score. A decision threshold is established, which serves as a criterion to determine whether the claimed identity of the speaker is accepted or rejected. The concept of voiceprint is rapidly gaining prominence as one of the emerging biometrics, primarily owing to its seamless integration with natural and human-centered Voice User Interface (VUI). 
The fast progress of Speaker Recognition Systems (SRSs) is intricately linked to the evolution of Neural Networks (NNs), with a particular emphasis on Deep Neural Networks (DNNs). With strides made in deep learning, Speaker Recognition (SR) has also benefitted and found extensive applications across hardware and software platforms. However, it has been shown that NNs are vulnerable to adversarial attacks. Thus, even though users have the convenience of authentication with Speaker Recognition services, these solutions face security threats that raise significant concerns about user privacy. Adversarial attacks were first demonstrated on images, where an image classification model was successfully deceived using adversarial examples. Drawing inspiration from the progress made in adversarial attacks within the image domain, there is a growing interest in extending these techniques to the audio field. Convolutional neural networks have demonstrated instability to artificially crafted perturbations that remain undetectable to the human eye. Virtually every type of model, ranging from CNNs to graph neural networks (GNNs), has shown vulnerability to adversarial examples, particularly in the domain of image classification. Deep learning models typically get audio input by converting the audio into a spectrogram for further processing. A spectrogram serves as a condensed representation of an audio input. Given its image-like nature, the audio spectrogram is frequently used as input data for deep learning models, especially Convolutional Neural Networks (CNNs) adapted for audio tasks. CNN-based architectures were initially designed for image processing. 
This thesis contributes to the assessment of Convolutional Neural Networks (CNNs) for their resilience against adversarial attacks, a domain that is yet to be extensively investigated concerning end-to-end trained CNNs for speaker recognition. This examination is essential for sustaining the integrity and security of speaker recognition systems. Our study fills this gap by exploring the variations of iterative Fast Gradient Sign Method (FGSM) to carry out adversarial attacks. We note that using a vanilla iterative FGSM technique can alter the identity of each speaker sample to any other speaker within the LibriSpeech dataset. Additionally, we introduce adversarial attacks specific to Mel spectrogram features by (a) constraining the number of manipulated pixels, (b) confining alterations to certain frequency bands, (c) limiting changes to particular time segments, and (d) employing a substitute model to generate the adversarial sample. Through comprehensive qualitative and quantitative analyses, we illustrate the vulnerability and counterintuitive behavior of existing CNN-based speaker recognition systems, wherein the predicted speaker identities can be inverted without discernible alterations in the audio. The samples are available at "https://advdemo.github.io/speech/".
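A targeted iterative FGSM attack restricted to chosen frequency bands, as in constraint (b), can be sketched on a toy model. This is an illustration under stated assumptions, not the thesis code: a tiny linear softmax classifier stands in for the CNN, the "Mel spectrogram" is a flattened 2-band by 2-frame grid, and the weights `W`, the step size `alpha`, the budget `eps`, and all function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def predict(W, x):
    """Speaker posteriors of a toy linear classifier (stand-in for a CNN)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])


def sign(v):
    return (v > 0) - (v < 0)


def targeted_iterative_fgsm(W, x, target, mask, eps, alpha, steps):
    """Push x toward the target speaker, changing only features where mask is 1.

    Each step moves against the sign of the cross-entropy gradient for the
    target class, then clips the result to within eps of the original input.
    """
    x_adv = list(x)
    n_classes = len(W)
    for _ in range(steps):
        probs = predict(W, x_adv)
        # For a linear model, dLoss/dx = W^T (probs - onehot(target)).
        err = [probs[c] - (1.0 if c == target else 0.0) for c in range(n_classes)]
        grad = [sum(err[c] * W[c][j] for c in range(n_classes)) for j in range(len(x))]
        for j in range(len(x)):
            if mask[j]:
                stepped = x_adv[j] - alpha * sign(grad[j])
                x_adv[j] = min(max(stepped, x[j] - eps), x[j] + eps)
    return x_adv


# Features are laid out as [band0_frame0, band0_frame1, band1_frame0, band1_frame1].
W = [[1.0, 1.0, 0.5, 0.5],   # speaker 0 relies more on band 0
     [0.5, 0.5, 1.0, 1.0]]   # speaker 1 relies more on band 1
x = [1.0, 1.0, 0.8, 0.8]
band_mask = [0, 0, 1, 1]      # only band 1 may be altered
x_adv = targeted_iterative_fgsm(W, x, target=1, mask=band_mask, eps=0.3, alpha=0.05, steps=10)

before = predict(W, x)
after = predict(W, x_adv)
print("before:", before.index(max(before)), "after:", after.index(max(after)))
```

The same masking mechanism covers constraints (a) and (c) by choosing which entries of the mask are set: a handful of individual pixels, or all bands within selected time frames.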

          Year of completion:  June 2024
           Advisor : Vineet Gandhi
