Efficient Multimodal Video Representation Learning Through Language
Darshan Singh S
Abstract
This work presents several contributions to video representation learning and related multimodal tasks, addressing key challenges in dataset creation, data-efficient adaptation of pre-trained models, and compositional and fine-grained visual understanding.

Despite the rapid growth of online lecture videos in recent years, video-language research has focused primarily on instructional videos and movies, leaving a scarcity of specialized datasets for educational lecture videos. To address this gap, we first introduce AV Lectures, a large-scale dataset of STEM lecture videos. It comprises 86 courses with over 2,350 lectures covering a wide range of STEM subjects. Each course contains video lectures, transcripts, and OCR for lecture frames, and, where available, lecture notes, slides, assignments, and related educational material that can inspire a variety of tasks. Building on this dataset, we propose a novel unsupervised temporal segmentation task that divides lecture videos into bite-sized topics. We show that multimodal cues can be effectively exploited to learn lecture-aware representations for this task, enabling a richer analysis of educational content.

We then address the inefficiency of adapting pre-trained models such as CLIP to video. Existing methods typically rely on large-scale, sparsely annotated video-caption datasets, making adaptation slow and data-intensive. We propose SRL-CLIP, a novel approach that leverages the rich, structured semantic information in Semantic Role Labels (SRLs) for highly efficient adaptation. We adapt on VidSitu, as it provides dense SRL annotations that holistically describe an entire video. SRL-CLIP achieves comparable or superior performance on a range of video understanding benchmarks (zero-shot retrieval, situation recognition, dense video captioning, and localization) against state-of-the-art models that have 4–8× more parameters and are post-pretrained on up to 4,000× more data.

To probe models' understanding of visual content further, we introduce three novel benchmarks. First, VELOCITI evaluates the compositional reasoning abilities of video-language models, focusing on their ability to bind semantic concepts through time. Second, NDLB is a framework for improving fine-grained image captioning that uses self-retrieval as a key component, accompanied by a new benchmark that tests whether a model captures subtle visual distinctions. Finally, D3 is a benchmark specifically designed to evaluate the fine-grained visual discrimination capabilities of MLLMs using self-retrieval, further pushing the boundaries of fine-grained visual understanding. Together, these contributions, which include novel datasets, efficient training recipes, and insightful benchmarks, advance the state of the art in multimodal and video representation learning.
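Self-retrieval, which underlies both NDLB and D3, scores a generated caption by whether it can retrieve its own source image from a pool of highly similar distractors. The sketch below is a minimal illustration of that idea only, assuming CLIP-style image and caption embeddings are already computed; the function name and tensor layout are illustrative assumptions, not the thesis implementation.

```python
# Illustrative self-retrieval scoring sketch (not the thesis code).
# Assumes N similar images and one generated caption per image, both embedded
# into a shared space (e.g., by CLIP's image and text towers).
import torch

def self_retrieval_accuracy(image_embs: torch.Tensor,
                            caption_embs: torch.Tensor) -> float:
    """
    image_embs:   (N, D) embeddings of N visually similar images
    caption_embs: (N, D) embeddings of the caption generated for each image
    Returns the fraction of captions whose most similar image is their own source.
    """
    image_embs = torch.nn.functional.normalize(image_embs, dim=-1)
    caption_embs = torch.nn.functional.normalize(caption_embs, dim=-1)
    sims = caption_embs @ image_embs.T        # (N, N) caption-to-image similarities
    predicted = sims.argmax(dim=-1)           # best-matching image per caption
    targets = torch.arange(image_embs.size(0))
    return (predicted == targets).float().mean().item()
```

A caption that only describes coarse content ties with the distractors, so a high score requires the fine-grained, discriminative detail that these benchmarks are designed to measure.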
Year of completion: December 2024
Advisor: Jawahar C V