Revolutionizing TV Show Experience: Using Recaps for Multimodal Story Summarization

Aditya Kumar Singh

Abstract

We introduce a novel approach to multimodal story summarization that leverages TV episode recaps to create concise summaries of complex storylines. These recaps, short video sequences combining key visual moments and dialogues from previous episodes, serve as a valuable source of weak supervision for labeling the summarization task. To facilitate this approach, we introduce the PlotSnap dataset, which focuses on two crime thriller TV shows. Each episode in this dataset is over 40 minutes long and is accompanied by rich recaps. These recaps are mapped to corresponding sub-stories, providing labels for the story summarization task. Our proposed model, TaleSumm, operates hierarchically. (i) First, it processes entire episodes by generating compact representations of shots and dialogues. (ii) Then, it predicts an importance score for each video shot and dialogue utterance, taking into account interactions between local story groups. Unlike traditional summarization tasks, our method extracts multiple plot points from long-form videos. We conduct a comprehensive evaluation of our approach, including its cross-series generalization. TaleSumm demonstrates promising results, not only on video summarization benchmarks but also in summarizing the intricate storylines of the TV shows in the PlotSnap dataset. Our project implementation, dataset features, and demo can be found at https://github.com/katha-ai/RecapStorySumm-CVPR2024.
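As an illustration of the final extractive step only, the sketch below shows how per-shot and per-utterance importance scores might be turned into a summary. It is not the TaleSumm implementation: the function name `select_summary`, the 0.5 threshold, and the example scores are all hypothetical, chosen purely to make the idea concrete.

```python
from typing import List, Sequence


def select_summary(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return indices, in temporal order, of items scoring at least `threshold`.

    Hypothetical helper: the threshold value is an assumption, not taken
    from the thesis.
    """
    return [i for i, score in enumerate(scores) if score >= threshold]


# Made-up importance scores for five video shots and three dialogue utterances.
shot_scores = [0.1, 0.8, 0.4, 0.9, 0.2]
utterance_scores = [0.7, 0.3, 0.6]

summary_shots = select_summary(shot_scores)
summary_utterances = select_summary(utterance_scores)
print(summary_shots, summary_utterances)
```

Because the indices are returned in their original order, the selected shots and utterances keep the chronology of the episode, and several separate plot points can be picked up from different parts of a long video.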

Year of completion: April 2024
Advisor: C V Jawahar