Text-based Video Question Answering

Soumya Shamarao Jahagirdar

Abstract

Imagine putting yourself in the shoes of a visually impaired person who wants to buy an item from a store, or of a person sitting at home watching the news on television who wants to know about the content being broadcast. Motivated by these and many similar situations, which call for systems capable of understanding and reasoning over textual content in videos, in this thesis we tackle the novel problem of text-based video question answering. Vision and language are broadly regarded as cornerstones of intelligence, though each has a different aim: language serves communication and the transmission of information, while vision serves to construct mental representations of the scene around us so that we can navigate and interact with objects. Studying the two fields jointly can lead to applications, tasks, and methods that go beyond what either achieves individually. This interdependency is studied in an emerging area named “multi-modal understanding”. Many tasks, such as image captioning, visual question answering, video question answering, and text-video retrieval, fall under the category of multi-modal understanding and reasoning. To obtain a system that can reason over both textual and temporal information, we propose a new task. The first portion of this thesis focuses on formulating the text-based VideoQA task: we first analyze existing datasets and works, and thereby arrive at the need for text-based VideoQA. To this end, we propose the NewsVideoQA dataset, in which the question-answer pairs are framed on the text present in news videos. As this is a newly proposed task, we experiment with existing methods such as text-only models, single-image scene-text-based models, and video question-answering models.
As these baseline methods were not originally designed for video question answering using the text in videos, a video question-answering model that takes this text into account to obtain answers was needed. We therefore repurpose an existing VideoQA model to incorporate OCR tokens, yielding OCR-aware SINGULARITY, a video question-answering framework that learns joint representations of videos and OCR tokens at the pretraining stage and also uses the OCR tokens at the finetuning stage. In the second portion of the thesis, we look into the M4-ViteVQA dataset, which addresses the same task of text-based video question answering but whose videos belong to multiple categories such as shopping, traveling, vlogging, and gaming. We perform an exploratory data analysis in which we examine both NewsVideoQA and M4-ViteVQA on several aspects to identify limitations in these datasets. Through this analysis, we show that most of the questions in both datasets can be answered just by reading the text present in the videos. We also observe that most of the questions can be answered using a single frame or a few frames of the video. We perform an exhaustive analysis of a text-only model, BERT-QA, which obtains results comparable to those of the multimodal methods. We also perform cross-domain experiments to check whether training and then finetuning on two different categories of videos helps performance on the target dataset. Finally, we provide some insights into creating a dataset and into how certain types of annotations can help the community build better datasets in the future. We hope this work motivates future research on text-based video question answering across multiple video categories.
Furthermore, pretraining strategies and combined representation learning over these videos and the multiple modalities they provide will help create scalable systems and drive future research towards better datasets and creative solutions.

Year of completion:	March 2024
Advisor:	C V Jawahar