Improved Representation Spaces for Videos

Bipasha Sen

Abstract

Videos form an integral part of human lives and act as one of the most natural forms of perception, spanning both the spatial and the temporal dimensions: the spatial dimension emphasizes the content, whereas the temporal dimension emphasizes the change. Naturally, studying this modality is an important area of computer vision. Notably, one must efficiently capture this high-dimensional modality to perform different downstream tasks robustly. In this thesis, we study representation learning for videos in the context of two key families of video-based tasks: classification and generation. In a classification task, a video is compressed to a latent space that captures the key discriminative properties of the video relevant to the task. Generation, on the other hand, involves starting with a latent space (often a known space, such as a standard normal distribution) and learning a valid mapping between the latent space and the video manifold. This thesis explores complementary representation techniques to develop robust representation spaces useful for diverse downstream tasks. In this vein, the thesis starts by tackling video classification, where we concentrate on the specific task of “lipreading” (transliterating videos to text) or, in technical terms, classifying videos of mouth movements. Through this work, we propose a compressed generative space that self-augments the dataset, improving the discriminative capabilities of the classifier. Motivated by the findings of this work, we move on to finding an improved generative space, in which we touch upon several key elements of video generation, including unconditional video generation, video inversion, and video superresolution.
In the classification task, we study lipreading (visually recognizing speech from the mouth movements of a speaker), a challenging and mentally taxing task for humans to perform. Unfortunately, multiple medical conditions force people to depend on this skill in their day-to-day lives for essential communication. Patients suffering from ‘Amyotrophic Lateral Sclerosis’ (ALS) often lose muscle control and, consequently, their ability to generate speech, leaving them to communicate via lip movements. Existing large datasets do not focus on medical patients or curate personalized vocabulary relevant to an individual. Collecting the large-scale patient datasets needed to train modern data-hungry deep learning models is, however, extremely challenging. We propose a personalized network designed to lipread for an ALS patient using only one-shot examples. We depend on synthetically generated lip movements to augment the one-shot scenario. A Variational Encoder-based domain adaptation technique is used to bridge the real-synthetic domain gap. For the patient, our approach achieves a top-5 accuracy of 83.2%, compared to 62.6% achieved by comparable methods. Apart from evaluating our approach on the ALS patient, we also extend it to people with hearing impairment, who rely extensively on lip movements to communicate.
In the next part of the thesis, we focus on representation spaces for video-based generative tasks. Generating videos is a complex task that is typically accomplished by generating a set of temporally coherent images frame-by-frame. This approach confines the expressivity of videos to image-based operations on individual frames, necessitating network designs that can achieve temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using implicit neural representations (INRs): an INR is a multi-layered perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted using a meta-network, a hypernetwork trained on neural representations of multiple video instances. Later, the meta-network can be sampled to generate diverse novel videos, enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space, showcasing many interesting properties not possible with existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also in-paint missing portions in videos to recover temporally coherent full videos. We evaluate the space learned by INR-V against existing baselines on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting. INR-V significantly outperforms the baselines on several of these demonstrated tasks, clearly showcasing the potential of the proposed representation space.
In summary, this thesis contributes to the field of computer vision by exploring representation learning for videos. The proposed methods are evaluated through extensive experimentation and analysis, which demonstrate their advantages over existing works. These findings have the potential to advance a range of video-based applications, including personalized healthcare, entertainment, and communication. By developing robust representation spaces that improve video classification and generation, this work opens up new possibilities for more natural and effective ways of perceiving, understanding, and interacting with videos.
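
The INR-V formulation described in the abstract (a hypernetwork that predicts the weights of a multi-layered perceptron, which in turn maps each pixel location of a video to an RGB value) can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the layer sizes, the sine and sigmoid activations, the use of a single linear map as the hypernetwork, and all function names are assumptions made for illustration, and the weights are random rather than trained, so the rendered video carries no meaning.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the thesis.
LATENT_DIM = 16
HIDDEN = 32
# Shapes of the INR's layers: (x, y, t) -> hidden -> hidden -> RGB.
LAYER_SHAPES = [(3, HIDDEN), (HIDDEN, HIDDEN), (HIDDEN, 3)]
N_PARAMS = sum(n_in * n_out + n_out for n_in, n_out in LAYER_SHAPES)


def make_hypernetwork(rng):
    """Hypothetical hypernetwork: one linear map from a latent code to all INR parameters."""
    return rng.normal(0.0, 0.1, size=(LATENT_DIM, N_PARAMS))


def predict_inr_params(hyper_w, z):
    """Unpack the flat hypernetwork output into per-layer (weight, bias) pairs."""
    flat = z @ hyper_w
    params, offset = [], 0
    for n_in, n_out in LAYER_SHAPES:
        w = flat[offset:offset + n_in * n_out].reshape(n_in, n_out)
        offset += n_in * n_out
        b = flat[offset:offset + n_out]
        offset += n_out
        params.append((w, b))
    return params


def inr_forward(params, coords):
    """Evaluate the MLP at (x, y, t) coordinates; returns RGB values in [0, 1]."""
    h = coords
    for w, b in params[:-1]:
        h = np.sin(h @ w + b)  # sine activation is an assumption
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))  # sigmoid keeps RGB in [0, 1]


def render_video(params, frames, height, width):
    """Query the INR on a regular grid to materialise a (T, H, W, 3) video."""
    t, y, x = np.meshgrid(
        np.linspace(0, 1, frames),
        np.linspace(0, 1, height),
        np.linspace(0, 1, width),
        indexing="ij",
    )
    coords = np.stack([x, y, t], axis=-1).reshape(-1, 3)
    return inr_forward(params, coords).reshape(frames, height, width, 3)


rng = np.random.default_rng(0)
hyper_w = make_hypernetwork(rng)
z_a, z_b = rng.normal(size=(2, LATENT_DIM))

# Interpolating latent codes yields an intermediate INR, hence an intermediate video.
z_mid = 0.5 * z_a + 0.5 * z_b
video = render_video(predict_inr_params(hyper_w, z_mid), frames=4, height=8, width=8)
print(video.shape)  # (4, 8, 8, 3)
```

Because the video is stored as a function of continuous coordinates rather than as a fixed stack of frames, the same predicted parameters can be queried on a denser grid (for example `frames=8, height=16, width=16`) without changing the network, and video interpolation reduces to interpolating between latent codes such as `z_a` and `z_b`.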

Year of completion:	August 2023
Advisors:	C V Jawahar, Vinay P Namboodiri