Towards Data-Driven Cinematography and Video Retargeting using Gaze

Kranthi Kumar Rachavarapu

Abstract

In recent years, the proliferation of devices capable of capturing and consuming multimedia content has led to a phenomenal increase in multimedia consumption, most of it dominated by video. This creates a need for efficient tools and techniques to create videos and for better ways to render the content. To address these problems, this thesis focuses on (a) algorithms for efficient video content adaptation and (b) automating the process of video content creation. To address the problem of efficient video content adaptation, we present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions and (ii) adhere to the principles of cinematography. Our approach is (a) content agnostic, as the same methodology is employed to re-edit a wide-angle video recording or a close-up movie sequence captured with a static or moving camera, and (b) independent of video length, so it can in principle re-edit an entire movie in one go. The proposed retargeting algorithm consists of two steps. The first step employs gaze transition cues to detect, via dynamic programming, the time stamps where new cuts are to be introduced in the original video. The second step optimizes the cropping window path (to create pan and zoom effects) while accounting for the original and new cuts. The cropping window path is designed to include maximum gaze information and is composed of piecewise constant, linear and parabolic segments. It is obtained via L1-regularized convex optimization, which ensures a smooth viewing experience. We test our approach on a wide variety of videos and demonstrate significant improvement over the state of the art, both in terms of computational complexity and qualitative aspects.
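As a rough illustration of how dynamic programming can place cuts from gaze data, the sketch below fits a piecewise-constant crop-window position to a one-dimensional gaze track and reports the frames where the position jumps. It is a minimal sketch under stated assumptions, not the formulation used in the thesis: the function name detect_cuts, the candidate positions, the absolute-distance data cost and the fixed cut_penalty are all hypothetical choices made for illustration.

```python
def detect_cuts(gaze_x, positions, cut_penalty):
    """Viterbi-style DP over candidate crop-window centres (illustrative only).

    gaze_x      : per-frame horizontal gaze location (assumed input)
    positions   : candidate crop-window centres (assumed discretisation)
    cut_penalty : fixed cost paid whenever the window jumps (assumed)
    Returns the chosen centre per frame and the frame indices of the cuts.
    """
    n, k = len(gaze_x), len(positions)
    # Cost of ending at each candidate position after the first frame.
    cost = [abs(gaze_x[0] - p) for p in positions]
    back = []  # back[t-1][j] = best predecessor state of state j at frame t

    for t in range(1, n):
        # Cheapest state at the previous frame: the only useful source of a jump.
        best_prev = min(range(k), key=cost.__getitem__)
        new_cost, ptr = [], []
        for j, p in enumerate(positions):
            stay = cost[j]
            jump = cost[best_prev] + cut_penalty
            if stay <= jump:
                prev, c = j, stay
            else:
                prev, c = best_prev, jump
            new_cost.append(c + abs(gaze_x[t] - p))
            ptr.append(prev)
        back.append(ptr)
        cost = new_cost

    # Backtrack from the cheapest final state.
    state = min(range(k), key=cost.__getitem__)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()

    cuts = [t for t in range(1, n) if path[t] != path[t - 1]]
    return [positions[s] for s in path], cuts
```

With a gaze track that sits at one location and then moves to another, the sketch places a single cut at the transition; a larger cut_penalty makes it ignore brief gaze excursions instead of cutting to them.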
A study performed with 16 users confirms that our approach results in a superior viewing experience compared to the state-of-the-art and letterboxing methods, especially for wide-angle static camera recordings. Because the retargeting algorithm takes an existing video and adapts it to a new aspect ratio, it can only use the information already present in the video, which limits its applicability. In the second part of the thesis, we address the problem of automatic video content creation by investigating the use of deep learning techniques for automating cinematography. This formulation gives users more freedom to create content according to their preferences. Specifically, we investigate the problem of predicting shot specification from the script by learning this association from real movies. The problem is posed as a sequence classification task using a Long Short-Term Memory (LSTM) network, which takes as input the sentence embedding and a few other high-level structural features (such as sentiment, dialogue acts and genre) corresponding to a line of dialogue and predicts the shot specification for that line in terms of Shot-Size, Act-React and Shot-Type categories. We conducted a systematic study of the effect of feature combinations and of input sequence length on classification accuracy. We propose two different formulations of the same problem using the LSTM architecture and extensively study the suitability of each for the task. We also created a new dataset for this task, consisting of 16000 shots and 10000 dialogue lines. The experimental results are promising in terms of quantitative measures (such as classification accuracy and F1-score).
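To make the notion of input sequence length concrete, the sketch below shows one way per-line feature vectors could be grouped into fixed-length windows, each paired with the shot label of its last dialogue line. This is only an assumed framing for illustration: the function name build_sequences, the label strings and the choice to label a window by its final line are hypothetical and are not taken from the thesis, which does not spell out its two formulations here.

```python
def build_sequences(line_features, shot_labels, seq_len):
    """Group consecutive dialogue lines into fixed-length windows.

    line_features : one feature vector per dialogue line (assumed to be the
                    sentence embedding concatenated with other features)
    shot_labels   : one label per dialogue line (hypothetical label names)
    seq_len       : number of consecutive lines fed to the sequence model
    Returns a list of (window, label) pairs, where the label belongs to the
    last line of the window (an assumption of this sketch).
    """
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    if len(line_features) != len(shot_labels):
        raise ValueError("features and labels must have the same length")

    samples = []
    for end in range(seq_len, len(line_features) + 1):
        window = line_features[end - seq_len:end]
        samples.append((window, shot_labels[end - 1]))
    return samples
```

Varying seq_len in such a setup changes how much preceding dialogue the classifier sees for each prediction, which is the kind of knob a study of input sequence length would sweep.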

Year of completion:	April 2019
Advisor:	Vineet Gandhi