
Computational Video Editing and Re-editing


Moneish Kumar

Abstract

Advances in video capture technology have made recording video very easy. Small, affordable cameras that boast high-end specifications and capture video at very high resolutions (4K, 8K and even 16K) have put high-quality recording within everyone's reach. Yet while capturing video is now straightforward and effortless, a major part of the video production process, editing, remains labor intensive and demands skill and expertise. This thesis takes a step towards automating the editing process and making it less time consuming. Specifically, (1) we explore a novel approach for automatically editing stage performances so that both the context of the scene and close-up details of the actors are shown, and (2) we propose a new method to optimally retarget videos to any desired aspect ratio while retaining salient regions derived using gaze tracking.

Recordings of stage performances are easy to capture with a high-resolution camera, but are difficult to watch because the actors' faces are too small. We present an approach that automatically creates a split-screen video transforming these recordings to show both the context of the scene and close-up details of the actors. Given a static recording of a stage performance and tracking information about the actors' positions, our system generates videos showing a focus+context view based on computed close-up camera motions using crop-and-zoom. The key to our approach is to compute these camera motions such that they are cinematically valid close-ups, and to ensure that the set of views of the different actors is properly coordinated and presented. We pose the computation of camera motions as a convex optimization that creates detailed views and smooth movements, subject to cinematic constraints such as not cutting faces with the edge of the frame. Additional constraints link the close-up views of each actor, causing them to merge seamlessly when actors are close. The generated views are placed in a resulting layout that preserves the spatial relationships between actors. This eliminates the manual labour and expertise required both for capturing the performance and for editing it later; instead, the split screen of focus+context views lets the viewer actively decide what to attend to. We demonstrate our results on a variety of staged theater and dance performances.

Videos are captured at a specific aspect ratio chosen for the target screen on which they are meant to be viewed, which results in an inferior viewing experience when they are watched on screens with a different aspect ratio. We present an approach to automatically retarget any given video to any desired aspect ratio while preserving its most salient regions, obtained using gaze tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions and (ii) adhere to the principles of cinematography. The algorithm has two steps. The first uses dynamic programming to find a cropping window path that maximizes gaze inclusion within the window and to locate plausible new cuts (if required). The second performs regularized convex optimization on the path obtained via dynamic programming to produce a smooth cropping window path comprised of piecewise linear, constant and parabolic segments. We test our re-editing algorithm on a diverse collection of movie and theater sequences. A study conducted with 16 users confirms that our retargeting algorithm results in a superior viewing experience compared to gaze-driven re-editing [30] and letterboxing, especially for wide-angle static camera recordings.
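The first, dynamic-programming step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the window is restricted to horizontal motion, candidate positions are discretized, and the `move_cost` penalty is an assumed stand-in for the full cinematic constraints (the thesis additionally detects cuts and then smooths the path via convex optimization).

```python
import numpy as np

def dp_crop_path(gaze_x, frame_w, crop_w, move_cost=0.1, step=8):
    """Dynamic-programming sketch: pick a horizontal crop-window position
    per frame that maximizes gaze inclusion while penalizing large jumps.
    gaze_x: one array of horizontal gaze coordinates per frame."""
    positions = np.arange(0, frame_w - crop_w + 1, step)  # candidate left edges
    n, m = len(gaze_x), len(positions)
    # reward[t, j] = fraction of gaze samples inside window j at frame t
    reward = np.array([[np.mean((g >= p) & (g < p + crop_w)) for p in positions]
                       for g in gaze_x])
    score = np.full((n, m), -np.inf)
    back = np.zeros((n, m), dtype=int)
    score[0] = reward[0]
    for t in range(1, n):
        for j in range(m):
            # best predecessor, penalizing window movement
            trans = score[t - 1] - move_cost * np.abs(positions - positions[j]) / crop_w
            back[t, j] = int(np.argmax(trans))
            score[t, j] = reward[t, j] + trans[back[t, j]]
    # backtrack the highest-scoring path
    path = np.zeros(n, dtype=int)
    path[-1] = int(np.argmax(score[-1]))
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return positions[path]
```

In the thesis this raw path is then refined into piecewise constant, linear and parabolic segments by the convex-optimization step.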

 

Year of completion: November 2018
Advisor: Vineet Gandhi


Geometric + Kinematic Priors and Part-based Graph Convolutional Network for Skeleton-based Human Action Recognition


Kalpit Thakkar

Abstract

Videos dominate the media archive, providing a rich and vast source of information from which machines can learn to understand such data and predict useful attributes that help technology improve human lives. One of the most significant and instrumental parts of a computer vision system is the comprehension of human actions from visual sequences, a paragon of machine intelligence achievable through computer vision. Human action recognition is highly important because it facilitates several applications built around recognizing human actions. Understanding actions from monocular sequences has been studied immensely, while comprehending human actions from skeleton videos has developed more recently. We propose action recognition frameworks that use the skeletal data of the human body (viz. 3D locations of some joints) to learn the spatio-temporal representations necessary for recognizing actions. Information about human actions is composed of two integral dimensions, space and time, along which the variations occur. The spatial representation of a human action aims to encapsulate the configuration of the human body essential to the action, while the temporal representation aims to capture the evolution of such configurations across time. To this end, we propose to use geometric relations between human skeleton joints to discern the body pose relative to the action, and physically inspired kinematic quantities to understand the temporal evolution of body pose. Spatio-temporal understanding of human actions is thus conceived as a comprehension of geometric and kinematic information with the help of machine learning frameworks. Using a representation that amalgamates geometric and kinematic features, we recognize human actions from skeleton videos (S-videos) with such frameworks.

We first present a non-parametric approach for temporal sub-segmentation of trimmed action videos using the angular momentum trajectory of the skeletal pose sequence. A meaningful summarization of the pose sequence is obtained by systematically sampling the resulting segments. Descriptors capturing geometric and kinematic statistics, encoded as histograms spread across a periodic range of orientations, are computed to represent the summarized pose sequences and are fed to a kernelized classifier for recognizing actions. This framework demonstrates the value of geometric and kinematic properties of human pose sequences for spatio-temporal modeling of actions. However, a downside of this framework is its inability to scale with the availability of large amounts of visual data. To mitigate this drawback, we next present geometric deep learning frameworks, specifically graph convolutional networks, for the same task. Representing the human skeleton as a sparse spatial graph is intuitive and yields a structured form that lies on graph manifolds. A human action video hence forms a spatio-temporal graph, and graph convolutions facilitate learning a spatio-temporal descriptor for the action. Inspired by the success of Deformable Part-based Models (DPMs) for object understanding from images, we propose a part-based graph convolutional network (PB-GCN) that operates on a human skeletal graph divided into parts. Building on the success of geometric and kinematic features, we propose to use relative coordinates and temporal displacements of the 3D joint coordinates as features at the vertices of the skeletal graph. Owing to these signals, the prowess of graph convolutional networks is further boosted to attain state-of-the-art performance among action recognition systems using skeleton videos.

In this thesis, we examine the growth of the idea of using geometry and kinematics, transition to geometric deep learning frameworks, and design a PB-GCN with geometric + kinematic signals at the vertices for the task of human action recognition using skeleton videos.
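The vertex signals and the part-based convolution described above might be sketched as follows. This is a minimal numpy illustration, not the exact PB-GCN formulation: the anchor joint for relative coordinates, the simple mean-normalized adjacency, and the function names are all illustrative assumptions.

```python
import numpy as np

def vertex_features(joints, anchor=0):
    """Geometric + kinematic vertex signals: per-joint relative coordinates
    (geometry) concatenated with temporal displacements (kinematics).
    joints: (T, V, 3) array of 3D joint locations for T frames, V joints."""
    rel = joints - joints[:, anchor:anchor + 1, :]   # coordinates relative to an anchor joint
    disp = np.zeros_like(joints)
    disp[1:] = joints[1:] - joints[:-1]              # frame-to-frame displacement
    return np.concatenate([rel, disp], axis=-1)      # (T, V, 6)

def part_graph_conv(X, part_adjs, weights):
    """One part-based graph-convolution layer (illustrative): each body part
    has its own adjacency and weight matrix; part outputs are summed where
    parts share joints. X: (V, C_in) vertex features for one frame."""
    out = np.zeros((X.shape[0], weights[0].shape[1]))
    for A, W in zip(part_adjs, weights):
        D = A.sum(1, keepdims=True)
        D[D == 0] = 1                                # avoid division by zero
        out += (A / D) @ X @ W                       # normalized aggregation, then projection
    return out
```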

     

Year of completion: March 2019
Advisor: P J Narayanan


Towards Data-Driven Cinematography and Video Retargeting using Gaze


Kranthi Kumar Rachavarapu

Abstract

In recent years, with the proliferation of devices capable of capturing and consuming multimedia content, there has been a phenomenal increase in multimedia consumption, most of it dominated by video. This creates a need for efficient tools and techniques to create videos and better ways to render the content. Addressing these problems, in this thesis we focus on (a) algorithms for efficient video content adaptation and (b) automating the process of video content creation.

To address efficient video content adaptation, we present a novel approach to optimally retarget videos for displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions and (ii) adhere to the principles of cinematography. Our approach is (a) content agnostic, as the same methodology is employed to re-edit a wide-angle video recording or a close-up movie sequence captured with a static or moving camera, and (b) independent of video length, so it can in principle re-edit an entire movie in one shot. The proposed retargeting algorithm consists of two steps. The first employs gaze transition cues to detect, via dynamic programming, the time stamps where new cuts are to be introduced in the original video. A subsequent step optimizes the cropping window path (to create pan and zoom effects) while accounting for the original and new cuts. The cropping window path is designed to include maximum gaze information and is composed of piecewise constant, linear and parabolic segments; it is obtained via L1-regularized convex optimization, which ensures a smooth viewing experience. We test our approach on a wide variety of videos and demonstrate significant improvement over the state of the art, both in computational complexity and in qualitative aspects. A study performed with 16 users confirms that our approach results in a superior viewing experience compared to the state of the art and letterboxing, especially for wide-angle static camera recordings.

Since the retargeting algorithm takes a video and adapts it to a new aspect ratio, it can only use the information already present in the video, which limits its applicability. In the second part of the thesis, we address automatic video content creation by exploring deep learning techniques for automating cinematography. This formulation gives users more freedom to create content according to their preferences. Specifically, we investigate predicting shot specifications from the script by learning this association from real movies. The problem is posed as a sequence classification task using a Long Short-Term Memory (LSTM) network, which takes as input the sentence embedding and a few other high-level structural features (such as sentiment, dialogue acts and genre) corresponding to a line of dialogue, and predicts the shot specification for that line in terms of Shot-Size, Act-React and Shot-Type categories. We conducted a systematic study of the effect of feature combinations and of input sequence length on classification accuracy. We propose two different formulations of the problem using the LSTM architecture and extensively study the suitability of each for the task. We also created a new dataset for this task consisting of 16,000 shots and 10,000 dialogue lines. The experimental results are promising in terms of quantitative measures such as classification accuracy and F1-score.
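The sequence-classification formulation can be illustrated with a toy numpy LSTM. This is a hedged sketch, not the thesis model: real sentence embeddings, the two formulations, and training are all omitted, and `predict_shot_sizes` with a three-class output is an illustrative stand-in for predicting one category (Shot-Size) per dialogue line.

```python
import numpy as np

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM step in numpy. Gate pre-activations are stacked as [i, f, o, g]."""
    z = Wx @ x + Wh @ h + b
    H = h.size
    i, f, o = (1.0 / (1.0 + np.exp(-z[k * H:(k + 1) * H])) for k in range(3))
    g = np.tanh(z[3 * H:])
    c = f * c + i * g            # update cell state
    h = o * np.tanh(c)           # emit hidden state
    return h, c

def predict_shot_sizes(dialogue_feats, params, W_out):
    """Run an LSTM over per-line feature vectors (sentence embedding plus
    sentiment/genre flags, concatenated upstream) and emit one shot-size
    class per dialogue line. All names here are illustrative."""
    Wx, Wh, b = params
    h = np.zeros(Wh.shape[1])
    c = np.zeros(Wh.shape[1])
    preds = []
    for x in dialogue_feats:
        h, c = lstm_step(x, h, c, Wx, Wh, b)
        preds.append(int(np.argmax(W_out @ h)))  # e.g. 0=long, 1=medium, 2=close-up
    return preds
```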

       

Year of completion: April 2019
Advisor: Vineet Gandhi


Development and Tracking of Consensus Mesh for Monocular Depth Sequences


Gaurav Mishra

Abstract

Human body tracking typically requires specialized capture set-ups. Although pose tracking is available in consumer devices like the Microsoft Kinect, it is restricted to stick figures visualizing body-part detection. In this thesis, we propose a method for full 3D human body shape and motion capture of arbitrary movements from the depth channel of a single Kinect, while the subject wears casual clothes. We do not use the RGB channel or an initialization procedure that requires the subject to move around in front of the camera. This makes our method applicable to arbitrary clothing textures and lighting environments, with minimal subject intervention. Our method consists of 3D surface feature detection and articulated motion tracking, regularized by a statistical human body model [40]. We also propose the Consensus Mesh (CMesh), a 3D template of a person created from a single viewpoint. We demonstrate tracking results on challenging poses and argue that using the CMesh along with statistical body models can improve tracking accuracy. Quantitative evaluation of our dense body tracking shows that our method has very little drift, which is further reduced by the use of the CMesh.

We then explore improving the quality of the CMesh using RGB images in a post-processing step. For this we propose a pipeline involving Generative Adversarial Networks. We show that the CMesh can be improved from RGB images of the original person by learning corresponding relative normal maps (N_R maps). These N_R maps have the potential to encode the nuances of the CMesh with respect to the ground-truth object. We explore this method in a synthetic setting for static human-like objects and demonstrate quantitatively that the details learned by such a pipeline are invariant to lighting and texture changes. In the future, the generated N_R maps could be used to improve the quality of the CMesh.
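The role of the N_R map can be pictured with a toy sketch. The thesis does not spell out the exact encoding here, so this assumes the simplest choice, a per-pixel residual between the smooth template normals and detailed ground-truth normals; `relative_normal_map` and `refine_normals` are illustrative names, not the thesis pipeline (which learns the map with a GAN).

```python
import numpy as np

def relative_normal_map(coarse_n, detailed_n):
    """Illustrative N_R map: per-pixel residual between smooth CMesh normals
    and detailed ground-truth normals. Inputs: (H, W, 3) unit-normal images."""
    return detailed_n - coarse_n

def refine_normals(coarse_n, nr):
    """Apply a (predicted) N_R map to recover detailed normals, renormalizing
    so the result is again a unit-normal image."""
    n = coarse_n + nr
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```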

         

Year of completion: June 2019
Advisors: P J Narayanan and Kiran Varanasi


Adversarial Training for Unsupervised Monocular Depth Estimation


Ishit Mehta

Abstract

The problem of estimating scene depth from a single image has seen great progress recently. It is one of the foundational problems in computer vision and has hence been studied from various angles. Since the advent of deep learning, most approaches are data driven: they train high-capacity models with large amounts of data in an end-to-end fashion, relying on ground-truth depth, which is hard to capture and process. Recently, self-supervised methods have been proposed that rely on view supervision as an alternative; these methods minimize a photometric reconstruction error in order to learn depth. In this work, we propose a geometry-aware generative adversarial network that generates multiple novel views from a single image, learning depth as an intermediate step. The synthesized views are discerned from real images using discriminative learning, and we show the gains of this adversarial framework over previous methods. Furthermore, we present a structured adversarial training routine that trains the network on easy examples first and difficult ones later. The combination of adversarial framework, multi-view learning, and structured training produces state-of-the-art performance on unsupervised depth estimation for monocular images. We also compare our method with human depth perception through a series of experiments, investigating whether monocular depth cues such as relative size, occlusion and height in the visual field exist in artificial vision systems. With quantitative and qualitative experiments, we highlight the shortcomings of artificial depth perception and propose future avenues for research.
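The photometric reconstruction error that drives view supervision can be sketched in a deliberately simplified stereo setting. This assumes grayscale images, purely horizontal disparity, and linear sampling along x only, which is far simpler than the full geometry-aware warping used in the thesis; the function name is illustrative.

```python
import numpy as np

def photometric_loss(left, right, disparity):
    """Reconstruct the left image by sampling the right image at horizontally
    shifted coordinates given a predicted per-pixel disparity (in pixels),
    then return the mean L1 photometric error. Images are (H, W) grayscale."""
    H, W = left.shape
    xs = np.arange(W)[None, :] - disparity               # source columns in the right image
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)     # left neighbor for interpolation
    w = np.clip(xs - x0, 0.0, 1.0)                       # interpolation weight
    rows = np.arange(H)[:, None]
    recon = (1 - w) * right[rows, x0] + w * right[rows, x0 + 1]  # linear sampling in x
    return float(np.mean(np.abs(recon - left)))
```

A correct disparity map makes the reconstruction match the input (up to border pixels), so minimizing this loss over a network's predicted disparity is what lets depth emerge without ground truth.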

           

Year of completion: July 2019
Advisor: P J Narayanan

