Monocular 3D Human Body Reconstruction


Abbhinav Venkat

Abstract

Monocular 3D human reconstruction is a highly relevant problem due to its numerous applications in the entertainment industry, e-commerce, health care, mobile-based AR/VR platforms, etc. However, it is severely ill-posed due to self-occlusions from complex body poses and shapes, clothing obstructions, lack of surface texture, background clutter, the single view, etc. Conventional approaches address these challenges with different sensing systems: marker-based systems, marker-less multi-view cameras, inertial sensors, and 3D scanners. Although effective, such methods are often expensive and have limited wide-scale applicability. In an attempt to produce scalable solutions, a few works have focused on fitting statistical body models to monocular images, but these depend on a costly optimization process. Recent efforts use data-driven algorithms such as deep learning to learn priors directly from data. However, they focus on template model recovery or rigid object reconstruction, or propose paradigms that do not directly extend to recovering personalized models.

To predict accurate surface geometry, our first attempt was VolumeNet, which predicts a 3D occupancy grid from a monocular image and was the first model of its kind for non-rigid human shapes at the time. To circumvent the ill-posed nature of this problem (aggravated by an unbounded 3D representation), we follow the ideology of providing maximal training priors through our unique training paradigms, so as to enable testing with minimal information. As we do not impose any body-model-based constraint, we are able to recover deformations induced by free-form clothing. Further, we extend VolumeNet to PoShNet by decoupling pose and shape: we learn the volumetric pose first and use it as a prior for learning the volumetric shape, thereby recovering a more accurate surface.

Although volumetric regression enables more accurate surface reconstruction, it does so without an animatable skeleton. Further, such methods yield low-resolution reconstructions at a high computational cost (regression over a cubic voxel grid) and often suffer from an inconsistent topology with broken or partial body parts. Hence, statistical body models become a natural choice to offset the ill-posed nature of the problem. Although they are theoretically low dimensional, learning such models has been challenging due to the complex non-linear mapping from the image to the relative axis-angle representation; hence, most solutions rely on different projections of the underlying mesh (2D/3D keypoints, silhouettes, etc.). To simplify the learning process, we propose the CR framework, which uses classification as a prior to guide the regression's learning. Although recovering personalized models with high-resolution meshes is not possible in this space, the framework shows that learning such template models can be difficult without additional supervision.

As an alternative to directly learning parametric models, we propose HumanMeshNet, which learns an "implicitly structured point cloud" by using the mesh topology as a prior to enable better learning. We hypothesize that instead of learning the highly non-linear SMPL parameters, learning the corresponding point cloud (although high dimensional) while enforcing the parametric template's topology on it is an easier task. This paradigm can, in principle, learn local surface deformations that the body-model-based PCA space cannot capture. Going forward, producing high-resolution meshes (with accurate geometric details) is a natural extension that is easier in 3D space than in the parametric one.

In summary, this thesis attempts to address several of the aforementioned challenges and to empower machines with the capability to interpret a 3D human body model (pose and shape) from a single image in a manner that is non-intrusive, inexpensive, and scalable. In doing so, we explore different 3D representations capable of producing accurate surface geometry, with the long-term goal of recovering personalized 3D human models.
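To make the HumanMeshNet idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the thesis code) of a loss for an "implicitly structured point cloud": predicted SMPL vertices are supervised directly, while the template's fixed edge connectivity regularizes the point cloud so it behaves like a mesh. The vertex count is SMPL's 6890; the exact terms and weights in the actual work may differ.

```python
import torch

def structured_point_cloud_loss(pred_verts, gt_verts, edges, w_edge=0.5):
    """pred_verts, gt_verts: (B, 6890, 3) SMPL vertex positions.
    edges: (E, 2) long tensor of vertex-index pairs taken from the
    fixed template mesh topology."""
    # Direct per-vertex supervision on the regressed point cloud.
    l_vert = (pred_verts - gt_verts).norm(dim=-1).mean()
    # Topology prior: predicted edge lengths should match ground truth,
    # enforcing the template's surface structure on the point cloud.
    pe = pred_verts[:, edges[:, 0]] - pred_verts[:, edges[:, 1]]
    ge = gt_verts[:, edges[:, 0]] - gt_verts[:, edges[:, 1]]
    l_edge = (pe.norm(dim=-1) - ge.norm(dim=-1)).abs().mean()
    return l_vert + w_edge * l_edge
```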

Year of completion: November 2020
Advisor: Avinash Sharma


Neural and Multilingual Approaches to Machine Translation for Indian Languages and its Applications

Jerin Philip

Abstract

Neural Machine Translation (NMT), together with multilingual formulations, has arisen as the de-facto standard for translating a sentence from a source language to a target language. However, unlike for many Western languages, the available resources for the languages of the Indian subcontinent, such as parallel training data or trained models that can be used to build and demonstrate applications in other domains, are limited. This work takes a major step towards closing this gap. We describe the development of state-of-the-art translation solutions for 10 Indian languages and English, in four parts:

1. Considering the Hindi-English language pair, we develop an NMT solution for a narrow domain, demonstrating its application in translating cricket commentary.
2. Through heavy data augmentation, we extend the above to the general domain and build a state-of-the-art MT system for the Hindi-English pair. We further extend to five more languages by taking advantage of multiway formulations.
3. We demonstrate the application of NMT in contributing more resources to the already resource-scarce field, expanding to 11 languages, and its application in the multimodal task of translating a talking face to a target language with lip synchronization.
4. Finally, we iteratively improve both the data situation and translation performance for 11 Indian languages, placing our models within a standardized, comparable set of metrics so that future advances in the space can be comprehensively evaluated and compared.
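As an illustration of the multiway formulation, a single shared model can serve many translation directions when each source sentence is prefixed with a token naming the desired target language. The sketch below is a hypothetical minimal example of such tagging; the token format and language codes are illustrative, not necessarily the thesis's exact scheme.

```python
def tag_for_multiway(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so one shared encoder-decoder
    can translate into many languages, e.g. '__hi__ <sentence>'."""
    return f"__{tgt_lang}__ {src_sentence}"

pairs = [("The batsman hits a six.", "hi"), ("Vaccines save lives.", "ta")]
tagged_corpus = [tag_for_multiway(src, tgt) for src, tgt in pairs]
# The tagged corpus then goes through a standard NMT training pipeline.
```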

Year of completion: August 2020
Advisors: C V Jawahar, Vinay P. Namboodiri


Lip-syncing Videos In The Wild

Prajwal K R

Abstract

The widespread access to the Internet has led to a meteoric rise in audio-visual content consumption. Our habits have shifted from listening to podcasts and radio broadcasts to watching videos on YouTube, and we increasingly prefer the highly engaging nature of video calls over plain voice calls. Given this considerable shift in the desire for audio-visual content, there has also been a surge in video content creation to cater to these consumption needs. Within this fabric of video content creation, especially videos of people talking, lies the problem of making such videos accessible across language barriers. If we want to translate a deep learning lecture video from English to Hindi, not only should the speech be translated but also the visual stream, specifically the lip movements. Learning to lip-sync arbitrary videos to any desired target speech is a problem with several applications, ranging from video translation to readily creating new content that would otherwise require humongous effort. However, speaker-independent lip synthesis for any voice and language is a very challenging task.

In this thesis, we tackle the problem of lip-syncing videos in the wild to any given target speech. We propose two new models in this space: one that significantly improves generation quality and another that significantly improves lip-sync accuracy. In the first model, LipGAN, we identify key issues that plague current approaches to speaker-independent lip synthesis and prevent them from reaching the generation quality of speaker-specific models. Notably, ours is the first model to generate face images that can be pasted back into the video frame, a feature crucial for all real-world applications where the face is just a small part of the displayed content. We show that our improvements in quality enable multiple real-world applications that have not been demonstrated in any previous lip-sync work.

In the second model, Wav2Lip, we investigate why current models are inaccurate when lip-syncing arbitrary talking-face videos, and we hypothesize that the reason is weak penalization of off-sync results during training. This finding allows us to create a lip-sync model that can generate lip-synced videos for any identity and voice with remarkable accuracy and quality. We re-think the current evaluation framework for this task and propose multiple new benchmarks, two new metrics, and a Real-world lip-Sync Evaluation Dataset (ReSyncED). Using our model, we demonstrate applications in lip-syncing dubbed movies and animating real CGI movie clips to new speech, as well as a futuristic video-call application that is useful for poor network connections. Finally, we present the two major applications our model can impact the most: social media content creation and personalization, and video translation. We hope that our advances in lip synthesis open up new avenues for research in talking face generation from speech.
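A strong lip-sync penalty can be obtained by scoring generated frames with a frozen audio-visual sync "expert" and training the generator against that score. The following is a hypothetical PyTorch sketch of such a penalty; the names and exact formulation are illustrative rather than the thesis's implementation.

```python
import torch
import torch.nn.functional as F

def sync_penalty(video_emb: torch.Tensor, audio_emb: torch.Tensor):
    """video_emb, audio_emb: (B, D) embeddings of generated mouth frames
    and the target speech, produced by a frozen sync expert."""
    # High cosine similarity means the lips match the audio, so we
    # minimize the negative log of the (clamped) similarity.
    sim = F.cosine_similarity(video_emb, audio_emb, dim=-1)
    prob = sim.clamp(min=1e-6, max=1.0)  # treat as probability of sync
    return -prob.log().mean()
```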

Year of completion: August 2020
Advisors: C V Jawahar, Vinay P. Namboodiri


Multiscale Two-view Stereo using Convolutional Neural Networks for Unrectified Images

Y N Pramod

Abstract

The two-view stereo problem is a special case of the multi-view stereo problem in which only two views or orientations are available for estimating depth or disparity. Given the constrained nature of this setup, traditional algorithms assume that either the intrinsic or the extrinsic parameters are available in advance in order to build the homographies between the views and rectify the images. Stereo rectification allows the epipolar constraint to be enforced, so that the corresponding projections of a 3D point can be searched in one dimension along the epipolar lines. When the calibration matrices are not available, the two-view stereo problem reduces to estimating the fundamental matrix. The condition number, which measures the instability of a function's output when its input changes, is high for a fundamental matrix estimated with the eight-point algorithm.

Deep learning methods have become the sought-after solution to numerous computer vision problems, as state-of-the-art research has exposed the learning power of neural networks. We explore stereo correspondence in an uncalibrated setting by estimating a depth map from a pair of unrectified stereo images. We seek an end-to-end solution in which the relative depths of pixels with respect to a single viewpoint can be extracted with the aid of another view. Extending the capabilities of the correlation layer devised in the FlowNet architecture, we design a modified FlowNet architecture that regresses depth maps, with an extension to multiscale correlations for handling textureless and repetitively textured surfaces. Due to the unavailability of a dataset for deep learning on unrectified images, we construct a constrained setup of turntable sequences for this purpose using Google 3D Warehouse models.

Following the concepts of attention modelling, we implement an architecture that combines correlations computed at multiple resolutions using a simple element-wise multiplication, helping the network resolve correspondences on textureless and repetitively textured surfaces. Our experiments show both qualitative and quantitative improvements in depth maps over the original FlowNet architecture and provide a solution to unrectified stereo depth estimation, whereas most algorithms in the literature compute depth maps/disparities from rectified stereo image pairs.
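The multiscale fusion described above can be sketched as follows: correlation volumes computed at several resolutions are upsampled to a common size and multiplied element-wise, so a candidate match survives only if it is consistent across scales. This is a hypothetical minimal PyTorch illustration, not the thesis's exact architecture.

```python
import torch
import torch.nn.functional as F

def fused_correlation(corrs):
    """corrs: list of (B, D, H_i, W_i) correlation volumes, ordered
    coarse to fine, with D displacement hypotheses per pixel."""
    target = corrs[-1].shape[-2:]  # finest spatial resolution
    up = [F.interpolate(c, size=target, mode="bilinear",
                        align_corners=False) for c in corrs]
    fused = up[0]
    for c in up[1:]:
        fused = fused * c  # attention-like gating across scales
    return fused
```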

Year of completion: May 2020
Advisor: Anoop M Namboodiri


Semantic Edge Labeling using Depth Cues

Nishit Soni

Abstract

Contours are critical in human perception of a scene: they provide information about object boundaries, surface planes, and surface intersections, and this information helps isolate objects from a scene. Contours have similar importance in computer vision, where it has been shown that labelled edges can contribute to segmentation, reconstruction, and recognition problems. This thesis addresses edge labeling of images of indoor and outdoor scenes using depth and RGB data. We classify contours as occluding or planar (depth discontinuities), and convex or concave (surface-normal discontinuities). This task is not straightforward and is one of the fundamental problems in computer vision.

We propose a novel algorithm that uses a random forest to classify edge pixels into occluding, planar, convex, and concave entities. We first focus on indoor images, where we use depth information from a Kinect, and release an indoor dataset of more than 500 RGBD images with pixel-wise ground-truth labels. Our method produces promising results, achieving an F-score of 0.84. We also test the approach on more complex images from the NYU Kinect dataset, where we obtain an F-score of 0.74.

When addressing this problem in outdoor images, where depth comes from stereo, we recognize the need for additional features: stereo depth of outdoor scenes has artifacts and errors that cannot confidently represent an edge type locally. We show that a simple feature based on semantic classes helps improve the labeling. On the KITTI outdoor driving stereo dataset, we obtain an average F-score of 0.77 for occluding and planar edges, while the approach performs poorly on curvature edges, i.e., convex and concave edges. We find this to be due to stereo depth errors and low-resolution depth at far distances, which lead to poor feature extraction. Nevertheless, we acknowledge the potential of semantic classes to improve edge labeling; with a large amount of ground-truth edge labels and better semantic segmentation, there is hope of improving the classification.
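To make the classification step concrete, here is a hypothetical minimal sketch using scikit-learn's random forest, assuming one precomputed feature vector per edge pixel (e.g., local depth and surface-normal statistics). The features and hyperparameters are placeholders, not the thesis's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

LABELS = ["occluding", "planar", "convex", "concave"]

# Placeholder data: X holds one feature vector per edge pixel,
# y holds the integer index of its label in LABELS.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 16))
y_train = rng.integers(0, 4, 1000)

clf = RandomForestClassifier(n_estimators=100, max_depth=12, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(rng.random((5, 16)))  # labels for unseen edge pixels
print([LABELS[i] for i in pred])
```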

Year of completion: May 2020
Advisor: Anoop M Namboodiri

