Learnable HMD Facial De-occlusion for VR Applications in a Person-Specific Setting


Surabhi Gupta

Abstract

Immersive technologies such as Virtual Reality (VR) and Augmented Reality (AR) are among the fastest-growing and most fascinating technologies today. As the name suggests, these technologies promise to provide users with a far richer experience through immersive head-mounted displays. Medicine, culture, education, and architecture are some areas that have already taken advantage of this technology. Popular video conferencing platforms such as Microsoft Teams, Zoom, and Google Meet are working to improve the user experience by allowing users to use digital avatars; nevertheless, these avatars lack immersiveness and realism. What if we extend such applications to virtual reality platforms, so that people feel as though they are talking to each other in person? Integrating virtual reality into collaborative spaces such as virtual telepresence systems has become quite popular after globalization, since it enables multiple users to share the same virtual environment and thus mimics real-life face-to-face interaction. For a better immersive experience in virtual telepresence/communication systems, it is essential to recover the entire face, including the portion masked by the headset (e.g., Head-Mounted Displays, abbreviated as HMDs). Several methods have been proposed in the literature that address this problem in various forms, such as HMD removal and face inpainting. Despite some remarkable explorations, none of these methods provides results usable in virtual reality platforms as expected. Addressing these challenges for the real-world deployment of AR/VR-based applications is drawing growing attention. Considering the limitations and usability of previous solutions, we explore various research challenges and propose a practical approach to facial de-occlusion/HMD removal for virtual telepresence systems. This thesis motivates and introduces the reader to the research challenges in facial de-occlusion, reviews existing solutions and their inapplicability to our problem domain, and then presents the idea and formulation of our proposed solution. With this view, the first chapter lays out the outline of the thesis. In the second chapter, we propose a method for facial de-occlusion and discuss the importance of personalized facial de-occlusion in enhancing the sense of realism in virtual environments. The third chapter discusses refinements to the previously proposed network that improve the reconstruction of the eye region. Finally, the last chapter briefly surveys existing face datasets for face reconstruction and inpainting, followed by an overview of the dataset we collected for this work, from acquisition to making it usable for training deep learning models. In addition, we attempt to extend image-based facial de-occlusion to video frames using off-the-shelf approaches, briefly explained in Appendix A.
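To make the problem setup concrete, here is a minimal sketch of facial de-occlusion framed as masked image inpainting: given a face image and a binary mask covering the HMD region, a small encoder-decoder fills in only the occluded pixels. This is an illustrative PyTorch sketch under assumed shapes and a synthetic mask; it is not the person-specific architecture proposed in the thesis.

import torch
import torch.nn as nn

class DeocclusionGenerator(nn.Module):
    """Toy encoder-decoder that hallucinates the HMD-occluded face region."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),    # RGB + mask channels
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, face, hmd_mask):
        # Drop the occluded pixels and condition the network on the mask.
        masked = face * (1.0 - hmd_mask)
        filled = self.decoder(self.encoder(torch.cat([masked, hmd_mask], dim=1)))
        # Keep observed pixels as-is; fill only the occluded region.
        return face * (1.0 - hmd_mask) + filled * hmd_mask

face = torch.rand(1, 3, 128, 128)        # a person-specific training image
hmd_mask = torch.zeros(1, 1, 128, 128)
hmd_mask[:, :, 30:70, 20:108] = 1.0      # synthetic band approximating an HMD
recon = DeocclusionGenerator()(face, hmd_mask)
print(recon.shape)                       # torch.Size([1, 3, 128, 128])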

Year of completion: November 2022
Advisor: Avinash Sharma, Anoop M Namboodiri

Related Publications


Downloads

thesis

Interactive Video Editing using Machine Learning Techniques


Anchit Gupta

Abstract

There is no doubt that videos are today's most popular method of content consumption. With the rise of streaming giants such as YouTube and Netflix, video content is accessible to more people than ever, and video content creation has naturally increased to cater to the rising demand. In order to reach a wider audience, creators dub their content. An important aspect of dubbing is not only changing the speech but also lip-synchronizing the speaker in the video. Talking-face video generation works have achieved state-of-the-art results in synthesizing videos with accurate lip synchronization. However, most previous works deal with low-resolution talking-face videos (up to 256 × 256 pixels), and generating extremely high-resolution videos remains a challenge. Moreover, with advances in internet and camera technology, more and more people are able to create video content, often in ultra-high resolutions such as 4K (3840 × 2160). In this thesis, we take a giant leap and propose a novel method to synthesize talking-face videos at resolutions as high as 4K! Our task presents several key challenges: (i) scaling existing methods to such high resolutions is resource-constrained, both in terms of compute and the availability of very high-resolution datasets; (ii) the synthesized videos need to be spatially and temporally coherent. The sheer number of pixels that the model needs to generate while maintaining temporal consistency at the video level makes this task non-trivial, and it has never been attempted in the literature. To address these issues, we propose to train the lip-sync generator in a compact Vector Quantized (VQ) space for the first time. Our core idea of encoding faces in a compact 16 × 16 representation allows us to model high-resolution videos. In our framework, we learn the lip movements in the quantized space on the newly collected 4K Talking Faces (4KTF) dataset. Our approach is speaker-agnostic and can handle various languages and voices. We benchmark our technique against several competitive works and show that we can achieve a remarkable 64 times more pixels than the current state-of-the-art! Now, how does one edit videos using the above algorithm, or any other deep learning algorithm? Currently, one has to download the source code of the required method and run it manually. How amazing would it be if people could use deep learning techniques in video editors at the click of a single button? In this thesis, we also propose a video editor based on OpenShot with several state-of-the-art facial video editing algorithms as added functionalities. Our editor provides an easy-to-use interface to apply modern lip-syncing algorithms interactively. Apart from lip-syncing, the editor also uses audio and facial re-enactment to generate expressive talking faces. The manual control improves the overall experience of video editing without missing out on the benefits of modern synthetic video generation algorithms. This control enables us to lip-sync complex dubbed movie scenes, interviews, television shows, and other visual content. Furthermore, our editor provides features that automatically translate lectures, handling the spoken content, the lip-sync of the professor, and background content such as slides. While doing so, we also tackle the critical aspect of synchronizing background content with the translated speech. We qualitatively evaluate the usefulness of the proposed editor by conducting human evaluations. Our evaluations show a clear improvement in the efficiency of human editors and an improvement in video generation quality.
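As a rough illustration of generating in a compact quantized space, the sketch below shows a vector quantizer that snaps 16 × 16 encoder features to their nearest codebook entries, so each face crop becomes 256 discrete tokens. The codebook size, feature dimension, and names are assumptions for exposition, not the actual model from the thesis.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps continuous features at 16x16 resolution to discrete codebook tokens."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, dim, 16, 16)
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)         # (B*256, dim)
        dists = torch.cdist(flat, self.codebook.weight)     # distance to every code
        idx = dists.argmin(dim=1)                           # nearest-neighbour tokens
        quantized = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        return quantized, idx.view(b, h * w)

vq = VectorQuantizer()
z = torch.randn(2, 256, 16, 16)          # assumed encoder output for two face crops
z_q, tokens = vq(z)
print(z_q.shape, tokens.shape)           # (2, 256, 16, 16) and (2, 256)

A generator operating on these 256 tokens per frame has far fewer outputs to model than one predicting 4K pixels directly, which is what makes very high-resolution synthesis tractable in such a space.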

Year of completion: November 2022
Advisor: C V Jawahar, Vinay P Namboodiri

Related Publications


Downloads

thesis

Refining 3D Human Digitization Using Learnable Priors


Routhu Snehith Goud

Abstract

The 3D reconstruction of a human from a monocular RGB image is an interesting yet very challenging research problem in the field of computer vision. It has various applications in the movie industry, the gaming industry, and AR/VR. It is therefore important to recover detailed geometric 3D reconstructions of humans in order to enhance realism. The problem is ill-posed in nature owing to the high degrees of freedom of human pose, self-occlusions, loose clothing, camera viewpoint, and illumination. Though it is possible to generate accurate geometric reconstructions using structured-light scanners or multi-view cameras, these are expensive and require a specific setup, e.g., numerous cameras and controlled illumination. With the advancement of deep learning techniques, the focus of the community has shifted to the 3D reconstruction of people from monocular RGB images. The goal of this thesis is the 3D digitization of people in loose clothing with accurate person-specific details. Reconstructing people in loose clothing is difficult because the topology of clothing differs from that of the human body. A multi-layered shape representation called PeeledHuman is able to deal with loose clothing and occlusions, but it suffers from discontinuities or distorted body parts in occluded regions. To overcome this, we propose peeled semantic segmentation maps, which provide semantic information about the body parts across multiple peeled layers. These peeled semantic maps help the network predict consistent depth for pixels belonging to the same body part across different peeled maps, and they improve reconstruction both quantitatively and qualitatively. Additionally, the 3D semantic segmentation labels have various applications; for example, they can be used to extract the clothing from the reconstructed output. The face plays an important role in 3D human digitization: it gives a person their identity, and realism is enhanced by high-frequency facial details. However, the face occupies only a small region of an image that captures the complete body, which makes high-fidelity face reconstruction along with the body even more challenging. We reconstruct person-specific facial geometry along with the complete body by incorporating a facial prior and refining it further using our proposed framework. Another common challenge faced by existing methods is surface noise, i.e., false geometric edges generated by textural edges present in image space. We address this problem by incorporating a wrinkle-map prior that distinguishes geometrical edges from textural edges coming from image space. In summary, in this thesis we address the problem of 3D human digitization from monocular images. We evaluate our proposed solutions on various existing 3D human datasets and demonstrate that they outperform existing state-of-the-art methods. We also discuss the limitations of the proposed methods, potential ways to address them, and future directions that can be explored based on the proposed solutions. Overall, we propose efficient and robust methods to recover accurate and personalized 3D humans from images.
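To give an intuition for how a multi-layered (peeled) representation turns into geometry, the sketch below back-projects a stack of peeled depth maps into a 3D point cloud using pinhole camera intrinsics. The intrinsics, the four-layer setup, and the function name are illustrative assumptions, not the exact formulation used in the thesis.

import torch

def backproject_peeled_depth(peeled_depth, fx=500.0, fy=500.0, cx=128.0, cy=128.0):
    """peeled_depth: (L, H, W) depth per peeled layer; returns an (N, 3) point cloud."""
    num_layers, H, W = peeled_depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    points = []
    for layer in range(num_layers):
        z = peeled_depth[layer]
        valid = z > 0                                 # zero depth means the ray hit nothing
        x = (u[valid] - cx) * z[valid] / fx
        y = (v[valid] - cy) * z[valid] / fy
        points.append(torch.stack([x, y, z[valid]], dim=1))
    return torch.cat(points, dim=0)

# Four peeled layers capture the first and subsequent (occluded) surfaces hit by each ray.
depth_layers = torch.rand(4, 256, 256)
cloud = backproject_peeled_depth(depth_layers)
print(cloud.shape)                                    # (N, 3)

Peeled semantic segmentation maps, as proposed here, would additionally carry a body-part label for each of these pixels, so depth predictions for the same part can be kept consistent across layers.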

Year of completion: December 2022
Advisor: Avinash Sharma

Related Publications


Downloads

thesis

Deep Neural Models for Generalized Synthesis of Multi-Person Actions


Debtanu Gupta

Abstract

The ability to synthesize novel and diverse human motion at scale is indispensable not only to the umbrella field of computer vision but also to a multitude of allied fields such as animation, human-computer interaction, robotics, and human-robot interaction. Over the years, various approaches have been proposed, including physics-based simulation, key-framing, database methods, etc. But ever since the renaissance of deep learning and the rapid development of computing, the generation of synthetic human motion using deep-learning-based methods has received significant attention. Apart from pixel-based video data, the availability of reliable motion capture systems has enabled pose-based human action synthesis; much of this is owed to the development of frugal motion capture systems, which enabled the curation of large-scale skeleton action datasets. In this thesis, we focus on skeleton-based human action generation. To begin with, we study an approach for large-scale skeleton-based action generation. In doing so, we introduce MUGL, a novel deep neural model for large-scale, diverse generation of single- and multi-person pose-based action sequences with locomotion. Our controllable approach enables variable-length generations customizable by action category, across more than 100 categories. To enable intra- and inter-category diversity, we model the latent generative space using a Conditional Gaussian Mixture Variational Autoencoder. To enable realistic generation of actions involving locomotion, we decouple the local pose and global trajectory components of the action sequence. We incorporate duration-aware feature representations to enable variable-length sequence generation. We use a hybrid pose sequence representation with 3D pose sequences sourced from videos and 3D Kinect-based sequences of NTU-RGBD120. To enable principled comparison of generation quality, we employ suitably modified strong baselines during evaluation. Although smaller and simpler than the baselines, MUGL provides better-quality generations, paving the way for practical and controllable large-scale human action generation. Further, we study approaches that generalize across datasets with varying properties, as well as methods for dense skeleton action generation. In this backdrop, we introduce DSAG, a controllable deep neural framework for action-conditioned generation of full-body, multi-actor, variable-duration actions. To compensate for incompletely detailed finger joints in existing large-scale datasets, we introduce full-body dataset variants with detailed finger joints. To overcome shortcomings in existing generative approaches, we introduce dedicated representations for encoding finger joints. We also introduce novel spatiotemporal transformation blocks with multi-head self-attention and specialized temporal processing. These design choices enable generations for a large range of body joint counts (24–52), frame rates (13–50), global body movement (in-place, locomotion), and action categories (12–120), across multiple datasets (NTU-120, HumanAct12, UESTC, Human3.6M). Our experimental results demonstrate DSAG's significant improvements over the state-of-the-art, its suitability for action-conditioned generation at scale, and its usefulness for the challenging task of long-term motion prediction.
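As a simplified illustration of category-conditioned generation from a Gaussian-mixture latent space, the sketch below samples a latent code from a per-category Gaussian component and decodes it into a fixed-length pose sequence with a recurrent decoder. All dimensions, names, and the choice of a GRU decoder are assumptions for exposition and differ from the actual MUGL/DSAG architectures.

import torch
import torch.nn as nn

class ActionGMPrior(nn.Module):
    """One Gaussian component per action category; sampling picks the conditioned component."""
    def __init__(self, num_categories=120, latent_dim=64):
        super().__init__()
        self.mu = nn.Embedding(num_categories, latent_dim)
        self.log_sigma = nn.Embedding(num_categories, latent_dim)

    def sample(self, category):
        mu, sigma = self.mu(category), self.log_sigma(category).exp()
        return mu + sigma * torch.randn_like(mu)          # reparameterized draw

prior = ActionGMPrior()
decoder = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
to_pose = nn.Linear(128, 24 * 3)                          # 24 joints in 3D per frame

z = prior.sample(torch.tensor([42]))                      # latent for one action category
seq_len = 60                                              # variable-length in the real model
hidden, _ = decoder(z.unsqueeze(1).repeat(1, seq_len, 1)) # condition every timestep on z
poses = to_pose(hidden).view(1, seq_len, 24, 3)
print(poses.shape)                                        # torch.Size([1, 60, 24, 3])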

Year of completion: December 2022
Advisor: Ravi Kiran Sarvadevabhatla

Related Publications


Visual Grounding for Multi-modal Applications


Kanishk Jain

Abstract

The task of Visual Grounding (VG) lies at the intersection of computer vision and natural language processing. It requires spatially localizing an entity in a visual scene based on its linguistic description. The capability to ground language in the visual domain is of significant importance for many real-world applications, especially for human-machine interaction. One such application is language-guided navigation, where the navigation of autonomous vehicles is modulated using a linguistic command. The VG task is intimately linked with the task of vision-language navigation (VLN), as both tasks require reasoning about the linguistic command and the visual scene simultaneously. Existing approaches to VG can be divided into two categories based on the type of localization performed: (1) bounding-box/proposal-based localization and (2) pixel-level localization. This work focuses on pixel-level localization, where the segmentation mask corresponding to the entity/region referred to by the linguistic expression is predicted. The research in this thesis focuses on a novel modeling strategy for the visual and linguistic modalities in the VG task, followed by the first-ever visual-grounding-based approach to the VLN task. We first present a novel architecture for the task of pixel-level localization, also known as Referring Image Segmentation (RIS). The architecture is based on the hypothesis that both intra-modal (word-word and pixel-pixel) and inter-modal (word-pixel) interactions are required to successfully identify the referred entity. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions synchronously in a single step. We validate our hypothesis empirically against existing methods and achieve state-of-the-art results on RIS benchmarks. Finally, we propose the novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on a linguistic command. RNR differs from RIS, which grounds an object referred to by the natural language expression rather than a navigable region. We additionally introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2Car dataset with segmentation masks for the regions described by the linguistic commands. We present extensive ablations and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework.
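To illustrate what performing intra-modal and inter-modal interactions synchronously can look like, the sketch below concatenates word tokens and flattened pixel tokens into a single sequence and applies one multi-head self-attention step, so word-word, pixel-pixel, and word-pixel interactions all happen at once; the refined pixel tokens are then decoded into mask logits. The shapes, token counts, and single-layer setup are illustrative assumptions rather than the proposed architecture.

import torch
import torch.nn as nn

d = 256
attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

word_feats = torch.randn(1, 20, d)                # 20 tokens from the referring expression
pixel_feats = torch.randn(1, 32 * 32, d)          # flattened visual feature map

joint = torch.cat([word_feats, pixel_feats], dim=1)   # one joint token sequence
fused, _ = attn(joint, joint, joint)                  # all three interaction types in one step

# The refined pixel tokens are reshaped and decoded into a coarse segmentation mask.
pixel_refined = fused[:, 20:].transpose(1, 2).reshape(1, d, 32, 32)
mask_logits = nn.Conv2d(d, 1, kernel_size=1)(pixel_refined)
print(mask_logits.shape)                              # torch.Size([1, 1, 32, 32])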

Year of completion: December 2022
Advisor: Vineet Gandhi

Related Publications


Downloads

thesis

More Articles …

1. I-Do, You-Learn: Techniques for Unsupervised Procedure Learning using Egocentric Videos
2. Continual and Incremental Learning in Computer-aided Diagnosis Systems
3. 3D Interactive Solution for Neuroanatomy Education
4. Towards Generalization in Multi-View Pedestrian Detection