Lip-to-Speech Synthesis


Rudrabha Mukhopadhyay

Abstract

This thesis explores the development and advancement of lip-to-speech synthesis techniques, addressing the challenge of generating speech directly from visual lip movements. Unlike text-to-speech systems, which rely on explicit linguistic information in the form of text tokens, lip-to-speech synthesis must interpret ambiguous visual cues, presenting the unique challenge that similar lip shapes can produce different sounds. Inspired by the chronological advancements in text-to-speech synthesis, the research goals are broken into single-speaker lip-to-speech, where a separate model is trained for each speaker with a large amount of speaker-specific data, followed by multi-speaker approaches, which aim to train a single model that works for any speaker in the wild.

The first work presented in this thesis deals with the lip-to-speech generation problem for a large vocabulary in unconstrained settings, albeit with a model trained for a particular speaker. This work introduced a novel sequence-to-sequence model that leveraged spatio-temporal convolutional architectures to capture the fine-grained temporal dynamics of lip movements, together with a monotonic attention mechanism that more accurately aligned the visual features with the corresponding speech parameters. Testing on the LRS2 dataset showed a 24% improvement in intelligibility metrics over baseline methods. The work also released a new dataset providing sufficient speaker-specific data, with a diverse vocabulary of around 5,000 words, to support the development of accurate speaker-specific models. While this approach showed promise, it was limited to single-speaker scenarios and did not scale effectively to sentence-level multi-speaker tasks, necessitating further research.
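
To make the pipeline concrete, here is a minimal, hypothetical PyTorch sketch of this kind of architecture: a spatio-temporal (3D) convolutional encoder over lip frames feeding an attention-based decoder that predicts mel-spectrogram frames. All names and sizes are illustrative, standard soft attention stands in for the monotonic attention described above, and the actual thesis model differs.

    import torch
    import torch.nn as nn

    class LipToSpeech(nn.Module):
        """Illustrative sketch, not the thesis's exact model."""
        def __init__(self, mel_dim=80, hidden=256):
            super().__init__()
            # Spatio-temporal (3D) convolutions capture fine-grained lip dynamics.
            self.encoder = nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
                nn.MaxPool3d((1, 2, 2)),
                nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time, pool out space
            )
            self.rnn = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
            self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
            self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
            self.proj = nn.Linear(hidden, mel_dim)

        def forward(self, frames):                                 # (B, 3, T, H, W) lip crops
            feats = self.encoder(frames).squeeze(-1).squeeze(-1)   # (B, 64, T)
            feats, _ = self.rnn(feats.transpose(1, 2))             # (B, T, 2*hidden)
            ctx, _ = self.attn(feats, feats, feats)  # soft attention over visual features
            out, _ = self.decoder(ctx)
            return self.proj(out)                                  # (B, T, mel_dim)

    mel = LipToSpeech()(torch.randn(2, 3, 16, 48, 96))
    print(mel.shape)  # torch.Size([2, 16, 80]); a vocoder would turn this into audio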

 

Year of completion: June 2025
Advisor: Dr. C.V. Jawahar


Related Publications

Downloads

thesis

Efficient Physically Based Rendering with Analytic and Neural Approximations


Aakash KT

Abstract

Path tracing is ubiquitous for photorealistic rendering of various real-world appearances. It follows the principles of light transport adapted from physics, which describe light propagation as a set of integral equations. These equations are stochastically evaluated by tracing light rays in virtual scenes. Such stochastic evaluations with ray tracing form the bulk of the path tracing algorithm, which is widely used in industry.

The error of the stochastic evaluations in path tracing decreases only in proportion to the inverse square root of the number of samples. This, coupled with the fact that the underlying integrals are often complex and high-dimensional, results in large computational cost. Research efforts have thus largely focused on accelerating path tracing by improving the stochastic sampling processes. However, it is interesting to look at efficient analytic approximations obtained by making reasonable assumptions about the nature of these light transport integrals. Such analytic methods have the potential to achieve zero variance at the outset. In practice, they are often used in conjunction with stochastic methods, thereby achieving lower variance than their fully stochastic counterparts.
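
The inverse-square-root behavior is easy to demonstrate on a toy integral; the following numpy snippet (illustrative only, no actual path tracing) shows the Monte Carlo error shrinking roughly tenfold for every hundredfold increase in sample count.

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(np.pi * x) ** 2          # toy integrand; true integral = 0.5
    for n in (10**2, 10**4, 10**6):
        # Average the absolute error over repeated runs to expose the trend.
        err = np.mean([abs(f(rng.random(n)).mean() - 0.5) for _ in range(20)])
        print(f"N={n:>7}  mean |error| ~ {err:.5f}")
    # Error scales as O(1/sqrt(N)): 100x more samples buys only ~10x less noise.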

The primary focus of this thesis is to develop new (semi-)analytic methods and improve existing ones to accelerate direct-lighting computations in path tracing. We base our research on the theory of Linearly Transformed Cosines (LTC) applied to direct lighting from area lights. The LTC method produces plausible renderings by building on the principles of light transport from the ground up and has proved useful for tasks other than real-time rendering. We make the following three contributions that either build on LTCs or improve them.

We first explore fully-analytic direct lighting for arbitrarily shaped area lights, built on LTCs at the core. Due to its underlying assumptions, the LTC method can only handle polygonal area lights. Furthermore, rendering shadows with LTCs requires stochastic evaluations. Our contribution relaxes these assumptions, enabling fully-analytic direct lighting with shadows from arbitrarily shaped area lights. We show that our method achieves plausible, noise-free renderings compared to semi-analytic LTCs and ground-truth ray tracing, given an equal compute budget.
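
For intuition, the analytic core that LTC-style methods rely on is the closed-form integral of a cosine lobe over a spherical polygon, evaluated edge by edge with no rays traced. The sketch below (assumptions: shading-frame vertices with the normal along +z, a uniform emitter, hemisphere clipping omitted) is the generic textbook form, not the thesis's extension to arbitrary shapes.

    import numpy as np

    def polygon_cosine_integral(verts):
        """Closed-form clamped-cosine integral over a spherical polygon.
        verts: (k, 3) unit vectors from the shading point to the polygon
        vertices, expressed in the shading frame (normal = +z)."""
        total = 0.0
        k = len(verts)
        for i in range(k):
            a, b = verts[i], verts[(i + 1) % k]
            edge_normal = np.cross(a, b)
            theta = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
            # Each edge contributes its subtended angle, projected onto +z.
            total += theta * edge_normal[2] / np.linalg.norm(edge_normal)
        return abs(total) / (2.0 * np.pi)

    # Unit square light one unit above the shading point, facing it.
    quad = np.array([[-0.5, -0.5, 1.0], [0.5, -0.5, 1.0],
                     [0.5, 0.5, 1.0], [-0.5, 0.5, 1.0]], dtype=float)
    quad /= np.linalg.norm(quad, axis=1, keepdims=True)
    print(polygon_cosine_integral(quad))   # noise-free irradiance, zero variance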

     

Year of completion: June 2025
Advisor: Dr. P. J. Narayanan

Related Publications

Downloads

thesis

3D Shape Analysis: Reconstruction and Classification


Jinka Sai Sagar

Abstract

The reconstruction and analysis of 3D objects by computational systems has been an intensive, long-standing research problem in the graphics and computer vision communities. Traditional acquisition systems are largely restricted to studio setups that require multiple synchronized and calibrated cameras. The advent of active depth sensors, such as time-of-flight and structured-light sensors, made 3D acquisition more widely feasible. This advancement has paved the way for research problems such as 3D object localization, recognition, classification, and reconstruction, which demand sophisticated and elegant solutions to match their ever-growing applications. 3D human body reconstruction, in particular, has wide applications such as virtual mirrors and gait analysis. Lately, with the advent of deep learning, 3D reconstruction from monocular images has garnered significant interest in the research community, as it can be applied in in-the-wild settings.

Secondly, we propose PeeledHuman, a novel shape representation of the human body that is robust to self-occlusions. PeeledHuman encodes the human body as a set of peeled depth and RGB maps in 2D, obtained by performing ray tracing on the 3D body model and extending each ray beyond its first intersection. We learn these peeled maps in an end-to-end generative adversarial fashion using our novel framework, PeelGAN, which lets us predict the shape and color of the 3D human end to end at significantly lower inference times.
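
The peeled representation is easiest to see with a toy ray-caster. In the sketch below (purely illustrative; an analytic sphere stands in for the human mesh and the rays are orthographic), each layer stores the depth of the ray's i-th intersection, so the second layer recovers the self-occluded back surface.

    import numpy as np

    def peeled_depth_maps(res=64, center=(0.0, 0.0, 3.0), radius=1.0):
        """Two peeled depth maps of a sphere: depth of the first and second
        hit of each orthographic ray cast along +z (0 where the ray misses)."""
        xs = np.linspace(-2.0, 2.0, res)
        px, py = np.meshgrid(xs, xs)                 # one ray per pixel
        cx, cy, cz = center
        # Ray-sphere intersection: (z - cz)^2 = r^2 - (x - cx)^2 - (y - cy)^2
        d2 = radius**2 - (px - cx)**2 - (py - cy)**2
        hit = d2 >= 0.0
        front = np.zeros((res, res)); back = np.zeros((res, res))
        front[hit] = cz - np.sqrt(d2[hit])           # layer 1: visible surface
        back[hit] = cz + np.sqrt(d2[hit])            # layer 2: occluded back surface
        return front, back

    front, back = peeled_depth_maps()
    print(front[32, 32], back[32, 32])   # center ray: ~2.0 (front) and ~4.0 (back)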

     

Year of completion: May 2023
Advisor: Dr. Avinash Sharma

Related Publications

Downloads

thesis

Surrogate Approximations for Similarity Measures


Nagendar G

Abstract

This thesis targets the problem of constructing surrogate approximations for similarity measures to improve their performance in various applications. We present surrogate approximations for the popular dynamic time warping (DTW) distance, canonical correlation analysis (CCA), Intersection-over-Union (IoU), PCP, and PCKh measures. For DTW and CCA, our surrogate approximations are based on the corresponding definitions; for the IoU, PCP, and PCKh measures, we present a surrogate approximation using neural networks.

First, we propose a linear approximation of the standard DTW distance, speeding up its computation by learning the optimal alignment from the training data. In our next contribution, we propose a surrogate kernel approximation for CCA, which enables CCA to be used in the kernel framework and further improves its performance. In our final contribution, we propose a surrogate approximation technique that uses neural networks to learn a surrogate loss function over the IoU, PCP, and PCKh measures. We validate the IoU loss on semantic segmentation models, and the PCP and PCKh losses on human pose estimation models.
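
For reference, the exact quantity the linear surrogate approximates is the dynamic-programming DTW distance; a minimal numpy version (illustrative: 1-D sequences, absolute-difference cost) follows.

    import numpy as np

    def dtw_distance(x, y):
        """Exact DTW between two 1-D sequences in O(len(x) * len(y)) time."""
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(x[i - 1] - y[j - 1])
                # Extend the best of the three admissible warping-path moves.
                D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
        return D[n, m]

    a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
    b = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])   # time-warped copy of a
    print(dtw_distance(a, b))                      # 0.0: warping absorbs the shift

It is this quadratic-time alignment search that makes a learned linear approximation attractive when DTW must be evaluated many times.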

       

Year of completion: November 2023
Advisor: Dr. Avinash Sharma

Related Publications

Downloads

thesis

Learning with Weak Supervision for Visual Scene Understanding


Aditya Arun

Abstract

In recent years, computer vision has made remarkable progress in understanding visual scenes, including tasks such as object detection, human pose estimation, semantic segmentation, and instance segmentation. These advancements are largely driven by high-capacity models, such as deep neural networks, trained in fully supervised settings with large-scale labeled data sets. However, reliance on extensive annotations poses scalability challenges due to the significant human effort required to create these data sets. Fine-grained annotations, such as pixel-level segmentation masks, keypoint coordinates for pose estimation, or detailed object instance boundaries, provide the high precision needed for many tasks but are extremely time-consuming and costly to produce. Coarse annotations, on the other hand, such as image-level labels or approximate scribbles, are much easier and faster to create but lack the granularity required for detailed model supervision.

To address these challenges, researchers have increasingly explored alternatives to traditional supervised learning, with weakly supervised learning emerging as a promising approach. It mitigates annotation costs by using coarse, cheaper annotations during training, while the model must still produce the fine-grained predictions required at test time. Despite its potential, weakly supervised learning faces the challenge of transferring information from coarse annotations to fine-grained predictions, often encountering ambiguity and uncertainty in the process. Existing methods rely on various priors and heuristics to refine annotations, which are then used to train models for specific tasks. This involves managing uncertainty in latent variables during training and ensuring accurate predictions for both latent and output variables at test time.
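
As one concrete, commonly used instance of this training regime, consider multiple-instance learning for weakly supervised recognition: per-region scores are pooled into an image-level prediction so that only cheap image-level labels supervise training, while the choice of responsible region remains a latent variable. The PyTorch sketch below is a generic illustration, not the specific formulation developed in this thesis.

    import torch
    import torch.nn.functional as F

    def mil_image_level_loss(region_scores, image_labels):
        """region_scores: (B, R, C) logits for R candidate regions, C classes.
        image_labels:  (B, C) binary image-level labels.
        Max-pooling picks the region that best explains each label (the latent
        variable); gradients then flow only through that region's score."""
        image_logits, _ = region_scores.max(dim=1)
        return F.binary_cross_entropy_with_logits(image_logits, image_labels)

    scores = torch.randn(4, 100, 20, requires_grad=True)    # e.g. proposal scores
    labels = torch.randint(0, 2, (4, 20)).float()
    loss = mil_image_level_loss(scores, labels)
    loss.backward()
    print(loss.item())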

         

Year of completion: June 2025
Advisors: Prof. C.V. Jawahar and Prof. M. Pawan Kumar


Related Publications

Downloads

thesis
