
Improved Representation Spaces for Videos


Bipasha Sen

Abstract

Videos form an integral part of human lives and act as one of the most natural forms of perception, spanning both the spatial and the temporal dimensions: the spatial dimension emphasizes the content, whereas the temporal dimension emphasizes the change. Naturally, studying this modality is an important area of computer vision; notably, one must capture this high-dimensional modality efficiently to perform different downstream tasks robustly. In this thesis, we study representation learning for videos for two key aspects of video-based tasks: classification and generation. In a classification task, a video is compressed to a latent space that captures the key discriminative properties of the video relevant to the task. Generation, on the other hand, starts from a latent space (often a known space, such as a standard normal) and learns a valid mapping between the latent space and the video manifold. This thesis explores complementary representation techniques to develop robust representation spaces useful for diverse downstream tasks. It begins with video classification, concentrating on the specific task of lipreading (transcribing videos to text), or, in technical terms, classifying videos of mouth movements. Through this work, we propose a compressed generative space that self-augments the dataset, improving the discriminative capabilities of the classifier. Motivated by the findings of this work, we move on to finding an improved generative space, touching upon several key elements of video generation, including unconditional video generation, video inversion, and video super-resolution.

In the classification task, we study lipreading (visually recognizing speech from the mouth movements of a speaker), a challenging and mentally taxing task for humans to perform. Unfortunately, multiple medical conditions force people to depend on this skill in their day-to-day lives for essential communication. Patients suffering from Amyotrophic Lateral Sclerosis (ALS) often lose muscle control and, consequently, their ability to generate speech and to communicate via lip movements. Existing large datasets do not focus on medical patients or curate the personalized vocabulary relevant to an individual, and collecting the large-scale data of a single patient needed to train modern data-hungry deep learning models is extremely challenging. We propose a personalized network designed to lipread for an ALS patient using only one-shot examples. We rely on synthetically generated lip movements to augment the one-shot scenario, and a variational encoder-based domain adaptation technique bridges the real-synthetic domain gap. Our approach improves significantly over comparable methods, achieving a top-5 accuracy of 83.2% for the patient compared to 62.6%. Apart from evaluating our approach on the ALS patient, we also extend it to people with hearing impairment, who rely extensively on lip movements to communicate.
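The variational-encoder domain adaptation described above can be pictured with a small sketch. The following PyTorch snippet is only illustrative and not the thesis implementation: a variational encoder maps both real and synthetic lip-movement features into a shared latent space, a KL term regularizes the latent, and a classifier is trained on latents from both domains; all module names and dimensions are assumptions.

```python
# Illustrative sketch only: variational encoder + classifier for narrowing a
# real-synthetic domain gap. Feature/latent sizes are assumed, not from the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalEncoderClassifier(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=128, num_classes=50):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)       # latent mean
        self.logvar = nn.Linear(256, latent_dim)   # latent log-variance
        self.classifier = nn.Linear(latent_dim, num_classes)

    def forward(self, x):                          # x: (B, feat_dim) clip features
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.classifier(z), mu, logvar

def loss_fn(logits, labels, mu, logvar, beta=1e-3):
    # Cross-entropy on word labels (real or synthetic clips) + KL regularizer.
    ce = F.cross_entropy(logits, labels)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return ce + beta * kl
```

Training such a shared encoder on a mix of one-shot real clips and many synthetic clips is one simple way the real-synthetic gap can be narrowed.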
In the next part of the thesis, we focus on representation spaces for video-based generative tasks. Generating videos is a complex task, typically accomplished by generating a set of temporally coherent images frame-by-frame. This approach confines the expressivity of videos to image-based operations on individual frames, necessitating network designs that can achieve temporally coherent trajectories in the underlying image space.

We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using implicit neural representations (INRs): a multi-layered perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted by a meta-network, a hypernetwork trained on the neural representations of multiple video instances. The meta-network can then be sampled to generate diverse novel videos, enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space, showcasing many interesting properties not possible with existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also inpaint missing portions of videos to recover temporally coherent full videos. We evaluate the space learned by INR-V on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting against existing baselines. INR-V significantly outperforms the baselines on several of these tasks, clearly showcasing the potential of the proposed representation space.

In summary, this thesis makes a significant contribution to the field of computer vision by exploring representation learning for videos. The proposed methods are thoroughly evaluated through extensive experimentation and analysis, which clearly demonstrate their advantages over existing works. These findings have the potential to advance a range of video-based applications, including personalized healthcare, entertainment, and communication. By developing robust representation spaces that improve video classification and generation, this work opens up new possibilities for more natural and effective ways of perceiving, understanding, and interacting with videos.
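To make the hypernetwork-plus-INR idea concrete, here is a minimal PyTorch sketch; it is not the released INR-V code, and the coordinate encoding, layer sizes, and latent dimension are all illustrative assumptions.

```python
# Illustrative sketch only: a tiny INR maps an (x, y, t) coordinate to RGB, and a
# hypernetwork predicts that INR's weights from a per-video latent code.
import math
import torch
import torch.nn as nn

COORD_DIM, HIDDEN, LATENT = 3, 64, 128                        # (x, y, t) -> RGB
SHAPES = [(HIDDEN, COORD_DIM), (HIDDEN,), (3, HIDDEN), (3,)]  # W1, b1, W2, b2
N_PARAMS = sum(math.prod(s) for s in SHAPES)

class HyperINR(nn.Module):
    def __init__(self):
        super().__init__()
        self.hyper = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                   nn.Linear(256, N_PARAMS))

    def forward(self, z, coords):
        # z: (LATENT,) latent for one video; coords: (N, 3) normalized (x, y, t)
        flat, params, i = self.hyper(z), [], 0
        for shape in SHAPES:
            n = math.prod(shape)
            params.append(flat[i:i + n].view(*shape))
            i += n
        w1, b1, w2, b2 = params
        h = torch.sin(coords @ w1.t() + b1)        # SIREN-style periodic activation
        return torch.sigmoid(h @ w2.t() + b2)      # RGB values in [0, 1]

rgb = HyperINR()(torch.randn(LATENT), torch.rand(1024, COORD_DIM))  # (1024, 3)
```

Sampling or optimizing the latent code z then yields novel videos, inversion, or inpainting, which is the kind of downstream use the abstract describes.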

Year of completion: August 2023
Advisors: C V Jawahar, Vinay P Namboodiri


Open-Vocabulary Audio Keyword Spotting with Low Resource Language Adaptation


Kirandevraj R

Abstract

Open-vocabulary keyword spotting solves the problem of spotting audio keywords in an utterance. The keyword set can include keywords the system has not seen during training, functioning as zero-shot keyword spotting. The traditional method uses ASR to transcribe audio to text and searches the text to spot keywords. Other methods obtain posteriors from a Deep Neural Network (DNN) and use template-matching algorithms to find similarities. Keyword spotting does not require transcribing the entire audio; it focuses on detecting the specific words of interest.

In this thesis, we explore the use of an Automatic Speech Recognition (ASR) system for keyword spotting. We demonstrate that the intermediate representation of an ASR model can be used for open-vocabulary keyword spotting, and we show the effectiveness of the Connectionist Temporal Classification (CTC) loss for learning word embeddings for keyword spotting. We propose a novel method that combines the CTC loss with the traditional triplet loss to learn word embeddings on the TIMIT English audio dataset. This method achieves an Average Precision (AP) of 0.843 over 344 words unseen by the model trained on TIMIT, whereas the Multi-View recurrent method, which learns jointly on text and acoustic embeddings, achieves only 0.218 for out-of-vocabulary words.

We further propose a novel method to generalize our approach to the low-resource languages Tamil, Vallader, and Hausa. Here we use transliteration to convert Tamil script to English, so that the Tamil words, written with English letters, sound similar. The model predicts the transliterated text for input Tamil audio with the CTC and triplet loss functions. We show that this method helps transfer knowledge learned from the high-resource language (English) to the low-resource language (Tamil). We also reduce the model size so that it can work in small-footprint scenarios such as mobile phones. To this end, we explore various knowledge distillation loss functions, including MSE, KL-divergence, and cosine-embedding losses, and observe that a small-footprint ASR representation is competitive with knowledge distillation methods for small-footprint keyword spotting. This methodology makes use of existing ASR networks trained on massive datasets and converts them into open-vocabulary keyword spotting systems that can also be generalized to low-resource languages.
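As a rough illustration of the joint CTC-plus-triplet training described above, the PyTorch sketch below trains one acoustic encoder with both losses; it is not the thesis code, and the feature size, vocabulary, and loss weighting are assumptions.

```python
# Illustrative sketch only: acoustic encoder trained with CTC (character targets)
# and a triplet loss on pooled embeddings for open-vocabulary keyword spotting.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticWordEncoder(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, vocab=29):    # characters + blank (assumed)
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.char_head = nn.Linear(2 * hidden, vocab)          # per-frame logits for CTC

    def forward(self, x):                                      # x: (B, T, feat_dim)
        h, _ = self.rnn(x)
        log_probs = self.char_head(h).log_softmax(-1)          # (B, T, vocab)
        embedding = F.normalize(h.mean(dim=1), dim=-1)         # pooled word embedding
        return log_probs, embedding

def joint_loss(log_probs, targets, in_lens, tgt_lens, anchor, pos, neg, alpha=0.5):
    # CTC over character targets plus a triplet margin loss on word embeddings.
    ctc = F.ctc_loss(log_probs.transpose(0, 1), targets, in_lens, tgt_lens, blank=0)
    triplet = F.triplet_margin_loss(anchor, pos, neg, margin=0.4)
    return ctc + alpha * triplet
```

The same recipe carries over to transliterated Tamil targets: only the character inventory of the CTC head changes, not the embedding side.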

Year of completion: November 2022
Advisors: C V Jawahar, Vinay P Namboodiri, Vinod Kumar Kurmi


On Designing Efficient Deep Neural Networks for Semantic Segmentation


Nikitha Vallurupalli

Abstract

Semantic segmentation is an essential primitive in real-time systems such as autonomous navigation, which require processing at high frame rates. Hence, for models to be practically applicable, they must be compact and fast while achieving high prediction accuracy. Previous research into semantic segmentation has focused on creating high-performance deep learning architectures; most of the time, these best-performing models are complex and deep, have long processing times, and demand significantly more processing capacity. Another relevant area of research is model compression, through which we can obtain lightweight models. Considering that there have also been works producing mainstream lightweight semantic segmentation models at the expense of performance, we design models that strike a desirable balance between performance and latency: methods and architectures that give high performance while running in real time in resource-constrained settings.

We identify redundancies in the existing state-of-the-art approaches and propose a compact architecture family called ESSNet, with accuracy comparable to the state of the art while using only a fraction of the space and computational power of those networks. We propose convolutional module designs with sparse coding theory as a premise, and we present two real-time encoder backbones employing the proposed modules. We empirically evaluate the efficacy of our proposed layers and compare them with existing approaches. Secondly, we explore the need for optimization during the training phase of the proposed models and present a novel training method called Gradual Grouping that results in models with improved implementation-efficiency vs. accuracy trade-offs. Additionally, we conduct extensive experiments varying architecture hyper-parameters such as network depth, kernel sizes, dilation rates, split branching, and additional context-extraction modules. We also present a compact architecture using multi-branch separable convolutional layers with different dilation rates.
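The multi-branch separable convolutions with different dilation rates mentioned at the end of the abstract can be sketched as below in PyTorch; this is only an illustration of the general ingredient, not the ESSNet module, and the channel counts and dilation rates are assumptions.

```python
# Illustrative sketch only: a residual block of depthwise-separable 3x3 branches
# with different dilation rates, a common recipe for compact segmentation backbones.
import torch
import torch.nn as nn

class SeparableBranch(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=dilation,
                                   dilation=dilation, groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return torch.relu(self.bn(self.pointwise(self.depthwise(x))))

class MultiBranchBlock(nn.Module):
    def __init__(self, channels=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(SeparableBranch(channels, d) for d in dilations)

    def forward(self, x):
        return x + sum(branch(x) for branch in self.branches)   # residual fusion

out = MultiBranchBlock()(torch.randn(1, 64, 128, 256))          # (N, C, H, W)
```

Depthwise-separable branches keep parameters and FLOPs low, while the mixed dilation rates enlarge the receptive field, which is the kind of performance-latency balance the thesis targets.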

Year of completion: November 2022
Advisors: C V Jawahar, Girish Varma


Towards Handwriting Recognition and Search in Indic & Latin Scripts


Santhoshini Gongidi

Abstract

ML-powered document image analysis can enable intelligent solutions for bringing handwritten information into the digital world. Two major components of handwriting understanding are handwritten text recognition (HTR) and handwritten search. The former enables the conversion of handwritten text to a digital format, whereas the latter provides easy access to handwritten information scattered across books, archives, manuscripts, and so on. This thesis focuses on both of these problems.

Handwritten document image analysis for Indic scripts is still in its nascent stage compared to Latin scripts. For example, many commercial applications and open-source demonstrations are available for Latin scripts, while hardly any such applications are known for Indic scripts. Developing solutions for Indic scripts is challenging due to (i) the variety of scripts within the Indian subcontinent, (ii) the lack of large annotated datasets and the difficulty of collecting data for multiple scripts, and (iii) inherent challenges in Indic scripts, such as inflections and the joining of multiple glyphs to form an akshara (the equivalent of a character in Latin scripts). While challenging, it is also crucial to develop HTR and handwritten search approaches for Indic scripts. Therefore, this thesis majorly focuses on approaches for Indic HTR and Indic handwritten search.

In the last two decades, large digitization projects have converted paper documents and ancient historical manuscripts into digital form. However, they often remain inaccessible due to the unavailability of robust HTR solutions. Recognizing handwritten text is fundamental to any modern document analysis system. In recent years, efforts toward developing text recognition systems have advanced due to the success of deep neural networks and the availability of annotated datasets; this is especially true for Latin scripts. In this thesis, we discuss the standard text recognition pipeline comprising various neural network modules. We then present a simple and effective way to improve the text recognition pipeline and training approach, and we report the improvement from our approach on four benchmark datasets in Latin and Indic scripts.

The existing state-of-the-art approaches for Latin HTR and Latin handwritten search are highly data-driven. Due to the lack of large-scale data, developing Indic HTR and Indic handwritten search is challenging. Therefore, we release a collective Indic handwritten dataset with text images from 10 of the most widely spoken Indic scripts, and we establish a strong baseline for text recognition in prominent Indic scripts. Our recognition scheme follows contemporary design principles from the recognition literature and yields competitive results on English. We also explore the utility of pre-training for Indic HTR. We hope our efforts will catalyze research and fuel applications related to handwritten document understanding in Indic scripts.

Finally, we investigate the problem of handwritten search and retrieval for unlabeled collections. Handwritten search pipelines are needed on online platforms such as e-libraries and digital archives; such pipelines can efficiently search through handwritten collections and present relevant results, much like Google Search. With its ease of access and time-saving capability, a handwritten search application can prove valuable to the many communities that study such historical documents. In this thesis, we present one such pipeline for handwritten search that performs retrieval on new and unseen collections. The proposed retrieval is not fine-tuned for specific writing styles or unknown vocabulary in the new collection; therefore, it can be applied to new, unlabeled collections.
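For readers unfamiliar with the standard text recognition pipeline the abstract refers to (convolutional features, a recurrent encoder, and per-timestep character predictions trained with CTC), a minimal PyTorch sketch follows; it is not the thesis architecture, and the input size and layer widths are assumptions.

```python
# Illustrative sketch only: a small CRNN for handwritten line/word recognition
# trained with CTC.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_chars=100, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.rnn = nn.LSTM(128 * 8, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_chars + 1)      # +1 for the CTC blank

    def forward(self, images):                                # (B, 1, 32, W) grayscale
        f = self.cnn(images)                                  # (B, 128, 8, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)                  # (B, W/4, 128*8) sequence
        h, _ = self.rnn(f)
        return self.head(h).log_softmax(-1)                   # per-timestep char log-probs

log_probs = CRNN()(torch.randn(2, 1, 32, 128))                # (2, 32, 101)
```

Handwritten word spotting can then reuse either the decoded text or the intermediate features of such a recognizer.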

Year of completion: November 2022
Advisor: C V Jawahar


Learnable HMD Facial De-occlusion for VR Applications in a Person-Specific Setting


Surabhi Gupta

Abstract

Immersive technologies such as Virtual Reality (VR) and Augmented Reality (AR) are among the fastest-growing and most fascinating technologies today. As the name suggests, these technologies promise to provide users with a much better experience through immersive head-mounted displays. Medicine, culture, education, and architecture are some areas that have already taken advantage of this technology. Popular video conferencing platforms such as Microsoft Teams, Zoom, and Google Meet are working to improve the user experience by allowing users to use digital avatars; nevertheless, they lack immersiveness and realism. What if we extend these applications to virtual reality platforms so that people can feel as if they are talking to each other in person? Integrating virtual reality platforms into collaborative spaces such as virtual telepresence systems has become quite popular after globalization, since it enables multiple users to share the same virtual environment, thus mimicking real-life face-to-face interactions.

For a better immersive experience in virtual telepresence and communication systems, it is essential to recover the entire face, including the portion masked by the headset (e.g., a Head-Mounted Display, abbreviated HMD). Several methods have been proposed in the literature that address this problem in various forms, such as HMD removal and face inpainting. Despite some remarkable explorations, none of these methods provides usable results as expected on virtual reality platforms. Addressing these challenges for the real-world deployment of AR/VR-based applications is drawing growing attention. Considering the limitations and usability of previous solutions, we explore various research challenges and propose a practical approach to facial de-occlusion/HMD removal for virtual telepresence systems.

This thesis motivates and introduces the audience to various research challenges in facial de-occlusion, familiarizes them with existing solutions and their inapplicability in our problem domain, and then presents the idea and formulation of our proposed solution. With this view, the first chapter lays out the outline of this thesis. In the second chapter, we propose a method for facial de-occlusion and discuss the importance of personalized facial de-occlusion methods in enhancing the sense of realism in virtual environments. The third chapter discusses refinements to the previously proposed network that improve the reconstruction of the eye region. Last but not least, the final chapter briefly discusses existing face datasets for face reconstruction and inpainting, followed by an overview of the dataset we collected for this work, from acquisition to making it usable for training deep learning models. In addition, we also attempt to extend image-based facial de-occlusion to video frames using off-the-shelf approaches, briefly explained in Appendix A.
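As a rough picture of what an HMD de-occlusion network does, the PyTorch sketch below shows a toy encoder-decoder that receives the occluded face plus the occlusion mask and is trained to reconstruct the hidden region; it is not the thesis model, and the architecture and loss weighting are assumptions.

```python
# Illustrative sketch only: mask-conditioned encoder-decoder for face de-occlusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeocclusionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),     # RGB + mask channels
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, occluded, mask):        # occluded: (B, 3, H, W), mask: (B, 1, H, W)
        x = torch.cat([occluded * (1 - mask), mask], dim=1)
        return self.decoder(self.encoder(x))

def inpaint_loss(pred, target, mask, hole_weight=6.0):
    hole = F.l1_loss(pred * mask, target * mask)                # region hidden by the HMD
    valid = F.l1_loss(pred * (1 - mask), target * (1 - mask))   # visible region
    return hole_weight * hole + valid
```

A person-specific version, as studied in the thesis, would train such a model on images of the target subject so that identity-specific eye-region details are reconstructed.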

Year of completion: November 2022
Advisors: Avinash Sharma, Anoop M Namboodiri


More Articles …

1. Interactive Video Editing using Machine Learning Techniques
2. Refining 3D Human Digitization Using Learnable Priors
3. Deep Neural Models for Generalized Synthesis of Multi-Person Actions
4. Visual Grounding for Multi-modal Applications