
Audio-Visual Speech Recognition and Synthesis

 


Abhishek Jha

Abstract

Understanding speech in the absence of audio, from the visual perception of lip motion, can aid a variety of computer vision applications. Systems that comprehend 'silent speech' hold promise for low-bandwidth video calling, speech transmission in acoustically noisy environments, and aids for the hearing impaired. While it presents numerous opportunities, modelling lips in silent speech videos by observing the lip motion of a speaker is highly difficult. Although developments in automatic speech recognition (ASR) have yielded better audio speech recognition systems over the last two decades, their performance deteriorates drastically in the presence of noise. This calls for a computer vision solution to the speech understanding problem. In this thesis, we present two solutions for modelling lips in silent speech videos.

In the first part of the thesis, we propose a word-spotting solution for searching spoken keywords in silent lip videos. Our contributions to visual speech recognition are twofold: 1) we develop a pipeline for recognition-free retrieval and show its performance against recognition-based retrieval on a large-scale dataset and on a set of out-of-vocabulary words; 2) we introduce a query-expansion technique using pseudo-relevance feedback and propose a novel re-ranking method based on maximizing the correlation between the spatio-temporal landmarks of the query and those of the top retrieval candidates. The proposed pipeline improves the baseline performance by over 35% on the word-spotting task on one of the largest lip-reading corpora. We demonstrate the robustness of our method through a series of experiments investigating domain invariance and out-of-vocabulary prediction, with a careful analysis of the results. We also present qualitative results showing success and failure cases, and finally demonstrate an application of our method by spotting words in an archaic speech video.

In the second part of our work, we propose a lip-synchronization solution for 'visually redubbing' speech videos in a target language. Current methods of adapting a native speech video to a foreign language either place subtitles in the video, which distracts the viewer, or redub the audio in the target language, which leaves the speaker's lip motion unsynchronized with the redubbed audio and makes the video appear unnatural. We propose two lip-synchronization methods: 1) cross-accent lip synchronization, for dubbing into a different accent of the same language, and 2) cross-language lip synchronization, for speech videos dubbed into a different language. Since the visemes remain the same in cross-accent dubbing, we propose a dynamic programming algorithm to align the visual speech of the original video with the accented speech of the target audio (see the sketch below). In cross-language dubbing the overall linguistics change, hence we propose a lip-synthesis model conditioned on the redubbed audio. Finally, a user study validates our claim of a better viewing experience compared to baseline methods. We demonstrate both methods by visually redubbing Andrew Ng's machine learning tutorial video clips in Indian-accented English and in Hindi, respectively.
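The cross-accent alignment can be viewed as a classic dynamic-programming time warp over two feature sequences. Below is a minimal sketch, assuming frame-level feature vectors for the original visual speech and for the accented target audio, with a Euclidean frame cost; the feature choice and cost are illustrative assumptions, not the thesis's exact algorithm.

```python
import numpy as np

def align_speech(src, tgt):
    """Monotonically align two feature sequences by dynamic programming
    (DTW-style). src: (N, D) visual-speech features from the original
    video; tgt: (M, D) features of the accented target audio.
    Returns a list of (src_frame, tgt_frame) index pairs."""
    n, m = len(src), len(tgt)
    # Pairwise frame cost: Euclidean distance between feature vectors.
    cost = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],       # hold the target frame
                acc[i, j - 1],       # hold the source frame
                acc[i - 1, j - 1],   # advance both sequences
            )
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The recovered path indicates which original lip frames to hold or skip so that the mouth motion tracks the accented audio.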
In the final part of this thesis, we propose an improved method for 2D lip-landmark localization. We investigated current landmark-localization techniques in the facial domain and in human-pose estimation to uncover their shortcomings when adapted to the task of lip-landmark localization. Present state-of-the-art methods in the domain treat lip landmarks as a subset of facial landmarks and hence do not explicitly optimize for them. We propose a new lip-centric loss formulation on the existing stacked-hourglass architecture, which improves the baseline performance; a sketch of this loss appears after the abstract. We use the 300W and 300VW face datasets to evaluate our method and compare it with the baselines.

Overall, in this thesis we examined current methods of lip modelling, investigated their shortcomings and proposed solutions to overcome those challenges. We performed detailed experiments and ablation studies on our proposed methods and reported both success and failure cases. We compared our solutions with the current baselines on challenging datasets, reporting quantitative results and demonstrating qualitative performance. Our proposed solutions improve the baseline performance in their respective domains.
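As an illustration of the lip-centric loss idea, here is a minimal sketch assuming per-landmark heatmap outputs from a stacked-hourglass network; the uniform `lip_weight` and the 68-point 300W convention (lip landmarks at indices 48-67) are assumptions, not the thesis's exact formulation.

```python
import torch

def lip_centric_loss(pred, gt, lip_idx=range(48, 68), lip_weight=4.0):
    """Heatmap MSE loss that up-weights the lip landmarks.

    pred, gt: (B, K, H, W) tensors with one heatmap channel per
    landmark. With lip_weight > 1, the optimizer is explicitly
    pushed to refine the mouth region instead of treating lips
    as just another subset of the face."""
    per_landmark = ((pred - gt) ** 2).mean(dim=(2, 3))       # (B, K)
    weights = torch.ones(pred.shape[1], device=pred.device)  # (K,)
    weights[list(lip_idx)] = lip_weight
    return (per_landmark * weights).mean()
```

In a stacked hourglass, such a loss would typically be applied to every stack's intermediate prediction (intermediate supervision), not only the final output.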

 

Year of completion: April 2019
Advisors: C V Jawahar, Vinay P. Namboodiri

Related Publications

  • Abhishek Jha, Vinay P. Namboodiri and C.V. Jawahar - Spotting Words in Silent Speech Videos: A Retrieval-based Approach, Machine Vision and Applications, 2019. [PDF]

  • Abhishek Jha, Vinay P. Namboodiri and C.V. Jawahar - Word Spotting in Silent Lip Videos, IEEE Winter Conference on Applications of Computer Vision (WACV 2018), Lake Tahoe, CA, USA, 2018. [PDF]


Downloads

thesis

On Compact Deep Neural Networks for Visual Place Recognition, Object Recognition and Visual Localization

 


Soham Saha

Abstract

There has been an immense increase in the use of deep neural networks in recent times, owing to the availability of more data and greater computing power. With their recent success, it has become a trend to use them extensively in real-time applications. However, the size of deep models can make them unusable on memory-constrained devices. In this thesis, we explore several neural-network compression techniques for three tasks: i) visual place recognition, ii) object recognition and iii) visual localization. We explore explicit compression methods for visual place recognition and object recognition, achieved by modifying the learned weight matrices. Furthermore, we look at compression attained through architectural modifications to the network itself, proposing novel training procedures and new loss functions for object recognition and visual localization.

The task of visual place recognition requires us to correctly identify a place given its image, by finding images of the same place in the dataset. Performing this on low-memory devices such as mobile phones and robotic systems is a challenging problem. The state-of-the-art models for this task use deep learning architectures with close to 100 million parameters, taking over 400 MB of memory. This makes them infeasible to deploy on low-memory devices and gives rise to the need to compress them. Hence, we study the effectiveness of explicit model-compression techniques such as trained quantization and pruning on one of the most effective visual place recognition models. We show that a compressed network can be created by starting with a pre-trained model and then fine-tuning it via trained pruning and quantization. Through this training method, the compressed model produces the same mAP as the original uncompressed network. We achieve almost 50% parameter reduction through pruning with no loss in mAP, and 70% reduction with close to 2% mAP reduction, while also performing trained 8-bit quantization. Together with 5-bit quantization, we achieve about 50% parameter reduction by pruning with only about a 3% reduction in mAP. The resulting compressed networks have sizes of around 30 MB and 65 MB, which makes them easily usable on memory-constrained devices.

We next move on to compression through low-rank approximation for the task of image classification. Traditional compression algorithms for deep networks perform low-rank approximation on the learned weight matrices after training has completed. We instead propose to perform low-rank approximation during training itself, making the parameters of the approximated matrix learnable too by using a suitable loss function (a sketch of this factorization follows below). We show that, using our method, we can compress a base model providing 89% accuracy by 10x, with some loss in performance: the compressed model achieves an accuracy of about 84%.
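A minimal sketch of the learn-the-factors idea: the dense weight matrix is replaced by two low-rank factors that are trained end-to-end, rather than factorized after training. The class name, initialization and example rank are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Linear layer whose weight W is parameterized as U @ V (rank r).

    Parameter count drops from out_f * in_f to r * (out_f + in_f);
    both factors remain learnable, so training can compensate for
    the approximation instead of factorizing a frozen matrix."""
    def __init__(self, in_f, out_f, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_f, rank) * 0.01)
        self.V = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_f))

    def forward(self, x):
        # (x @ V^T) @ U^T == x @ (U @ V)^T, but cheaper when rank is small.
        return x @ self.V.t() @ self.U.t() + self.bias

# e.g. a 4096 -> 4096 layer at rank 64: ~16.8M params -> ~0.52M,
# roughly a 32x reduction in that layer.
```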
Next, we focus on developing compressed models for the object recognition task and propose a novel architecture for it. Deep neural networks for image classification typically consist of a convolutional feature extractor followed by a fully connected classifier network, with the predicted and ground-truth labels represented as one-hot vectors. Such a representation assumes that all classes are equally dissimilar; however, classes have visual similarities and often form a hierarchy. We propose an alternate architecture for the classifier network, called the Latent Hierarchy Classifier, which can discover a latent hierarchy of the classes while simultaneously reducing the number of parameters used in the original classifier. We show that, for some of the best-performing architectures on the CIFAR and ImageNet datasets, our proposed classifier and training procedure recover the accuracy while significantly reducing the parameter complexity of the classifier: the number of parameters in the classification layer drops by 98% for CIFAR-100 and by 41% for ImageNet-1K. We also verify that many visually similar classes are grouped together under the learnt hierarchy.

Finally, we address the problem of visual localization, where the task is to predict the camera orientation and pose of a given input scene. We propose an anchor-point classification based solution using single camera images only. Our proposed three-way branching of the feature extractor into an Anchor Point Classifier, a Relative Offset Regressor and an Absolute Regressor (sketched below) achieves <2 m translation localization and <5° pose localization on the Cambridge Landmarks dataset, while obtaining state-of-the-art median-distance orientation localization on all six scenes. Our method not only uses fewer parameters than previous deep learning based methods but also improves on the memory footprint as well as test time over nearest-neighbour based approaches.
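A minimal sketch of the anchor-point formulation, under assumed names and layer sizes: translation is predicted as a classified anchor position plus a regressed offset from it, with a separate absolute regressor for orientation.

```python
import torch
import torch.nn as nn

class AnchorPoseHead(nn.Module):
    """Three-way head on a shared feature extractor: an anchor-point
    classifier, a relative-offset regressor and an absolute regressor
    (here producing an orientation quaternion). A sketch of the idea,
    not the thesis's exact architecture."""
    def __init__(self, feat_dim, anchors):
        super().__init__()
        self.register_buffer("anchors", anchors)  # (A, 2) fixed x-y grid
        num_anchors = anchors.shape[0]
        self.cls = nn.Linear(feat_dim, num_anchors)         # nearest anchor
        self.offset = nn.Linear(feat_dim, num_anchors * 2)  # offset per anchor
        self.absolute = nn.Linear(feat_dim, 4)              # orientation

    def forward(self, feat):
        logits = self.cls(feat)                                  # (B, A)
        offsets = self.offset(feat).view(-1, len(self.anchors), 2)
        best = logits.argmax(dim=1)                              # (B,)
        rows = torch.arange(feat.shape[0], device=feat.device)
        # Translation = chosen anchor position + its predicted offset.
        xy = self.anchors[best] + offsets[rows, best]
        quat = nn.functional.normalize(self.absolute(feat), dim=1)
        return xy, quat, logits
```

The classifier handles the coarse location, so the offset regressor only needs to model small residuals around each anchor.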

 

Year of completion: April 2019
Advisors: C V Jawahar, Girish Varma

Related Publications


Downloads

thesis

Road Topology Extraction from Satellite Images by Knowledge Sharing


Anil Kumar

Abstract

Motivated by the human visual system, machine (computer) vision technology has been revolutionized over the last few decades, making a strong impact on a wide range of applications such as object recognition and face recognition and identification. Despite much encouraging advancement, however, many fields have yet to utilize the full potential of computer vision techniques. One such field is the analysis of satellite images for geo-spatial applications. In the past, building and launching satellites into space was expensive, a big hurdle to acquiring low-cost satellite imagery. With technological innovations, inexpensive satellites can now send terabytes of images of our planet on a daily basis, providing insights into global-scale economic, social and industrial processes. Significant applications of satellite imagery include urban planning, crop-yield forecasting, mapping and change detection. The most obvious application is extracting the topological road network from satellite images, as it plays an important role in planning mobility between geographical locations of interest. In the vision community, the extraction of road topology from satellite images is formulated as a binary segmentation problem. Despite the abundance of satellite imagery, the fundamental hurdle in applying deep-learning-based computer vision algorithms is the unavailability of labeled data, which leads to poor results. Another challenge is the visual ambiguity in identifying roads and their occlusion by various objects, which causes many standard computer vision algorithms to perform poorly and is a major concern.

In this thesis we develop deep-learning-based models and techniques to address the above challenges. In the first part of our work, we attempt road segmentation with less labeled data using existing unsupervised feature-learning techniques. In particular, we use a self-supervised technique to learn visual representations under artificial supervision, followed by fine-tuning the model on a labeled dataset. We use semantic image inpainting as the artificial/auxiliary supervision task: the improvement in road segmentation is directly related to the features the model captures while inpainting the erased regions of the image (see the sketch below). To further enhance feature learning, we propose to inpaint the difficult regions of the image, and we develop a novel adversarial training scheme to learn the mask used for erasing the image. The proposed scheme gradually learns to erase regions that are difficult to inpaint; this increasing difficulty of the inpainting task leads to better road segmentation. Additionally, we study the proposed approach on scene parsing and land classification in satellite images.
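A minimal sketch of the inpainting pretext step, under assumed encoder/decoder modules: regions given by a mask are erased, the network reconstructs them, and the loss is computed only on the erased pixels. In the thesis the mask itself is produced by an adversarially trained network that seeks hard-to-inpaint regions; here it is simply an input.

```python
import torch

def inpainting_pretext_step(encoder, decoder, images, mask, optimizer):
    """One self-supervised training step. images: (B, C, H, W);
    mask: (B, 1, H, W) with 1 marking erased pixels. The encoder is
    later fine-tuned on labeled data for road segmentation."""
    corrupted = images * (1 - mask)              # erase the masked regions
    recon = decoder(encoder(corrupted))
    # Supervise only where pixels were erased: that is where the model
    # must hallucinate structure such as road continuity.
    loss = (((recon - images) ** 2) * mask).sum() / mask.sum().clamp(min=1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```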

     

Year of completion: June 2019
Advisor: C V Jawahar

Related Publications


Downloads

thesis

Towards Scalable Applications for Handwritten Documents


Vijay Rowtula

Abstract

Even in today’s world, a large number of documents are generated as handwritten documents. This is especially true when knowledge or expertise is captured conveniently with the availability of electronic gadgets. Information extraction from handwritten document images has numerous applications, especially in the digitization of archived handwritten documents, the assessment of patient medical records and the automated evaluation of students' handwritten assessments, to mention a few. Document categorization and targeted information extraction from such sources can help in designing better search and retrieval systems for handwritten document images. Extracting information from handwritten medical records written in an ambulance for a doctor's interpretation at the hospital, or reading postal addresses to automate letter sorting, are examples where a document-image workflow helped scale a system with minimal human intervention. In such workflow systems, images flow across subjects who can be in different locations. Our work is motivated by the success of these document-image workflow systems, which were put into practice when handwriting recognition accuracy was unacceptably low. Our goal is to bring scalability to handwritten document processing, enhancing the throughput of analysis by employing the multitude of developments in the document-image space.

In this thesis, we initially present a document-image workflow system that helps scale handwritten student assessments in a typical university setting. We observed that this improves efficiency, since book-keeping time as well as physical paper movement is minimized. An electronic workflow can make anonymization easy, alleviating the fear of bias in many cases. Parallel and distributed assessment by multiple instructors is also straightforward in an electronic workflow system. At the heart of our solution, we have (i) a distributed image-capture module using a mobile phone, (ii) image-processing algorithms that improve quality and readability, and (iii) an image-annotation module that processes the evaluations/feedback as a separate layer.

Further, we extend our work by proposing an approach to detect POS and named-entity tags directly from offline handwritten document images, without explicit character/word recognition. We observed that POS tagging on handwritten text sequences increases the predictability of named entities and also brings a linguistic aspect to handwritten document analysis. As a pre-processing step, the document image is binarized and segmented into word images. The proposed approach, comprising a CNN-LSTM model trained on word-image sequences, produces encouraging results on the challenging IAM dataset (see the sketch below).
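A minimal sketch of a CNN-LSTM tagger of this kind, with illustrative sizes: a small CNN embeds each segmented word image, a bidirectional LSTM reads the word sequence in order, and a linear layer emits a POS (or named-entity) tag per word, with no character-level recognition in between.

```python
import torch
import torch.nn as nn

class WordImageTagger(nn.Module):
    """Tag a sequence of word images in reading order. Layer sizes
    are illustrative assumptions, not the thesis's configuration."""
    def __init__(self, n_tags, emb=128, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, emb))
        self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.tagger = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_imgs):            # (B, T, 1, H, W)
        B, T = word_imgs.shape[:2]
        feats = self.cnn(word_imgs.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(feats)            # (B, T, 2*hidden)
        return self.tagger(out)              # per-word tag logits
```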
Finally, we describe an effective method for automatically evaluating short descriptive handwritten answers from digitized images. Automated evaluation of handwritten answers has been a challenging problem in scaling education systems for many years, and speeding up evaluation remains the major bottleneck for enhancing throughput. Our goal is to assign an evaluation score comparable to human-assigned scores. Our solution is based on the observation that a human evaluator judges the relevance of an answer using a set of keywords and their semantics. Since reliable handwriting recognizers are not yet available, we attempt this problem in the image space, modelling it as a self-supervised, feature-based classification problem that can fine-tune itself for each question without any explicit supervision (a sketch of the keyword-matching idea follows the abstract). We conduct experiments on three different datasets obtained from students; the experiments show that our method performs comparably to human evaluators.

With these works, we attempted to bring state-of-the-art advances in handwritten document analysis and deep learning into scalable applications that can be helpful in the field of education.
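A hedged sketch of keyword-style scoring in the image space: embed the word images of an answer, compare them against embeddings of the expected keywords, and score by the fraction of keywords spotted. The embedding space, the source of the keyword set and the threshold are all assumptions, not the thesis's exact method.

```python
import numpy as np

def score_answer(word_embs, keyword_embs, sim_thresh=0.8):
    """word_embs: (W, D) embeddings of the word images in a student's
    answer; keyword_embs: (K, D) embeddings of the expected keywords
    (e.g. mined per question, since the method is self-supervised).
    Returns the fraction of keywords spotted, in [0, 1], which would
    then be scaled to the question's maximum marks."""
    word_embs = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    keyword_embs = keyword_embs / np.linalg.norm(keyword_embs, axis=1, keepdims=True)
    sims = keyword_embs @ word_embs.T          # cosine similarity matrix
    spotted = sims.max(axis=1) > sim_thresh    # keyword appears somewhere
    return spotted.mean()
```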

       

Year of completion: June 2019
Advisor: C V Jawahar

Related Publications

  • Vijay Rowtula, Subba Reddy Oota and C. V. Jawahar - Towards Automated Evaluation of Handwritten Assessments, The 15th International Conference on Document Analysis and Recognition (ICDAR), 20-25 September 2019, Australia. [PDF]

  • Vijay Rowtula, Praveen Krishnan and C. V. Jawahar - POS Tagging and Named Entity Recognition on Handwritten Documents, ICON, 2018. [PDF]


Downloads

thesis

Analyzing Racket Sports from Broadcast Videos


Anurag Ghosh

Abstract

Sports video data is recorded for nearly every major tournament but remains archived and inaccessible to large-scale data mining and analytics. Sports videos have an inherent temporal structure due to the nature of the sports themselves; tennis, for instance, comprises points, games and sets played between two players or teams. Recent attempts at sports analytics are not fully automatic for finer details, or keep a human in the loop for high-level understanding of the game, and therefore have limited practical application to large-scale data analysis. Many of these applications depend on specialized camera setups and wearable devices, which are costly and unwieldy for players and coaches, especially in resource-constrained environments like India. Utilizing very simple and non-intrusive sensors (like a single camera) along with computer vision models is necessary to build indexing and analytics systems. Such systems can be used to sort through huge swathes of data, help coaches look at interesting video segments quickly, mine player data and even generate automatic reports and insights for a coach to monitor.

Firstly, we demonstrate a score-based indexing approach for broadcast video data. Given a broadcast sports video, we index all the video segments with their scores to create a navigable and searchable match. Although our method is extensible to any sport with scores, we evaluate it on broadcast tennis videos. Our approach temporally segments the rallies in the video and then recognizes the score of each segment, before refining the recognized scores using knowledge of the tennis scoring system (a sketch of this refinement follows the abstract). We finally build an interface to effortlessly retrieve and view the relevant video segments, automatically tagging the segmented rallies with human-accessible tags such as 'fault' and 'deuce'. The efficiency of our approach is demonstrated on broadcast tennis videos from two major tennis tournaments.

Secondly, we propose an end-to-end framework for automatic attribute tagging and analysis of broadcast sports videos. We use commonly available broadcast videos of badminton matches and, unlike previous approaches, do not rely on special camera setups or additional sensors. We propose a method to analyze a large corpus of broadcast videos by segmenting the points played, tracking and recognizing the players in each point, and annotating their respective strokes. Evaluated on 10 Olympic badminton matches with 20 players, we achieve 95.44% point-segmentation accuracy, a 97.38% player-detection score (mAP@0.5), 97.98% player-identification accuracy, and stroke-segmentation edit scores of 80.48%. We further show that the automatically annotated videos alone enable gameplay analysis and inference by computing understandable metrics such as a player's reaction time, speed and footwork around the court.

Lastly, we adapt our proposed framework for tennis to mine spatio-temporal and event data from a large set of broadcast videos, covering all Grand Slam matches played between Roger Federer, Rafael Nadal and Novak Djokovic. Using this data, we demonstrate that we can infer the playing styles and strategies of tennis players. Specifically, we study the evolution of the famous rivalries between Federer, Nadal and Djokovic across time, and we compare and validate our inferences against expert opinions of their playing styles.
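A minimal sketch of refining noisy score reads with knowledge of the tennis scoring system: within a game, the point score can only move along fixed transitions, so a recognized score that is not a legal successor of the previously accepted state can be rejected. The transition table and the rejection policy are illustrative assumptions, not the thesis's exact refinement.

```python
# Legal point-score transitions within a game (server, receiver).
VALID_NEXT = {
    ("0", "0"): [("15", "0"), ("0", "15")],
    ("15", "0"): [("30", "0"), ("15", "15")],
    ("40", "40"): [("Ad", "40"), ("40", "Ad")],
    # ... the remaining transitions of a game fill out the table
}

def refine(scores):
    """Keep each recognized score only if the scoring system allows it
    after the previous accepted score; otherwise treat it as a misread
    and carry the previous state forward."""
    out = [scores[0]]
    for s in scores[1:]:
        prev = out[-1]
        out.append(s if s in VALID_NEXT.get(prev, [s]) else prev)
    return out
```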

       

Year of completion: June 2019
Advisor: C V Jawahar

Related Publications

  • Anurag Ghosh, Suriya Singh and C.V. Jawahar - Towards Structured Analysis of Broadcast Badminton Videos, IEEE Winter Conference on Applications of Computer Vision (WACV 2018), Lake Tahoe, CA, USA, 2018. [PDF]

  • Anurag Ghosh and C. V. Jawahar - SmartTennisTV: Automatic Indexing of Tennis Videos, National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2017. [PDF]

  • Anurag Ghosh, Yash Patel, Mohak Sukhwani and C.V. Jawahar - Dynamic Narratives for Heritage Tour, 3rd Workshop on Computer Vision for Art Analysis (VisART), European Conference on Computer Vision (ECCV), 2016. [PDF]


Downloads

thesis
