
Quo Vadis, Skeleton Action Recognition?


Abstract

In this paper, we study current and upcoming frontiers across the landscape of skeleton-based human action recognition. To begin with, we benchmark state-of-the-art models on the NTU-120 dataset and provide a multi-layered assessment of the results. To examine skeleton action recognition 'in the wild', we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. The results from benchmarking the top performers of NTU-120 on Skeletics-152 reveal the challenges and domain gap induced by actions 'in the wild'. We extend our study to include out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. Finally, as a new frontier for action recognition, we introduce Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and interpretative dance performances. Overall, our work characterizes the strengths and limitations of existing approaches and datasets. It also provides an assessment of top-performing approaches across a spectrum of activity settings and, via the introduced datasets, proposes new frontiers for human action recognition.
 

 

To access the code and paper, click here.

N2NSkip: Learning Highly Sparse Networks using Neuron-to-Neuron Skip Connections


Owing to the overparametrized nature and high memory requirements of classical DNNs, there has been a renewed interest in network sparsification. Is it possible to prune a network at initialization (prior to training) while maintaining rich connectivity, and also ensure faster convergence? We attempt to answer this question by emulating the pattern of neural connections in the brain.

Figure: After a preliminary pruning step, N2NSkip connections are added to the pruned network while maintaining the overall sparsity of the network.
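To make this budget-preserving step concrete, here is a minimal sketch (not the authors' released code; the skip_fraction value and the uniform-random placement of connections are illustrative assumptions) of re-spending part of a pruned layer's active-weight budget as random neuron-to-neuron skip connections:

import torch

def add_n2nskip(seq_mask: torch.Tensor, skip_shape, skip_fraction=0.5):
    """seq_mask: binary mask of a pruned sequential layer.
    Returns a thinner sequential mask plus a sparse skip mask whose combined
    number of active weights equals the original budget."""
    budget = int(seq_mask.sum().item())
    n_skip = int(skip_fraction * budget)

    # Drop n_skip randomly chosen active weights from the sequential mask.
    flat = seq_mask.clone().flatten()
    active = flat.nonzero().squeeze(1)
    flat[active[torch.randperm(active.numel())[:n_skip]]] = 0

    # Re-spend that budget as random skip connections between a pair of
    # non-consecutive layers (skip_shape = (fan_out, fan_in) of that pair),
    # so the network's overall sparsity is unchanged.
    skip = torch.zeros(skip_shape).flatten()
    skip[torch.randperm(skip.numel())[:n_skip]] = 1

    return flat.view_as(seq_mask), skip.view(skip_shape)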


Abstract

The over-parametrized nature of Deep Neural Networks (DNNs) leads to considerable hindrances during deployment on low-end devices with time and space constraints. Network pruning strategies that sparsify DNNs using iterative prune-train schemes are often computationally expensive. As a result, techniques that prune at initialization, prior to training, have become increasingly popular. In this work, we propose neuron-to-neuron skip (N2NSkip) connections, which act as sparse weighted skip connections, to enhance the overall connectivity of pruned DNNs. Following a preliminary pruning step, N2NSkip connections are randomly added between individual neurons/channels of the pruned network, while maintaining the overall sparsity of the network. We demonstrate that introducing N2NSkip connections in pruned networks enables significantly superior performance, especially at high sparsity levels, as compared to pruned networks without N2NSkip connections. Additionally, we present a heat diffusion-based connectivity analysis to quantitatively determine the connectivity of the pruned network with respect to the reference network. We evaluate the efficacy of our approach on two different preliminary pruning methods which prune at initialization, and consistently obtain superior performance by exploiting the enhanced connectivity resulting from N2NSkip connections.

Methods:

In this work, inspired by the pattern of skip connections in the brain, we propose sparse, learnable neuron-to-neuron skip (N2NSkip) connections, which enable faster convergence and superior effective connectivity by improving the overall gradient flow in the pruned network. N2NSkip connections regulate overall gradient flow by learning the relative importance of each gradient signal, which is propagated across non-consecutive layers, thereby enabling efficient training of networks pruned at initialization (prior to training). This is in contrast with conventional skip connections, where gradient signals are merely propagated to previous layers. We explore the robustness and generalizability of N2NSkip connections to different preliminary pruning methods and consistently achieve superior test accuracy and higher overall connectivity. Additionally, our work also explores the concept of connectivity in deep neural networks through the lens of heat diffusion in undirected acyclic graphs. We propose to quantitatively measure and compare the relative connectivity of pruned networks with respect to the reference network by computing the Frobenius norm of their heat diffusion signatures at saturation.
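As a rough illustration of this connectivity measure (our reconstruction from the description above, not the paper's exact implementation; the fixed diffusion time standing in for "saturation" is an assumption): treat the weight mask as an undirected graph, diffuse heat through its Laplacian, and compare the signatures of the pruned and reference networks via the Frobenius norm.

import numpy as np
from scipy.linalg import expm

def heat_signature(adj: np.ndarray, t: float = 10.0) -> np.ndarray:
    lap = np.diag(adj.sum(axis=1)) - adj   # graph Laplacian L = D - A
    return expm(-t * lap)                  # heat kernel H_t = e^{-tL}

def relative_connectivity(adj_pruned, adj_ref, t=10.0):
    # Frobenius-norm comparison of diffusion signatures at (near-)saturation.
    h_p, h_r = heat_signature(adj_pruned, t), heat_signature(adj_ref, t)
    return np.linalg.norm(h_p, "fro") / np.linalg.norm(h_r, "fro")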



Figure: As opposed to conventional skip connections, N2NSkip connections introduce skip connections between non-consecutive layers of the network, and are parametrized by sparse learnable weights.

Contributions:

  • We propose N2NSkip connections which significantly improve the effective connectivity and test performance of sparse networks across different datasets and network architectures. Notably, we demonstrate the generalizability of N2NSkip connections to different preliminary pruning methods and consistently obtain superior test performance and enhanced overall connectivity.
  • We propose a heat diffusion-based connectivity measure to compare the overall connectivity of pruned networks with respect to the reference network. To the best of our knowledge, this is the first attempt at modeling connectivity in DNNs through the principle of heat diffusion.
  • We empirically demonstrate that N2NSkip connections significantly lower performance degradation as compared to conventional skip connections, resulting in consistently superior test performance at high compression ratios.

Visualizing Adjacency Matrices

Considering each network as an undirected graph, we construct an n × n adjacency matrix, where n is the total number of neurons in the MLP. To verify the enhanced connectivity resulting from N2NSkip connections, we compare the heat diffusion signature of the pruned adjacency matrices with the heat diffusion signature of the reference network.

Figure: Binary adjacency matrices for (a) Reference network (MLP) (b) Pruned network at a compression of 5x (randomized pruning) (c) N2NSkip network at a compression of 5x (10% N2NSkip connections + 10% sequential connections).
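A short sketch of assembling such an adjacency matrix from per-layer binary masks, under the assumption that each mask has shape (n_out, n_in); a skip connection's mask can be written into its corresponding off-diagonal block the same way:

import numpy as np

def mlp_adjacency(layer_masks):
    """layer_masks: list of binary arrays, each of shape (n_out, n_in)."""
    sizes = [layer_masks[0].shape[1]] + [m.shape[0] for m in layer_masks]
    offsets = np.cumsum([0] + sizes)
    n = offsets[-1]                         # total number of neurons in the MLP
    adj = np.zeros((n, n))
    for k, mask in enumerate(layer_masks):
        # Block row = neurons of layer k+1, block column = neurons of layer k.
        adj[offsets[k + 1]:offsets[k + 2], offsets[k]:offsets[k + 1]] = mask
    return np.maximum(adj, adj.T)           # symmetrize: undirected graph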

Experimental Results:


Table: Test accuracy of pruned ResNet50 and VGG19 on CIFAR-10 and CIFAR-100, with either RP or CSP as the preliminary pruning step. The addition of N2NSkip connections leads to a significant increase in test accuracy. Additionally, there is a larger increase in accuracy at network densities of 5% and 2% than at 10%. This observation is consistent for both N2NSkip-RP and N2NSkip-CSP, which indicates that N2NSkip connections can be used as a powerful tool to enhance the performance of pruned networks at high compression rates.

Related Publication:

  • Arvind Subramaniam, Avinash Sharma - N2NSkip: Learning Highly Sparse Networks using Neuron-to-Neuron Skip Connections, British Machine Vision Conference (BMVC 2020).

An OCR for Classical Indic Documents Containing Arbitrarily Long Words


Abstract

OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. It involves complexities such as image degradation, lack of datasets and long-length words. Due to these challenges, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for application of OCRs on large corpora of classic Sanskrit texts containing arbitrarily long and highly conjoined words.
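For reference, the two reported metrics can be computed with the standard Levenshtein edit distance; a self-contained sketch (not tied to the paper's evaluation code) follows:

def edit_distance(a, b):
    # Single-row Levenshtein DP; works on strings (chars) or lists (words).
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def cer(pred: str, truth: str) -> float:
    # Character error rate: edit distance over characters.
    return edit_distance(pred, truth) / max(len(truth), 1)

def wer(pred: str, truth: str) -> float:
    # Word error rate: edit distance over word tokens.
    return edit_distance(pred.split(), truth.split()) / max(len(truth.split()), 1)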

To access the code and paper, click here.

Bibtex

If you find our work useful in your research, please consider citing:

 
@InProceedings{Dwivedi_2020_CVPR_Workshops,
  author = {Dwivedi, Agam and Saluja, Rohit and Kiran Sarvadevabhatla, Ravi},
  title = {An OCR for Classical Indic Documents Containing Arbitrarily Long Words},
  booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month = {June},
  year = {2020}
}

Topological Mapping for Manhattan-like Repetitive Environments


Abstract

We showcase a topological mapping framework for a challenging indoor warehouse setting. At the most abstract level, the warehouse is represented as a Topological Graph where the nodes of the graph represent a particular warehouse topological construct (e.g. rackspace, corridor) and the edges denote the existence of a path between two neighbouring nodes or topologies. At the intermediate level, the map is represented as a Manhattan Graph where the nodes and edges are characterized by Manhattan properties, and as a Pose Graph at the lower-most level of detail. The topological constructs are learned via a Deep Convolutional Network while the relational properties between topological instances are learnt via a Siamese-style Neural Network. In the paper, we show that maintaining abstractions such as the Topological Graph and Manhattan Graph helps in recovering an accurate Pose Graph starting from a highly erroneous and unoptimized Pose Graph. We show how this is achieved by embedding topological and Manhattan relations as well as Manhattan Graph aided loop closure relations as constraints in the backend Pose Graph optimization framework. The recovery of a near ground-truth Pose Graph on real-world indoor warehouse scenes vindicates the efficacy of the proposed framework.

 


Qualitative Results:

1) RTABMAP SLAM
Fig. a shows the registered map generated by RTABMAP SLAM. Fig. b shows the RTABMAP trajectory with topological labels. Fig. c compares the RTABMAP trajectory with the ground-truth trajectory. Fig. d compares the trajectory generated using our topological SLAM pipeline with ground truth.
2) RTABMAP as a visual odometry pipeline
Fig. a shows the trajectory obtained using RTABMAP with loop closure turned off; wheel odometry is used as the odometry source. Fig. b compares the RTABMAP trajectory with ground truth. Fig. c compares the trajectory obtained using our topological SLAM pipeline with ground truth.

Code:

Our pipeline consists of three parts; each sub-folder in this repo contains the code for one of them:

  • Topological categorization using a convolutional neural network classifier -> Topological Classifier
  • Predicting loop closure constraints using Multi-Layer Perceptron -> Instance Comparator
  • Graph construction and pose graph optimization using obtained Manhattan and Loop Closure Constraints -> Pose Graph Optimizer

How to use each part is explained in the corresponding sub-folder.
Please see the GitHub Project Page.
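As a toy illustration of the third step, the sketch below (ours, not the repo's optimizer) shows how relative-pose constraints, such as those produced by odometry and by loop-closure detection, are turned into a nonlinear least-squares pose graph problem. The specific poses, constraints, and initial guess are made up for the example:

import numpy as np
from scipy.optimize import least_squares

def wrap(a):
    # Wrap an angle to (-pi, pi].
    return np.arctan2(np.sin(a), np.cos(a))

def residuals(flat, constraints):
    # Poses are (x, y, theta); each constraint (i, j, (dx, dy, dth)) is a
    # relative pose of j measured in pose i's frame.
    poses = flat.reshape(-1, 3)
    res = []
    for i, j, (dx, dy, dth) in constraints:
        xi, yi, thi = poses[i]
        xj, yj, thj = poses[j]
        c, s = np.cos(thi), np.sin(thi)
        # Predicted relative motion of j as seen from i, minus the measurement.
        res += [c * (xj - xi) + s * (yj - yi) - dx,
                -s * (xj - xi) + c * (yj - yi) - dy,
                wrap(thj - thi - dth)]
    res += list(poses[0])            # anchor pose 0 at the origin
    return np.array(res)

constraints = [
    (0, 1, (1.0, 0.0, 0.0)),         # odometry edge
    (1, 2, (1.0, 0.0, np.pi / 2)),   # odometry edge with a Manhattan turn
    (0, 2, (2.0, 0.0, np.pi / 2)),   # loop-closure edge
]
init = np.zeros(9)
init[3:] = [1.2, 0.3, 0.2, 2.4, -0.4, 1.9]   # erroneous initial guess
sol = least_squares(residuals, init, args=(constraints,))
print(sol.x.reshape(-1, 3))          # recovered near ground-truth poses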


Bibtex

If you find our work useful in your research, please consider citing:

@inproceedings{puligilla2020topo,
  author = {Puligilla, Sai Shubodh and Tourani, Satyajit and Vaidya, Tushar and Singh Parihar, Udit and Sarvadevabhatla, Ravi Kiran and Krishna, Madhava},
  title = {Topological Mapping for Manhattan-Like Repetitive Environments},
  booktitle = {ICRA},
  year = {2020}
}

A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild


Prajwal Renukanand*   Rudrabha Mukhopadhyay*   Vinay Namboodiri   C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

[Code]   [Interactive Demo]   [Demo Video]   [ReSyncED]


We propose a novel approach that achieves significantly more accurate lip-synchronization (A) in dynamic, unconstrained talking face videos. In contrast, the corresponding lip shapes generated by the current best model (B) are out-of-sync with the spoken utterances (shown at the bottom).

Abstract

In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or on videos of specific people seen during the training phase. However, they fail to accurately morph the actual lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the newly chosen audio. We identify key reasons pertaining to this and hence resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics to specifically measure the accuracy of lip synchronization in unconstrained videos. Extensive quantitative and human evaluations on our challenging benchmarks show that the lip-sync accuracy of the videos generated using our Wav2Lip model is almost as good as real synced videos. We clearly demonstrate the substantial impact of our Wav2Lip model in our publicly available demo video. We also open-source our code, models, and evaluation benchmarks to promote future research efforts in this space.


Paper


    Prajwal Renukanand*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild, ACM Multimedia, 2020.
    [PDF] | [BibTeX]

    @misc{prajwal2020lip,
      title = {A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild},
      author = {K R Prajwal and Rudrabha Mukhopadhyay and Vinay Namboodiri and C V Jawahar},
      year = {2020},
      eprint = {2008.10010},
      archivePrefix = {arXiv},
      primaryClass = {cs.CV}
    }

Live Demo

Please click here for the live demo: https://www.youtube.com/embed/0fXaDCZNOJc


Architecture


Architecture for generating accurate lip movements from speech

Our approach generates accurate lip-sync by learning from an "already well-trained lip-sync expert". Unlike previous works that employ only a reconstruction loss or train a discriminator in a GAN setup, we use a pre-trained discriminator that is already quite accurate at detecting lip-sync errors. We show that fine-tuning it further on the noisy generated faces hampers the discriminator's ability to measure lip-sync, thus also affecting the generated lip shapes.
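A hedged sketch of this training signal (the sync_expert module and its (frames, mel) -> embeddings interface are placeholders, not the authors' actual classes): the expert is kept frozen, so it is never fine-tuned on noisy generated faces, while gradients still flow through it into the generator.

import torch
import torch.nn.functional as F

def expert_sync_loss(sync_expert, gen_frames, mel):
    # Embeddings for a window of generated face crops and the matching
    # mel-spectrogram chunk; cosine similarity acts as a sync probability.
    v_emb, a_emb = sync_expert(gen_frames, mel)
    sim = F.cosine_similarity(v_emb, a_emb).clamp(min=1e-7)
    return -torch.log(sim).mean()    # push generated frames toward sync

# Freeze the expert once, before generator training; gradients flow through
# its forward pass to the generator, but its own weights never update:
#   for p in sync_expert.parameters():
#       p.requires_grad = False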


Ethical Use

To ensure fair use, we strongly require that any result created using our algorithm must unambiguously present itself as synthetic and state that it was generated using the Wav2Lip model. In addition to the strong positive applications of this work, our intention in completely open-sourcing it is to simultaneously encourage efforts in detecting manipulated video content and its misuse. We believe that Wav2Lip can enable several positive applications and also encourage productive discussions and research efforts regarding the fair use of synthetic content.


Contact

  1. Prajwal K R
  2. Rudrabha Mukhopadhyay
