
PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition

Pose-based action recognition is predominantly tackled by approaches which treat the input skeleton in a monolithic fashion, i.e. joints in the pose tree are processed as a whole. However, such approaches ignore the fact that action categories are often characterized by localized action dynamics involving only small subsets of joint groups, such as the hands (e.g. ‘Thumbs up’) or legs (e.g. ‘Kicking’). Although part-grouping based approaches exist, each part group is not considered within the global pose frame, causing such methods to fall short. Further, conventional approaches employ independent modality streams (e.g. joint, bone, joint velocity, bone velocity) and train their network multiple times on these streams, which greatly increases the number of training parameters.

 PSUMNet teaser image

The plot on the left shows accuracy against the number of parameters for our proposed architecture PSUMNet (⋆) and existing approaches on the large-scale NTURGB+D 120 human actions dataset (cross subject).

 PSUMNet pipeline diagram2

Comparison between conventional training procedure used in most of the previous approaches (left) and our approach (right).

 

 

To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose a global frame based part stream approach as opposed to conventional modality based streams. Within each part stream, the associated data from multiple modalities is unified and consumed by the processing pipeline. Experimentally, PSUMNet achieves state-of-the-art performance on the widely used NTURGB+D 60/120 dataset and the dense joint skeleton dataset NTU 60-X/120-X. PSUMNet is highly efficient and outperforms competing methods which use 100%-400% more parameters. PSUMNet also generalizes to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet’s scalability, performance and efficiency make it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices.

Supplementary          PDF     GitHub

 PSUMNet architecture diagram 2

  1. Overall architecture of one stream of the proposed network. The input skeleton is passed through the Multi-Modality Data Generator (MMDG), which generates joint, bone, joint velocity and bone velocity data from the input and concatenates the modalities along the channel dimension, as shown in (b).
  2. This multi-modal data is processed by the Spatio-Temporal Relational Module (STRM), followed by global average pooling and a fully connected (FC) layer.
  3. Spatio-Temporal Relational Block (STRB), where the input data is passed through the Spatial Attention Map Generator (SAMG) for spatial relation modeling, followed by the Temporal Relational Module. As shown in (a), multiple STRBs stacked together make up the STRM.
  4. Spatial Attention Map Generator (SAMG), which dynamically models an adjacency matrix (A_hyb) to capture spatial relations between joints. The predefined adjacency matrix (A) is used for regularization.
  5. Temporal Relational Module (TRM), which consists of multiple temporal convolution blocks in parallel. The outputs of the temporal convolution blocks are concatenated to generate the final features.
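
The data generation step described for the MMDG can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `multi_modality_data`, the 5-joint `BONE_PAIRS` chain, and the `(C, T, V)` array layout are assumptions chosen for illustration, and the bone and velocity definitions follow the common convention of joint differences and frame differences.

```python
import numpy as np

# Hypothetical 5-joint chain; each pair is (child, parent). Joint 0 is the
# root and is paired with itself, so its bone vector is zero. Real skeletons
# define their own pose tree.
BONE_PAIRS = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3)]


def multi_modality_data(joints: np.ndarray) -> np.ndarray:
    """Build joint, bone, joint-velocity and bone-velocity data.

    joints: float array of shape (C, T, V) = (coordinates, frames, joints).
    Returns an array of shape (4 * C, T, V) with the four modalities
    concatenated along the channel dimension.
    """
    # Bone = child joint position minus parent joint position.
    bones = np.zeros_like(joints)
    for child, parent in BONE_PAIRS:
        bones[:, :, child] = joints[:, :, child] - joints[:, :, parent]

    def velocity(x: np.ndarray) -> np.ndarray:
        # Frame-to-frame difference; the last frame keeps zero velocity.
        v = np.zeros_like(x)
        v[:, :-1] = x[:, 1:] - x[:, :-1]
        return v

    return np.concatenate(
        [joints, bones, velocity(joints), velocity(bones)], axis=0
    )
```

For a 3-D skeleton with 10 frames and 5 joints, `multi_modality_data(np.random.rand(3, 10, 5))` returns an array of shape `(12, 10, 5)`, so a single stream sees all four modalities at once instead of one network being trained per modality.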

 

DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games

Nikhil Bansal        Kartik Gupta        Kiruthika Kannan        Sivani Pentapati        Ravi Kiran Sarvadevabhatla       

ACMMM 2022

PAPER        DATASET     CODE


What is atypical sketch content and why do we need to detect them?

Pictionary, the popular sketch-based game, forbids the Drawer from writing text (atypical content) on the canvas. Manual intervention for such rule violations is impractical and does not scale in the web-based online setting of this game, which involves a large number of concurrent sessions. Apart from malicious game play, atypical sketch content can also exist in non-malicious, benign scenarios. For instance, the Drawer may choose to draw arrows and other such icons to attract the Guesser’s attention and provide indirect hints regarding the target word. Accurately localizing such activities can aid statistical learning approaches which associate sketch-based representations with corresponding target words.
 

AtyPict: the first-ever dataset of atypical whiteboard content

The categories of atypical content usually encountered in Pictionary sessions are:

  • Text: Drawer directly writes the target word or hints related to the target word on the canvas.
  • Numerical: Drawer writes numbers on canvas.
  • Circles: Drawers often circle a portion of the canvas to emphasize relevant or important content.
  • Iconic: Other items used for emphasizing content and abstract compositional structures include question marks, arrows and other miscellaneous structures (e.g. double-headed arrow, tick marks, addition symbol, cross), as well as striking out the sketch (which usually implies negation of the sketched item).

Multiclass Samples

Examples of atypical content detection. False negatives are shown as dashed rectangles and false positives as dotted rectangles. Color codes are: text, numbers, question marks, arrows, circles and other icons (e.g. tick marks, addition symbol).

pictdraw

 

Screenshots of our data collection tool showing Drawer (left) and Guesser (right) activity during a Pictionary game. In this case, the Drawer has violated the game rules by writing text ('Spiderm') on the canvas. An automatic alert notifying the player (see top left of screenshot) and identifying the text location (red box on canvas) is generated by our system DrawMon.

 

 

 

CanvasDash: an intuitive dashboard UI for annotation and visualization

labelling tool

An illustration of annotation using our CanvasDash interface.

atypict stats2

 

The distribution of atypical content categories shows significant imbalance, with the category 'Individual letters' occurring more often than others.

DrawMon: a distributed system for sketch content-based alert generation

DrawMon is a distributed alert generation system (see figure below). Each game session is managed by a central Session Manager which assigns a unique session id.

 

  • For a given session, whenever a sketch stroke is drawn, the accumulated canvas content (i.e. strokes rendered so far) is tagged with session id and relayed to a shared Session Canvas Queue.
  • For efficiency, the canvas content is represented as a lightweight Scalable Vector Graphic (SVG) object. The contents of the Session Canvas Queue are dequeued and rendered into corresponding 512×512 binary images by Distributed Rendering Module in a distributed and parallel fashion.
  • The rendered binary images tagged with session id are placed in the Rendered Image Queue. The contents of Rendered Image Queue are dequeued and processed by Distributed Detection Module. Each Detection module consists of our custom-designed deep neural network CanvasNet.
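
The two-queue flow described above can be sketched on a single machine with Python's standard `queue` and `threading` modules. This is only an illustration under stated assumptions: threads stand in for the distributed rendering and detection workers, and `render_svg` and `detect` are hypothetical placeholders for the SVG rasterizer and CanvasNet, not part of the DrawMon code base.

```python
import queue
import threading

SENTINEL = None  # tells a worker that no more items will arrive


def render_svg(svg: str) -> str:
    # Placeholder for rasterizing an SVG into a 512x512 binary image.
    return f"image<{svg}>"


def detect(image: str) -> list:
    # Placeholder for CanvasNet: flags canvases whose content mentions text.
    return ["text"] if "text" in image else []


def rendering_worker(canvas_q: queue.Queue, image_q: queue.Queue) -> None:
    # Dequeue (session id, SVG) pairs and enqueue (session id, image) pairs.
    while True:
        item = canvas_q.get()
        if item is SENTINEL:
            image_q.put(SENTINEL)  # propagate shutdown downstream
            break
        session_id, svg = item
        image_q.put((session_id, render_svg(svg)))


def detection_worker(image_q: queue.Queue, alerts: list) -> None:
    # Dequeue rendered images and record one alert per detected category.
    while True:
        item = image_q.get()
        if item is SENTINEL:
            break
        session_id, image = item
        for category in detect(image):
            alerts.append((session_id, category))


def run_pipeline(canvases: list) -> list:
    canvas_q, image_q, alerts = queue.Queue(), queue.Queue(), []
    workers = [
        threading.Thread(target=rendering_worker, args=(canvas_q, image_q)),
        threading.Thread(target=detection_worker, args=(image_q, alerts)),
    ]
    for worker in workers:
        worker.start()
    for item in canvases:
        canvas_q.put(item)
    canvas_q.put(SENTINEL)
    for worker in workers:
        worker.join()
    return alerts
```

Calling `run_pipeline([("s1", "<svg>circle</svg>"), ("s2", "<svg>text</svg>")])` yields a single alert tagged with session `s2`. Tagging every item with its session id is what lets many concurrent sessions share the same queues.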

PictGuess architecture

CanvasNet: a model for detecting atypical sketch instances

CanvasNet processes the rendered image as input and outputs a list of atypical activities (if any) along with associated meta-information (atypical content category, 2-D spatial location).
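
One way to picture the output described above is as a list of small records, one per detected instance. The sketch below is an assumption for illustration only: the `AtypicalInstance` type, its field names, the category strings, and the `rule_violations` helper are hypothetical and do not reflect CanvasNet's actual output format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AtypicalInstance:
    category: str                    # e.g. "text", "numerical", "circle", "iconic"
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


def rule_violations(instances: List[AtypicalInstance]) -> List[AtypicalInstance]:
    # Assumption: only written text counts as a rule violation here, since
    # circles and icons can also appear in benign game play.
    return [inst for inst in instances if inst.category == "text"]
```

Keeping the category and the 2-D location together in one record means an alert can both name the violation and highlight where it occurs on the canvas.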

CanvasNet

DrawMon in Action

 

 

Paper

  • PDF: Paper
  • arXiv: Coming soon!
  • ACMMM-2022: Coming soon!

Code

The code for this work is available on GitHub!
Link: pictionary-cvit/drawmon

Acknowledgements

We wish to acknowledge a grant from the KCIS - TCS Foundation.

 

Bibtex

Please consider citing the following works if you make use of our work:

@InProceedings{DrawMonACMMM2022,
author="Bansal, Nikhil
and Gupta, Kartik
and Kannan, Kiruthika
and Pentapati, Sivani
and Sarvadevabhatla, Ravi Kiran",
title="DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games",
booktitle = "ACM conference on Multimedia (ACMMM)",
year="2022"
}

Compressing Video Calls using Synthetic Talking Heads


Madhav Agarwal, Anchit Gupta, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

BMVC, 2022

[ Interactive Demo ] | [ Demo Video ]

architecture final

We depict the entire pipeline used for compressing talking head videos. In our pipeline, we detect and send key points of alternate frames over the network and regenerate the talking heads at the receiver’s end. We then use frame interpolation to generate the rest of the frames and use super-resolution to generate high-resolution outputs.

Abstract

We leverage modern advancements in talking head generation to propose an end-to-end system for talking head video compression. Our algorithm transmits pivot frames intermittently while the rest of the talking head video is generated by animating them. We use a state-of-the-art face reenactment network to detect key points in the non-pivot frames and transmit them to the receiver. A dense flow is then calculated to warp a pivot frame to reconstruct the non-pivot ones. Transmitting key points instead of full frames leads to significant compression. We propose a novel algorithm to adaptively select the best-suited pivot frames at regular intervals to provide a smooth experience. We also propose a frame interpolator at the receiver’s end to improve the compression levels further. Finally, a face enhancement network improves reconstruction quality, significantly improving several aspects like the sharpness of the generations. We evaluate our method both qualitatively and quantitatively on benchmark datasets and compare it with multiple compression techniques. A demo video is attached in the supplementary material, providing qualitative results.
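
The transmission scheme can be pictured as a per-frame schedule on the sender's side. The sketch below is a simplification under stated assumptions: it places a pivot at a fixed, hypothetical `pivot_interval` (assumed even), whereas the paper selects pivot frames adaptively, and the function name and labels are illustrative only.

```python
def transmission_plan(num_frames: int, pivot_interval: int) -> list:
    """Label each frame with what the sender transmits for it.

    "pivot"     : the full frame is sent.
    "keypoints" : only key points are sent; the receiver warps a pivot frame.
    "skip"      : nothing is sent; the receiver interpolates the frame.
    """
    plan = []
    for i in range(num_frames):
        offset = i % pivot_interval
        if offset == 0:
            plan.append("pivot")
        elif offset % 2 == 0:
            plan.append("keypoints")  # key points for alternate frames
        else:
            plan.append("skip")       # filled in by frame interpolation
    return plan
```

With `transmission_plan(7, 6)`, frames 0 and 6 are pivots, frames 2 and 4 carry key points, and frames 1, 3 and 5 are left to interpolation, so only two of seven frames are sent in full.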

Paper

  • Paper
    Compressing Video Calls using Synthetic Talking Heads

    Madhav Agarwal, Anchit Gupta, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar
    The 33rd British Machine Vision Conference, BMVC, 2022.
    [PDF ] | [BibTeX]

    @inproceedings{compressing2022bmvc,
    title={Compressing Video Calls using Synthetic Talking Heads},
    author={Agarwal, Madhav and Gupta, Anchit and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P and Jawahar, CV},
    booktitle={British Machine Vision Conference (BMVC)},
    year={2022} }

Demo



Contact

  • Madhav Agarwal
  • Rudrabha Mukhopadhyay

Audio-Visual Face Reenactment


Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

WACV, 2023

[ Code ]   | [ Interactive Demo ] | [ Demo Video ]

architecture final

The overall pipeline of our proposed Audio-Visual Face Reenactment network (AVFR-GAN) is shown in this figure. We take source and driving images along with their face mesh and segmentation mask to extract keypoints. An audio encoder extracts features from the driving audio and uses them to provide attention on the lip region. The audio and visual feature maps are warped together and passed to the carefully designed Identity-Aware Generator, along with extracted features of the source image, to generate the final output.

Abstract

This work proposes a novel method to generate realistic talking head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense motion field generated using learnable keypoints. We improve the quality of lip sync using audio as an additional input, helping the network to attend to the mouth region. We use additional priors using face segmentation and face mesh to improve the structure of the reconstructed faces. Finally, we improve the visual quality of the generations by incorporating a carefully designed identity-aware generator module. The identity-aware generator takes the source image and the warped motion features as input to generate a high-quality output with fine-grained details. Our method produces state-of-the-art results and generalizes well to unseen faces, languages, and voices. We comprehensively evaluate our approach using multiple metrics and outperform current techniques both qualitatively and quantitatively. Our work opens up several applications, including enabling low bandwidth video calls.

Paper

  • Paper
    Audio-Visual Face Reenactment

    Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar
    IEEE/CVF Winter Conference on Applications of Computer Vision, WACV, 2023.
    [PDF ] | [BibTeX]

    @InProceedings{Agarwal_2023_WACV,
    author = {Agarwal, Madhav and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C. V.},
    title = {Audio-Visual Face Reenactment},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month = {January},
    year = {2023},
    pages = {5178-5187}
    }

Demo



Contact

  1. Madhav Agarwal
  2. Rudrabha Mukhopadhyay

My View is the Best View: Procedure Learning from Egocentric Videos


Siddhant Bansal, Chetan Arora and C.V. Jawahar

ECCV 2022

PAPER       DATASET      CODE

What is Procedure Learning?

Given multiple videos of a task, the goal is to identify the key-steps and their order to perform the task.

 procedure learning

Provided multiple videos of making a pizza, the goal is to identify the steps required to prepare the pizza and their order.

 

EgoProceL Dataset


EgoProceL is a large-scale dataset for procedure learning. It consists of 62 hours of egocentric videos recorded by 130 subjects performing 16 tasks for procedure learning. EgoProceL contains videos and key-step annotations for multiple tasks from CMU-MMAC, EGTEA Gaze+, and individual tasks like toy-bike assembly, tent assembly, PC assembly, and PC disassembly.

Why an egocentric dataset for Procedure Learning?

Using third-person videos for procedure learning makes the manipulated object small in appearance and often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action.
ECCV diagrams first person vs third person v1
Existing datasets for procedure learning consist mostly of third-person videos. Third-person videos suffer from issues like occlusion and atypical camera locations that make them ill-suited for procedure learning. Additionally, the datasets rely on videos from YouTube that are noisy. In contrast, we propose to use egocentric videos that overcome the issues posed by third-person videos. Third-person frames in the figure are from ProceL and CrossTask and the first-person frames are from EgoProceL.

Overview of EgoProceL

EgoProceL consists of

  • 62 hours of videos captured by
  • 130 subjects
  • performing 16 tasks
  • maximum of 17 key-steps
  • average 0.38 foreground ratio
  • average 0.12 missing steps ratio
  • average 0.49 repeated steps ratio

Downloads

We recommend referring to the README before downloading the videos. Mirror link.

Videos

Link: pc-assembly

Link: pc-disassembly

Annotations

Link: Google Drive

CnC framework for Procedure Learning

We present a novel self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. Our experiments show that CnC outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets by 5.2% and 6.3%, respectively.

ECCV diagrams Methodology v0 5

CnC takes in multiple videos from the same task and passes them through the embedder network trained using the proposed TC3I loss. The goal of the embedder network is to learn similar embeddings for corresponding key-steps from multiple videos and for temporally close frames. The ProCut Module (PCM) localizes the key-steps required for performing the task. PCM converts the clustering problem to a multi-label graph cut problem. The output provides the assignment of frames to the respective key-steps and their ordering.
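
The final stage, assigning frames to key-steps and ordering them, can be illustrated with a small NumPy sketch. It is not the CnC method: plain k-means is swapped in for the ProCut Module's multi-label graph cut, the per-frame embeddings are assumed to be given, a single video is used for brevity, and the helper names `kmeans` and `order_key_steps` are hypothetical.

```python
import numpy as np


def kmeans(x: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    """Plain k-means over frame embeddings; a stand-in for graph cut.

    x: float array of shape (num_frames, dim). Returns one label per frame.
    Centers are initialised from evenly spaced frames for determinism.
    """
    centers = x[np.linspace(0, len(x) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Squared distance from every frame to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def order_key_steps(labels: np.ndarray, k: int) -> list:
    """Order clusters by the mean normalised time of their frames."""
    times = np.arange(len(labels)) / max(len(labels) - 1, 1)
    mean_time = [
        times[labels == j].mean() if np.any(labels == j) else np.inf
        for j in range(k)
    ]
    return [int(j) for j in np.argsort(mean_time)]
```

The labels give the assignment of frames to key-steps, and sorting clusters by their average position in the video gives a simple estimate of the key-step order.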

Paper

  • PDF: Paper; Supplementary
  • arXiv: Paper; Abstract
  • ECCV: Coming soon!

Code

The code for this work is available on GitHub!

Link: Sid2697/EgoProceL-egocentric-procedure-learning

Acknowledgements

This work was supported in part by the Department of Science and Technology, Government of India, under DST/ICPS/Data-Science project ID T-138. A portion of the data used in this paper was obtained from kitchen.cs.cmu.edu and the data collection was funded in part by the National Science Foundation under Grant No. EEEC-0540865. We acknowledge Pravin Nagar and Sagar Verma for recording and sharing the PC Assembly and Disassembly videos at IIIT Delhi. We also acknowledge Jehlum Vitasta Pandit and Astha Bansal for their help with annotating a portion of EgoProceL.

 

Please consider citing if you make use of the EgoProceL dataset and/or the corresponding code:

 
@InProceedings{EgoProceLECCV2022,
author="Bansal, Siddhant
and Arora, Chetan
and Jawahar, C.V.",
title="My View is the Best View: Procedure Learning from Egocentric Videos",
booktitle = "European Conference on Computer Vision (ECCV)", 
year="2022"
}

@InProceedings{CMU_Kitchens,
author = "De La Torre, F. and Hodgins, J. and Bargteil, A. and Martin, X. and Macey, J. and Collado, A. and Beltran, P.",
title = "Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database.",
booktitle = "Robotics Institute",
year = "2008"
}

@InProceedings{egtea_gaze_p,
author = "Li, Yin and Liu, Miao and Rehg, James M.",
title =  "In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video",
booktitle = "European Conference on Computer Vision (ECCV)",
year = "2018"
}

@InProceedings{meccano,
    author    = "Ragusa, Francesco and Furnari, Antonino and Livatino, Salvatore and Farinella, Giovanni Maria",
    title     = "The MECCANO Dataset: Understanding Human-Object Interactions From Egocentric Videos in an Industrial-Like Domain",
    booktitle = "Winter Conference on Applications of Computer Vision (WACV)",
    year      = "2021"
}

@InProceedings{tent,
author = "Jang, Youngkyoon and Sullivan, Brian and Ludwig, Casimir and Gilchrist, Iain and Damen, Dima and Mayol-Cuevas, Walterio",
title = "EPIC-Tent: An Egocentric Video Dataset for Camping Tent Assembly",
booktitle = "International Conference on Computer Vision (ICCV) Workshops",
year = "2019"
}
