
Unsupervised Audio-Visual Lecture Segmentation

Darshan Singh S*, Anchit Gupta*, C.V. Jawahar and Makarand Tapaswi

 

CVIT,   IIIT Hyderabad

WACV, 2023

[ Code ]   | [Dataset ] | [ arXiv ] | [ Demo Video ]

 


We address the task of lecture segmentation in an unsupervised manner. We show an example of a lecture segmented using our method. Our method predicts segments close to the ground truth. Note that our method does not predict the segment labels; they are shown only so that the reader can appreciate the different topics.

Abstract

Over the last decade, online lecture videos have become increasingly popular and have experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing online lectures are lacking. Our first contribution is to facilitate research in the educational domain by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation that splits lectures into bite-sized topics. Lecture clip representations leverage visual, textual, and OCR cues and are trained on a pretext self-supervised task of matching the narration with the temporally aligned visual content. We formulate lecture segmentation as an unsupervised task and use these representations to generate segments using a temporally consistent 1-nearest neighbor algorithm, TW-FINCH. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.
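To make the clustering step more concrete, here is a minimal sketch of one round of temporally weighted first-neighbor grouping, in the spirit of TW-FINCH. It is not the authors' implementation. The function name `first_neighbor_partition`, the use of cosine distance, and the way feature distance is multiplied by normalized temporal distance are assumptions made for illustration, and the further merging needed to reach a target number of segments is omitted.

```python
import numpy as np


def first_neighbor_partition(features, timestamps):
    """One round of temporally weighted 1-nearest-neighbor grouping.

    features:   (N, D) array, one embedding per lecture clip (assumed input).
    timestamps: (N,) array of clip times, in any consistent unit.
    Returns an (N,) array of integer group labels.
    """
    features = np.asarray(features, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    n = len(features)

    # Cosine distance between all pairs of clips.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    feature_dist = 1.0 - unit @ unit.T

    # Temporal distance, normalized to [0, 1] (illustrative weighting).
    span = max(timestamps.max() - timestamps.min(), 1e-12)
    time_dist = np.abs(timestamps[:, None] - timestamps[None, :]) / span

    # Clips that are both similar and close in time get a small distance.
    dist = feature_dist * time_dist
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    # Link every clip to its first neighbor; connected components are groups.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in enumerate(nearest):
        parent[find(i)] = find(int(j))

    # Relabel components in order of first appearance.
    label_of = {}
    labels = [label_of.setdefault(find(i), len(label_of)) for i in range(n)]
    return np.array(labels)
```

With six clips whose embeddings change direction halfway through the lecture, this sketch places the first three clips in one group and the last three in another.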

Paper

  • Paper
    Unsupervised Audio-Visual Lecture Segmentation

    Darshan Singh S, Anchit Gupta, C.V. Jawahar and Makarand Tapaswi
    Unsupervised Audio-Visual Lecture Segmentation, WACV, 2023.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Contact

  1. Darshan Singh S
  2. Anchit Gupta

PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition

Pose-based action recognition is predominantly tackled by approaches which treat the input skeleton in a monolithic fashion, i.e. joints in the pose tree are processed as a whole. However, such approaches ignore the fact that action categories are often characterized by localized action dynamics involving only small subsets of part joint groups, such as hands (e.g. ‘Thumbs up’) or legs (e.g. ‘Kicking’). Although part-grouping based approaches exist, each part group is not considered within the global pose frame, causing such methods to fall short. Further, conventional approaches employ independent modality streams (e.g. joint, bone, joint velocity, bone velocity) and train their network multiple times on these streams, which massively increases the number of training parameters.

 PSUMNet teaser image

The plot on the left shows accuracy against the number of parameters for our proposed architecture PSUMNet (⋆) and existing approaches on the large-scale NTURGB+D 120 human actions dataset (cross subject).

 PSUMNet pipeline diagram2

Comparison between conventional training procedure used in most of the previous approaches (left) and our approach (right).

 

 

To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose a global frame based part stream approach as opposed to conventional modality based streams. Within each part stream, the associated data from multiple modalities is unified and consumed by the processing pipeline. Experimentally, PSUMNet achieves state-of-the-art performance on the widely used NTURGB+D 60/120 dataset and the dense joint skeleton dataset NTU 60-X/120-X. PSUMNet is highly efficient and outperforms competing methods which use 100%-400% more parameters. PSUMNet also generalizes to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet’s scalability, performance and efficiency make it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices.

Supplementary          PDF     GitHub

 PSUMNet architecture diagram 2

  1. Overall architecture of one stream of the proposed network. The input skeleton is passed through the Multi Modality Data Generator (MMDG), which generates joint, bone, joint velocity and bone velocity data from the input and concatenates the modalities along the channel dimension, as shown in (b).
  2. This multi-modal data is processed by the Spatio Temporal Relational Module (STRM), followed by global average pooling and a fully connected (FC) layer.
  3. Spatio Temporal Relational Block (STRB), where the input data is passed through the Spatial Attention Map Generator (SAMG) for spatial relation modeling, followed by the Temporal Relational Module (TRM). As shown in (a), multiple STRBs stacked together make up the STRM.
  4. Spatial Attention Map Generator (SAMG), which dynamically models an adjacency matrix (Ahyb) to capture spatial relations between joints. The predefined adjacency matrix (A) is used for regularization.
  5. Temporal Relational Module (TRM), which consists of multiple temporal convolution blocks in parallel. The outputs of the temporal convolution blocks are concatenated to generate the final features.
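The data generation step in the first item can be sketched as follows. This is an illustrative reading rather than the released PSUMNet code. The function name `multi_modality_data`, the `(C, T, V)` array layout, the `parents` array, the definition of a bone as a joint minus its parent joint, and the zero-padded first velocity frame are all assumptions based on common practice in skeleton-based action recognition.

```python
import numpy as np


def multi_modality_data(joints, parents):
    """Build joint, bone, joint velocity and bone velocity data and
    concatenate them along the channel dimension.

    joints:  (C, T, V) array: C coordinates, T frames, V joints (assumed layout).
    parents: length-V sequence of parent joint indices; the root joint
             points to itself (hypothetical skeleton description).
    Returns a (4 * C, T, V) array.
    """
    joints = np.asarray(joints, dtype=float)
    parents = np.asarray(parents, dtype=int)

    # Bone: vector from the parent joint to the joint (zero for the root).
    bones = joints - joints[:, :, parents]

    # Velocity: difference between consecutive frames, zero at the first frame.
    joint_velocity = np.zeros_like(joints)
    joint_velocity[:, 1:] = joints[:, 1:] - joints[:, :-1]
    bone_velocity = np.zeros_like(bones)
    bone_velocity[:, 1:] = bones[:, 1:] - bones[:, :-1]

    # Unify the four modalities into one tensor along the channel axis.
    return np.concatenate([joints, bones, joint_velocity, bone_velocity], axis=0)
```

For 3D joints (C = 3) the result has 12 channels, so a single network pass sees all four modalities at once instead of one network being trained per modality.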

 

DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games

Nikhil Bansal        Kartik Gupta        Kiruthika Kannan        Sivani Pentapati        Ravi Kiran Sarvadevabhatla       

ACMMM 2022

PAPER        DATASET     CODE


What is atypical sketch content and why do we need to detect them?

Pictionary, the popular sketch-based game, forbids the Drawer from writing text (atypical content) on the canvas. Manual intervention for such rule violations is impractical and does not scale in the web-based online setting of this game, which involves a large number of concurrent sessions. Apart from malicious gameplay, atypical sketch content can also exist in non-malicious, benign scenarios. For instance, the Drawer may choose to draw arrows and other such icons to attract the Guesser’s attention and provide indirect hints regarding the target word. Accurately localizing such activities can aid statistical learning approaches which associate sketch-based representations with corresponding target words.
 

AtyPict: the first ever dataset of atypical whiteboard content

The categories of atypical content usually encountered in Pictionary sessions are:

  • Text: Drawer directly writes the target word or hints related to the target word on the canvas.
  • Numerical: Drawer writes numbers on canvas.
  • Circles: Drawers often circle a portion of the canvas to emphasize relevant or important content.
  • Iconic: Other items used for emphasizing content and abstract compositional structures include drawing a question mark, arrow and other miscellaneous structures (e.g. double-headed arrow, tick marks, addition symbol, cross) and striking out the sketch (which usually implies negation of the sketched item).

Multiclass Samples

Examples of atypical content detection. False negatives are shown as dashed rectangles and false positives as dotted rectangles. Color codes are: text, numbers, question marks, arrows, circles and other icons (e.g. tick marks, addition symbol).

pictdraw

 

Screenshots of our data collection tool showing Drawer (left) and Guesser (right) activity during a Pictionary game. In this case, the Drawer has violated the game rules by writing text ('Spiderm') on the canvas. An automatic alert notifying the player (see top left of screenshot) and identifying the text location (red box on canvas) is generated by our system DrawMon.

 

 

 

CanvasDash: an intuitive dashboard UI for annotation and visualization

labelling tool

An illustration of annotation using our Canvas-Dash interface.

atypict stats2

 

The distribution of atypical content categories shows a significant imbalance, with the category 'Individual letters' occurring more often than the others.

DrawMon: a distributed system for sketch content-based alert generation

DrawMon is a distributed alert generation system (see figure below). Each game session is managed by a central Session Manager, which assigns a unique session id.

 

  • For a given session, whenever a sketch stroke is drawn, the accumulated canvas content (i.e. the strokes rendered so far) is tagged with the session id and relayed to a shared Session Canvas Queue.
  • For efficiency, the canvas content is represented as a lightweight Scalable Vector Graphics (SVG) object. The contents of the Session Canvas Queue are dequeued and rendered into corresponding 512×512 binary images by the Distributed Rendering Module in a distributed and parallel fashion.
  • The rendered binary images, tagged with the session id, are placed in the Rendered Image Queue. The contents of the Rendered Image Queue are dequeued and processed by the Distributed Detection Module. Each detection module consists of our custom-designed deep neural network, CanvasNet.
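The two-queue flow described above can be sketched with Python's standard `queue` and `threading` modules. This is a single-process stand-in for illustration only, not the DrawMon code. The names `CanvasMessage`, `RenderedImage`, `render_svg`, `detect` and `run_pipeline` are hypothetical, the renderer is a stub that returns a blank 512×512 image, and the detector is a stub that takes the place of CanvasNet and reports nothing.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class CanvasMessage:
    session_id: str
    svg: str  # accumulated canvas content for the session


@dataclass
class RenderedImage:
    session_id: str
    pixels: list  # 512x512 binary image as nested lists


def render_svg(svg):
    # Stub renderer (assumption): a real system would rasterize the SVG.
    return [[0] * 512 for _ in range(512)]


def detect(pixels):
    # Stub detector standing in for CanvasNet; returns a list of findings.
    return []


def rendering_worker(canvas_queue, image_queue):
    # Dequeue canvas content, render it, and pass it on with its session id.
    while (msg := canvas_queue.get()) is not None:
        image_queue.put(RenderedImage(msg.session_id, render_svg(msg.svg)))


def detection_worker(image_queue, alerts):
    # Dequeue rendered images and record detections per session.
    while (img := image_queue.get()) is not None:
        alerts.append((img.session_id, detect(img.pixels)))


def run_pipeline(messages, num_renderers=2, num_detectors=2):
    canvas_queue, image_queue, alerts = queue.Queue(), queue.Queue(), []
    renderers = [threading.Thread(target=rendering_worker,
                                  args=(canvas_queue, image_queue))
                 for _ in range(num_renderers)]
    detectors = [threading.Thread(target=detection_worker,
                                  args=(image_queue, alerts))
                 for _ in range(num_detectors)]
    for t in renderers + detectors:
        t.start()
    for msg in messages:
        canvas_queue.put(msg)
    # A None sentinel per worker shuts each stage down in order.
    for _ in renderers:
        canvas_queue.put(None)
    for t in renderers:
        t.join()
    for _ in detectors:
        image_queue.put(None)
    for t in detectors:
        t.join()
    return alerts
```

Because every item carries its session id, workers in either stage can process canvases from any session in any order, which is what allows both stages to be scaled independently.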

PictGuess architecture

CanvasNet: a model for detecting atypical sketch instances

CanvasNet processes the rendered image as input and outputs a list of atypical activities (if any) along with associated meta-information (atypical content category, 2-D spatial location).

CanvasNet

DrawMon in Action

 

 

Paper

  • PDF: Paper
  • arXiv: Coming soon!
  • ACMMM-2022: Coming soon!

Code

The code for this work is available on GitHub!
Link: pictionary-cvit/drawmon

Acknowledgements

We wish to acknowledge a grant from the KCIS - TCS foundation.

 

Bibtex

Please consider citing the following works if you make use of our work:

@InProceedings{DrawMonACMMM2022,
author="Bansal, Nikhil
and Gupta, Kartik
and Kannan, Kiruthika
and Pentapati, Sivani
and Sarvadevabhatla, Ravi Kiran",
title="DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games",
booktitle = "ACM conference on Multimedia (ACMMM)",
year="2022"
}

Compressing Video Calls using Synthetic Talking Heads


Madhav Agarwal, Anchit Gupta , Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

BMVC, 2022

[ Interactive Demo ] | [ Demo Video ]


We depict the entire pipeline used for compressing talking head videos. In our pipeline, we detect and send key points of alternate frames over the network and regenerate the talking heads at the receiver’s end. We then use frame interpolation to generate the rest of the frames and use super-resolution to generate high-resolution outputs.

Abstract

We leverage modern advancements in talking head generation to propose an end-to-end system for talking head video compression. Our algorithm transmits pivot frames intermittently while the rest of the talking head video is generated by animating them. We use a state-of-the-art face reenactment network to detect key points in the non-pivot frames and transmit them to the receiver. A dense flow is then calculated to warp a pivot frame to reconstruct the non-pivot ones. Transmitting key points instead of full frames leads to significant compression. We propose a novel algorithm to adaptively select the best-suited pivot frames at regular intervals to provide a smooth experience. We also propose a frame interpolator at the receiver’s end to improve the compression levels further. Finally, a face enhancement network improves reconstruction quality, significantly improving several aspects like the sharpness of the generations. We evaluate our method both qualitatively and quantitatively on benchmark datasets and compare it with multiple compression techniques. A demo video is attached to the supplementary, providing qualitative results.
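As a rough illustration of what the sender transmits, the sketch below builds a per-frame transmission plan. It is a simplification and not the authors' method: the pivot frame is chosen at a fixed interval rather than adaptively, key points are sent for every second remaining frame, and the names `Packet`, `plan_transmission`, `pivot_interval` and `keypoint_stride` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    frame_index: int
    kind: str  # "pivot" (full frame) or "keypoints" (key points only)


def plan_transmission(num_frames, pivot_interval, keypoint_stride=2):
    """Decide what is sent for each frame of a talking head video.

    Frames that produce no packet are not transmitted at all; in this
    sketch the receiver is assumed to recover them by frame interpolation.
    """
    packets = []
    for i in range(num_frames):
        if i % pivot_interval == 0:
            # Simplification: fixed-interval pivot instead of adaptive selection.
            packets.append(Packet(i, "pivot"))
        elif i % keypoint_stride == 0:
            packets.append(Packet(i, "keypoints"))
    return packets
```

For eight frames with a pivot interval of four, the plan contains two full pivot frames (frames 0 and 4) and two key point packets (frames 2 and 6); the four odd-numbered frames are never sent.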

Paper

  • Paper
    Compressing Video Calls using Synthetic Talking Heads

    Madhav Agarwal, Anchit Gupta, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar
    The 33rd British Machine Vision Conference, BMVC, 2022.
    [PDF ] | [BibTeX]

    @inproceedings{compressing2022bmvc,
    title={Compressing Video Calls using Synthetic Talking Heads},
    author={Agarwal, Madhav and Gupta, Anchit and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P and Jawahar, CV},
    booktitle={British Machine Vision Conference (BMVC)},
    year={2022} }


Contact

  • Madhav Agarwal
  • Rudrabha Mukhopadhyay

Audio-Visual Face Reenactment


Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

WACV, 2023

[ Code ]   | [ Interactive Demo ] | [ Demo Video ]


The overall pipeline of our proposed Audio-Visual Face Reenactment network (AVFR-GAN) is shown in this figure. We take source and driving images along with their face mesh and segmentation mask to extract keypoints. An audio encoder extracts features from the driving audio and uses them to provide attention on the lip region. The audio and visual feature maps are warped together and passed to the carefully designed Identity-Aware Generator, along with extracted features of the source image, to generate the final output.

Abstract

This work proposes a novel method to generate realistic talking head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense motion field generated using learnable keypoints. We improve the quality of lip sync using audio as an additional input, helping the network to attend to the mouth region. We use additional priors using face segmentation and face mesh to improve the structure of the reconstructed faces. Finally, we improve the visual quality of the generations by incorporating a carefully designed identity-aware generator module. The identity-aware generator takes the source image and the warped motion features as input to generate a high-quality output with fine-grained details. Our method produces state-of-the-art results and generalizes well to unseen faces, languages, and voices. We comprehensively evaluate our approach using multiple metrics and outperform current techniques both qualitatively and quantitatively. Our work opens up several applications, including enabling low bandwidth video calls.

Paper

  • Paper
    Audio-Visual Face Reenactment

    Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar
    IEEE/CVF Winter Conference on Applications of Computer Vision, WACV, 2023.
    [PDF ] | [BibTeX]

    @InProceedings{Agarwal_2023_WACV,
    author = {Agarwal, Madhav and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C. V.},
    title = {Audio-Visual Face Reenactment},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month = {January},
    year = {2023},
    pages = {5178-5187}
    }


Contact

  1. Madhav Agarwal
  2. Rudrabha Mukhopadhyay

More Articles …

  1. My View is the Best View: Procedure Learning from Egocentric Videos
  2. Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors
  3. Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild
  4. ETL: Efficient Transfer Learning for Face Tasks