
DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games

Nikhil Bansal        Kartik Gupta        Kiruthika Kannan        Sivani Pentapati        Ravi Kiran Sarvadevabhatla       

ACMMM 2022

PAPER        DATASET     CODE


What is atypical sketch content and why do we need to detect it?

Pictionary, the popular sketch-based game, forbids the Drawer from writing text (atypical content) on the canvas. Manually policing such rule violations is impractical and does not scale in a web-based online setting of this game involving a large number of concurrent sessions. Apart from malicious gameplay, atypical sketch content can also exist in non-malicious, benign scenarios. For instance, the Drawer may choose to draw arrows and other such icons to attract the Guesser’s attention and provide indirect hints regarding the target word. Accurately localizing such activities can also aid statistical learning approaches which associate sketch-based representations with corresponding target words.
 

AtyPict: the first-ever dataset of atypical whiteboard content

The categories of atypical content usually encountered in Pictionary sessions are:

  • Text: Drawer directly writes the target word or hints related to the target word on the canvas.
  • Numerical: Drawer writes numbers on canvas.
  • Circles: Drawers often circle a portion of the canvas to emphasize relevant or important content.
  • Iconic: Other items used for emphasizing content and abstract compositional structures include drawing a question mark, arrows and other miscellaneous structures (e.g. double-headed arrow, tick marks, addition symbol, cross) and striking out the sketch (which usually implies negation of the sketched item).


Examples of atypical content detection. False negatives are shown as dashed rectangles and false positives as dotted rectangles. Color codes are: text, numbers, question marks, arrows, circles and other icons (e.g. tick marks, addition symbol).


 

Screenshots of our data collection tool showing Drawer (left) and Guesser (right) activity during a Pictionary game. In this case, the Drawer has violated the game rules by writing text (`Spiderm') on the canvas. An automatic alert notifying the player (see top left of the screenshot) and identifying the text location (red box on canvas) is generated by our system, DrawMon.

 

 

 

CanvasDash: an intuitive dashboard UI for annotation and visualization


An illustration of annotation using our CanvasDash interface.


 

The distribution of atypical content categories shows significant imbalance, with the category 'Individual letters' occurring more often than others.

DrawMon: a distributed system for sketch content-based alert generation

DrawMon is a distributed alert generation system (see the figure below). Each game session is managed by a central Session Manager, which assigns it a unique session id.

 

  • For a given session, whenever a sketch stroke is drawn, the accumulated canvas content (i.e. the strokes rendered so far) is tagged with the session id and relayed to a shared Session Canvas Queue.
  • For efficiency, the canvas content is represented as a lightweight Scalable Vector Graphics (SVG) object. The contents of the Session Canvas Queue are dequeued and rendered into corresponding 512×512 binary images by the Distributed Rendering Module in a distributed and parallel fashion.
  • The rendered binary images, tagged with the session id, are placed in the Rendered Image Queue. The contents of the Rendered Image Queue are dequeued and processed by the Distributed Detection Module. Each Detection Module consists of our custom-designed deep neural network, CanvasNet. (A minimal sketch of this queueing pipeline is shown after the list.)
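
The following is a minimal, single-process sketch (not the actual DrawMon implementation) of the queueing pipeline described above, written in Python. The queue names mirror the text; render_svg_to_binary, canvasnet and alert_callback are hypothetical stand-ins for the distributed rendering module, the detection network and the session notification logic.

# Minimal single-process sketch of the DrawMon queueing pipeline (illustrative only).
import queue

session_canvas_queue = queue.Queue()   # holds (session_id, svg_string) pairs
rendered_image_queue = queue.Queue()   # holds (session_id, 512x512 binary image) pairs

def render_svg_to_binary(svg_string, size=512):
    """Hypothetical stand-in: rasterize the SVG canvas into a size x size binary image."""
    raise NotImplementedError  # e.g. via an SVG rasterizer of your choice

def rendering_worker():
    # Distributed Rendering Module (one of many parallel workers in the real system)
    while True:
        session_id, svg = session_canvas_queue.get()
        image = render_svg_to_binary(svg)
        rendered_image_queue.put((session_id, image))
        session_canvas_queue.task_done()

def detection_worker(canvasnet, alert_callback):
    # Distributed Detection Module: each worker wraps a CanvasNet instance
    while True:
        session_id, image = rendered_image_queue.get()
        detections = canvasnet(image)              # list of (category, box) pairs, possibly empty
        if detections:
            alert_callback(session_id, detections) # raise an in-game alert for this session
        rendered_image_queue.task_done()

# In DrawMon these workers run as distributed processes; in this sketch they could
# simply be started as daemon threads, one per worker.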

PictGuess architecture

CanvasNet: a model for detecting atypical sketch instances

CanvasNet processes the rendered image as input and outputs a list of atypical activities (if any) along with associated meta-information (atypical content category, 2-D spatial location).
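
A hedged sketch of the kind of per-detection meta-information described above; the field names below are illustrative assumptions, not the authors' actual API.

# Illustrative container for CanvasNet-style outputs (field names are assumptions).
from dataclasses import dataclass

@dataclass
class AtypicalDetection:
    category: str                            # e.g. "text", "numerical", "circle", "iconic"
    box: tuple[float, float, float, float]   # 2-D spatial location as (x_min, y_min, x_max, y_max)
    score: float                             # detection confidence

# A CanvasNet-style detector maps a 512x512 rendered canvas image to a
# (possibly empty) list of AtypicalDetection instances.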

CanvasNet

DrawMon in Action

 

 

Paper

  • PDF: Paper
  • arXiv: Coming soon!
  • ACMMM-2022: Coming soon!

Code

The code for this work is available on GitHub!
Link: pictionary-cvit/drawmon

Acknowledgements

We wish to acknowledge the grant from the KCIS - TCS Foundation.

 

Bibtex

Please consider citing the following if you make use of our work:

@InProceedings{DrawMonACMMM2022,
author="Bansal, Nikhil
and Gupta, Kartik
and Kannan, Kiruthika
and Pentapati, Sivani
and Sarvadevabhatla, Ravi Kiran",
title="DrawMon: A Distributed System for Detection of Atypical Sketch Content in Concurrent Pictionary Games",
booktitle = "ACM conference on Multimedia (ACMMM)",
year="2022"
}

Compressing Video Calls using Synthetic Talking Heads


Madhav Agarwal, Anchit Gupta, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

BMVC, 2022

[ Interactive Demo ] | [ Demo Video ]


We depict the entire pipeline used for compressing talking head videos. In our pipeline, we detect and send key points of alternate frames over the network and regenerate the talking heads at the receiver’s end. We then use frame interpolation to generate the rest of the frames and super-resolution to generate high-resolution outputs.
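
Below is a hedged, high-level sketch of this sender/receiver flow in Python. All model calls (select_pivot, detect_keypoints, reenact_from_keypoints, interpolate, super_resolve) are hypothetical placeholders for the face reenactment, frame interpolation and super-resolution networks described in the paper, not released code.

# Illustrative sketch of the compression pipeline shown above (placeholders only).

def sender(frames, pivot_interval, select_pivot, detect_keypoints):
    # Transmit a full pivot frame intermittently; otherwise send only the
    # key points of alternate frames (the skipped frames are interpolated later).
    for i, frame in enumerate(frames):
        if i % pivot_interval == 0:
            yield ("pivot", select_pivot(frames, i))
        elif i % 2 == 0:
            yield ("keypoints", detect_keypoints(frame))

def receiver(stream, reenact_from_keypoints, interpolate, super_resolve):
    # Reconstruct transmitted frames by warping the latest pivot to the received
    # key points, fill in the skipped frames by interpolation, then upsample.
    pivot, reconstructed = None, []
    for kind, payload in stream:
        if kind == "pivot":
            pivot = payload
            reconstructed.append(pivot)
        else:
            reconstructed.append(reenact_from_keypoints(pivot, payload))
    full_video = interpolate(reconstructed)
    return [super_resolve(frame) for frame in full_video]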

Abstract

We leverage the modern advancements in talking head generation to propose an end-to-end system for talking head video compression. Our algorithm transmits pivot frames intermittently while the rest of the talking head video is generated by animating them. We use a state-of-the-art face reenactment network to detect key points in the non-pivot frames and transmit them to the receiver. A dense flow is then calculated to warp a pivot frame to reconstruct the non-pivot ones. Transmitting key points instead of full frames leads to significant compression. We propose a novel algorithm to adaptively select the best-suited pivot frames at regular intervals to provide a smooth experience. We also propose a frame interpolator at the receiver’s end to improve the compression levels further. Finally, a face enhancement network improves reconstruction quality, significantly improving several aspects like the sharpness of the generations. We evaluate our method both qualitatively and quantitatively on benchmark datasets and compare it with multiple compression techniques. A demo video is attached to the supplementary material, providing qualitative results.

Paper

  • Paper
    Compressing Video Calls using Synthetic Talking Heads

    Madhav Agarwal, Anchit Gupta, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar
    The 33rd British Machine Vision Conference, BMVC, 2022.
    [PDF ] | [BibTeX]

    @inproceedings{compressing2022bmvc,
    title={Compressing Video Calls using Synthetic Talking Heads},
    author={Agarwal, Madhav and Gupta, Anchit and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P and Jawahar, CV},
    booktitle={British Machine Vision Conference (BMVC)},
    year={2022} }

Demo



Contact

  • Madhav Agarwal
  • Rudrabha Mukhopadhyay

Audio-Visual Face Reenactment


Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       Univ. of Bath

WACV, 2023

[ Code ]   | [ Interactive Demo ] | [ Demo Video ]


The overall pipeline of our proposed Audio-Visual Face Reenactment network (AVFR-GAN) is given in this figure. We take source and driving images along with their face mesh and segmentation mask to extract keypoints. An audio encoder extracts features from the driving audio and uses them to provide attention on the lip region. The audio and visual feature maps are warped together and passed, along with the extracted features of the source image, to the carefully designed Identity-Aware Generator to generate the final output.
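
A hedged sketch of this data flow is shown below; every function (extract_keypoints, encode_audio, dense_motion, lip_attention, identity_aware_generator) is a named placeholder for a sub-network in the figure, not the released implementation.

# Schematic sketch of the AVFR-GAN forward pass (placeholders only).

def avfr_gan_forward(source_img, driving_img, driving_audio,
                     extract_keypoints, encode_audio, dense_motion,
                     lip_attention, identity_aware_generator):
    # Keypoints from source and driving frames (aided by face mesh + segmentation priors)
    kp_source = extract_keypoints(source_img)
    kp_driving = extract_keypoints(driving_img)

    # Dense motion field that warps source features towards the driving pose
    motion = dense_motion(kp_source, kp_driving)

    # Audio features provide attention on the lip region of the warped visual features
    audio_features = encode_audio(driving_audio)
    warped = lip_attention(motion, audio_features)

    # Identity-aware generator combines warped motion features with source appearance
    return identity_aware_generator(source_img, warped)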

Abstract

This work proposes a novel method to generate realistic talking head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense motion field generated using learnable keypoints. We improve the quality of lip sync using audio as an additional input, helping the network to attend to the mouth region. We use additional priors from face segmentation and face mesh to improve the structure of the reconstructed faces. Finally, we improve the visual quality of the generations by incorporating a carefully designed identity-aware generator module. The identity-aware generator takes the source image and the warped motion features as input to generate a high-quality output with fine-grained details. Our method produces state-of-the-art results and generalizes well to unseen faces, languages, and voices. We comprehensively evaluate our approach using multiple metrics and outperform the current techniques both qualitatively and quantitatively. Our work opens up several applications, including enabling low-bandwidth video calls.

Paper

  • Paper
    Audio-Visual Face Reenactment

    Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay Namboodiri and C.V. Jawahar
    IEEE/CVF Winter Conference on Applications of Computer Vision, WACV, 2023.
    [PDF ] | [BibTeX]

    @InProceedings{Agarwal_2023_WACV,
    author = {Agarwal, Madhav and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C. V.},
    title = {Audio-Visual Face Reenactment},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month = {January},
    year = {2023},
    pages = {5178-5187}
    }

Demo



Contact

  1. Madhav Agarwal
  2. Rudrabha Mukhopadhyay

My View is the Best View: Procedure Learning from Egocentric Videos


Siddhant Bansal, Chetan Arora and C.V. Jawahar

ECCV 2022

PAPER       DATASET      CODE

What is Procedure Learning?

Given multiple videos of a task, the goal is to identify the key-steps and their order to perform the task.


Provided multiple videos of making a pizza, the goal is to identify the steps required to prepare the pizza and their order.

 

EgoProceL Dataset


EgoProceL is a large-scale dataset for procedure learning. It consists of 62 hours of egocentric videos recorded by 130 subjects performing 16 tasks for procedure learning. EgoProceL contains videos and key-step annotations for multiple tasks from CMU-MMAC, EGTEA Gaze+, and individual tasks like toy-bike assembly, tent assembly, PC assembly, and PC disassembly.

Why an egocentric dataset for Procedure Learning?

Using third-person videos for procedure learning makes the manipulated object small in appearance and often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action.
Existing datasets majorly consist of third-person videos for procedure learning. Third-person videos suffer from issues like occlusion and atypical camera locations that make them ill-suited for procedure learning. Additionally, these datasets rely on videos from YouTube that are noisy. In contrast, we propose to use egocentric videos, which overcome the issues posed by third-person videos. Third-person frames in the figure are from ProceL and CrossTask, and the first-person frames are from EgoProceL.

Overview of EgoProceL

EgoProceL consists of

  • 62 hours of videos captured by
  • 130 subjects
  • performing 16 tasks
  • a maximum of 17 key-steps
  • an average foreground ratio of 0.38
  • an average missing-steps ratio of 0.12
  • an average repeated-steps ratio of 0.49

Downloads

We recommend referring to the README before downloading the videos. Mirror link.

Videos

Link: OneDrive

Annotations

Link: OneDrive

CnC framework for Procedure Learning

We present a novel self-supervised Correspond and Cut (CnC) framework for procedure learning. CnC identifies and utilizes the temporal correspondences between the key-steps across multiple videos to learn the procedure. Our experiments show that CnC outperforms the state-of-the-art on the benchmark ProceL and CrossTask datasets by 5.2% and 6.3%, respectively.


CnC takes in multiple videos from the same task and passes them through the embedder network trained using the proposed TC3I loss. The goal of the embedder network is to learn similar embeddings for corresponding key-steps from multiple videos and for temporally close frames. The ProCut Module (PCM) localizes the key-steps required for performing the task. PCM converts the clustering problem to a multi-label graph cut problem. The output provides the assignment of frames to the respective key-steps and their ordering.
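
As a hedged illustration of the flow described in this caption, the Python sketch below names the embedder and ProCut Module after the paper's components but implements neither; all callables are placeholders.

# Schematic sketch of the CnC flow (placeholders only; not the actual implementation).

def cnc_procedure_learning(videos, embedder, procut_module, num_keysteps):
    # 1. Embed the frames of every video; in the paper the embedder is trained
    #    with the TC3I loss so that corresponding key-steps across videos (and
    #    temporally close frames within a video) get similar embeddings.
    embeddings = [embedder(video) for video in videos]

    # 2. ProCut Module (PCM): localize key-steps by casting the clustering of
    #    frame embeddings as a multi-label graph-cut problem.
    frame_to_keystep = procut_module(embeddings, num_keysteps)

    # 3. The output assigns each frame to a key-step; the key-step ordering is
    #    read off from the temporal positions of the assigned frames.
    return frame_to_keystep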

Paper

  • PDF: Paper; Supplementary
  • arXiv: Paper; Abstract
  • ECCV: Coming soon!

Code

The code for this work is available on GitHub!

Link: Sid2697/EgoProceL-egocentric-procedure-learning

Acknowledgements

This work was supported in part by the Department of Science and Technology, Government of India, under DST/ICPS/Data-Science project ID T-138. A portion of the data used in this paper was obtained from kitchen.cs.cmu.edu and the data collection was funded in part by the National Science Foundation under Grant No. EEEC-0540865. We acknowledge Pravin Nagar and Sagar Verma for recording and sharing the PC Assembly and Disassembly videos at IIIT Delhi. We also acknowledge Jehlum Vitasta Pandit and Astha Bansal for their help with annotating a portion of EgoProceL.

 

Please consider citing if you make use of the EgoProceL dataset and/or the corresponding code:

 
@InProceedings{EgoProceLECCV2022,
author="Bansal, Siddhant
and Arora, Chetan
and Jawahar, C.V.",
title="My View is the Best View: Procedure Learning from Egocentric Videos",
booktitle = "European Conference on Computer Vision (ECCV)", 
year="2022"
}

@InProceedings{CMU_Kitchens,
author = "De La Torre, F. and Hodgins, J. and Bargteil, A. and Martin, X. and Macey, J. and Collado, A. and Beltran, P.",
title = "Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database.",
booktitle = "Robotics Institute",
year = "2008"
}

@InProceedings{egtea_gaze_p,
author = "Li, Yin and Liu, Miao and Rehg, James M.",
title =  "In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video",
booktitle = "European Conference on Computer Vision (ECCV)",
year = "2018"
}

@InProceedings{meccano,
    author    = "Ragusa, Francesco and Furnari, Antonino and Livatino, Salvatore and Farinella, Giovanni Maria",
    title     = "The MECCANO Dataset: Understanding Human-Object Interactions From Egocentric Videos in an Industrial-Like Domain",
    booktitle = "Winter Conference on Applications of Computer Vision (WACV)",
    year      = "2021"
}

@InProceedings{tent,
author = "Jang, Youngkyoon and Sullivan, Brian and Ludwig, Casimir and Gilchrist, Iain and Damen, Dima and Mayol-Cuevas, Walterio",
title = "EPIC-Tent: An Egocentric Video Dataset for Camping Tent Assembly",
booktitle = "International Conference on Computer Vision (ICCV) Workshops",
year = "2019"
}

Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors


Sindhu B Hegde*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar

IIIT Hyderabad       University of Oxford       Univ. of Bath

ACM-MM, 2022

[ Code ]   | [ Paper ] | [ Demo Video ]


We solve the problem of upsampling extremely low-resolution (LR) talking-face videos to generate high-resolution (HR) outputs. Our approach exploits LR frames (8x8 pixels), the corresponding audio signal and a single HR target identity image to synthesize realistic, high-quality talking-face videos (256x256 pixels).
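
The two-stage interface implied by this description can be sketched as below; stage1_coarse_net and stage2_identity_animator are hypothetical names for the stages described in the abstract, not the authors' released API.

# Hedged interface sketch of the audio-visual upsampling pipeline (placeholders only).

def upsample_talking_face(lr_frames_8x8, audio, hr_identity_256,
                          stage1_coarse_net, stage2_identity_animator):
    # Stage 1: the audio prior plus the 8x8 frames yield a coarse intermediate video.
    coarse_video = stage1_coarse_net(lr_frames_8x8, audio)

    # Stage 2: animate the single high-resolution identity image with the coarse
    # video to produce the final 256x256 output frames.
    return stage2_identity_animator(hr_identity_256, coarse_video)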

Abstract

In this paper, we explore an interesting question of what can be obtained from an 8×8 pixel video sequence. Surprisingly, it turns out to be quite a lot. We show that when we process this 8x8 video with the right set of audio and image priors, we can obtain a full-length, 256x256 video. We achieve this 32x scaling of an extremely low-resolution input using our novel audio-visual upsampling network. The audio prior helps to recover the elemental facial details and precise lip shapes, and a single high-resolution target identity image prior provides us with rich appearance details. Our approach is an end-to-end multi-stage framework. The first stage produces a coarse intermediate output video that can then be used to animate the single target identity image and generate realistic, accurate and high-quality outputs. Our approach is simple and performs exceedingly well (an 8× improvement in FID score) compared to previous super-resolution methods. We also extend our model to talking-face video compression, and show that we obtain a 3.5x improvement in terms of bits/pixel over the previous state-of-the-art. The results from our network are thoroughly analyzed through extensive ablation and comparative analysis and demonstration videos (in the paper and supplementary material).

Paper

  • Paper
    Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors

    Sindhu B Hegde*, Rudrabha Mukhopadhyay*, Vinay Namboodiri and C.V. Jawahar
    ACM International Conference on Multimedia, ACM-MM, 2022.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Contact

  1. Sindhu Hegde
  2. Rudrabha Mukhopadhyay
