Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback

Abstract

We introduce Sketchtopia, a large-scale dataset and AI framework designed to explore goal-driven, multimodal communication through asynchronous interactions in a Pictionary-inspired setup. Sketchtopia captures natural human interactions, including freehand sketches, open-ended guesses, and iconic feedback gestures, showcasing the complex dynamics of cooperative communication under constraints. It features over 20K gameplay sessions from 916 players, capturing 263K sketches, 10K erases, 56K guesses and 19.4K iconic feedback signals.

We introduce multimodal foundational agents with capabilities for generative sketching, guess generation and asynchronous communication. Our dataset also includes 800 human-agent sessions for benchmarking the agents. We introduce novel metrics to characterize collaborative success, responsiveness to feedback and inter-agent asynchronous communication. Sketchtopia pushes the boundaries of multimodal AI, establishing a new benchmark for studying asynchronous, goal-oriented interactions between humans and AI agents.

Key Contributions

Rich Dataset

Large-scale, multimodal data capturing real-world asynchronous sketching dynamics.

Foundational Agents

DRAWBOT & GUESSBOT designed for asynchronous interaction.

New Metrics

Novel metrics (AAO, FRS, MATS) for evaluating collaborative success, responsiveness to feedback and asynchronous communication.

Dataset Highlights: Multimodal & Asynchronous

20K+

Sessions

Rich collection capturing diverse human Pictionary gameplay.

263K+

Sketches

Massive corpus of iterative freehand drawings for visual communication.

56K+

Open-ended Guesses

Natural language guesses reflecting understanding of visual cues.

19K+

Iconic Feedback

Non-verbal cues (👍👎❓) guiding the collaborative process asynchronously.

916

Players

Data from a diverse participant group ensuring robust analysis.

800

Human-Agent Sessions

Valuable data from humans interacting with our agents.

Sketchtopia Agents

ACTIONDECIDER: The Asynchronous Controller

The ActionDecider is the core component that enables asynchronous communication. It acts as a lightweight controller, continuously monitoring the game state (sketches, guesses, feedback) and deciding when agents should act and what action they should take. This allows for fluid, human-like interaction without the constraints of turn-taking, mirroring real-world communication dynamics.
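
As a rough illustration of the controller idea described above, the following Python sketch inspects a game state and picks an action for each agent. Everything in it (the `GameState` fields, the `decide_drawer` and `decide_guesser` rules, the `min_new_strokes` threshold) is a hypothetical stand-in; the internals of the actual ActionDecider are not described on this page.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Action(Enum):
    DRAW = "draw"
    GUESS = "guess"
    IDLE = "idle"


@dataclass
class GameState:
    # Hypothetical snapshot of what a controller might observe.
    num_strokes: int = 0                 # strokes currently on the canvas
    strokes_at_last_guess: int = 0       # canvas size when the last guess was made
    guesses: List[str] = field(default_factory=list)
    last_feedback: Optional[str] = None  # "up", "down", "question", or None


def decide_drawer(state: GameState) -> Action:
    """Toy rule: draw on an empty canvas or after non-positive feedback."""
    if state.num_strokes == 0:
        return Action.DRAW
    if state.last_feedback in ("down", "question"):
        return Action.DRAW
    return Action.IDLE


def decide_guesser(state: GameState, min_new_strokes: int = 2) -> Action:
    """Toy rule: guess only once enough new strokes have appeared."""
    new_strokes = state.num_strokes - state.strokes_at_last_guess
    return Action.GUESS if new_strokes >= min_new_strokes else Action.IDLE
```

Because each agent is queried independently against the same state, neither has to wait for the other to finish, which is the property the asynchronous setup relies on. A real controller would be evaluated continuously as new events arrive rather than by fixed rules like these.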

ActionDecider: The Brains Behind Asynchronous Interaction

Multi-Modality Sketchtopia Agent Diagram

DRAWBOT: The Sketcher

DRAWBOT visually communicates the target word through asynchronous sketching, leveraging state-of-the-art generative models fine-tuned for iterative refinement based on communication context.

Generates sketches from target concepts, adapting to canvas state.
Refines drawings using feedback signals.
Adapts to 👍 👎 ❓.
Operates asynchronously, deciding when to draw or stay idle.

Drawbot Architecture

Multimodal Architecture

GUESSBOT: The Guesser

GUESSBOT interprets sketches and generates intelligent guesses, using a retrieval-based framework informed by historical interaction data.

Analyzes sketch canvas content using vision models.
Generates relevant guesses using efficient retrieval and filtering.
Acts asynchronously, deciding when new information warrants a guess.
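
The retrieval-and-filtering step listed above can be sketched in a few lines of Python. This is only an assumed shape for such a step: the embeddings are made-up 3-d vectors, and `retrieve_guesses`, `word_embeddings` and `already_guessed` are hypothetical names, not part of GUESSBOT. In a real system the sketch embedding would come from a vision model.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_guesses(
    sketch_embedding: Sequence[float],
    word_embeddings: Dict[str, Sequence[float]],
    already_guessed: Set[str],
    top_k: int = 3,
) -> List[str]:
    """Rank candidate words by similarity to the sketch, skipping earlier guesses."""
    candidates = [
        (cosine(sketch_embedding, emb), word)
        for word, emb in word_embeddings.items()
        if word not in already_guessed
    ]
    candidates.sort(reverse=True)
    return [word for _, word in candidates[:top_k]]


# Made-up embeddings, purely for illustration.
vocab = {
    "walk": [0.9, 0.1, 0.0],
    "run": [0.8, 0.3, 0.0],
    "dustbin": [0.0, 0.1, 0.9],
}
guesses = retrieve_guesses([1.0, 0.2, 0.0], vocab, already_guessed={"run"}, top_k=2)
print(guesses)
```

Filtering out earlier guesses keeps the agent from repeating a word that has already been rejected, which matters when guesses are issued at arbitrary times rather than in fixed turns.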

Guessbot Architecture

Multimodal Architecture

Evaluating Agent Performance

🔀

AAO

Asynchronous Action Overlap

Measures concurrent actions between agents. AAO values close to those observed in human sessions suggest more natural, human-like interaction dynamics.
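
This page does not give the formula for AAO. One plausible reading, shown here only as an assumption, is the share of active time during which both participants are acting at once. The interval representation and the function names below are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_time, end_time) in seconds


def total_length(intervals: List[Interval]) -> float:
    """Total time covered by the intervals, counting overlapping parts once."""
    covered = 0.0
    current_end = float("-inf")
    for start, end in sorted(intervals):
        if end <= current_end:
            continue
        covered += end - max(start, current_end)
        current_end = end
    return covered


def overlap_length(a: List[Interval], b: List[Interval]) -> float:
    """Time during which an interval in `a` and an interval in `b` are both active."""
    pairwise = [
        (max(s1, s2), min(e1, e2))
        for s1, e1 in a
        for s2, e2 in b
        if max(s1, s2) < min(e1, e2)
    ]
    return total_length(pairwise)


def action_overlap(drawer: List[Interval], guesser: List[Interval]) -> float:
    """Illustrative ratio: concurrent time divided by time in which anyone acts."""
    active = total_length(drawer + guesser)
    return overlap_length(drawer, guesser) / active if active else 0.0


drawer_actions = [(0.0, 4.0), (6.0, 9.0)]
guesser_actions = [(3.0, 7.0)]
ratio = action_overlap(drawer_actions, guesser_actions)
print(round(ratio, 3))
```

Under this reading, a strictly turn-based exchange would score 0, and the comparison of interest is how far an agent pair's value sits from the value computed on human sessions.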

💬

FRS

Feedback Responsiveness Score

Quantifies how effectively agents adapt to feedback (👍👎) and move towards the goal.
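
The exact FRS definition is not stated on this page. As a hedged proxy only, the sketch below counts how often the guess following a feedback signal moves in the indicated direction: closer to the target after 👎, and no further away after 👍. The `similarity_to_target` callable and the toy similarity values are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Each step is (guess, feedback), where feedback is "up", "down", or None.
Step = Tuple[str, Optional[str]]


def feedback_responsiveness(
    steps: List[Step],
    similarity_to_target: Callable[[str], float],
) -> float:
    """Fraction of feedback events followed by a guess that respects the feedback.

    After "down", the next guess should score higher than the rejected one;
    after "up", it should score at least as high. Returns 0.0 if no feedback
    event is followed by another guess.
    """
    responded = 0
    counted = 0
    for (guess, feedback), (next_guess, _) in zip(steps, steps[1:]):
        if feedback is None:
            continue
        counted += 1
        before = similarity_to_target(guess)
        after = similarity_to_target(next_guess)
        if (feedback == "down" and after > before) or (
            feedback == "up" and after >= before
        ):
            responded += 1
    return responded / counted if counted else 0.0


# Hand-assigned similarities to the target "walk", purely illustrative.
toy_similarity = {"face": 0.1, "man": 0.4, "run": 0.7, "walk": 1.0}
session = [("face", "down"), ("man", "up"), ("run", None), ("walk", None)]
score = feedback_responsiveness(session, toy_similarity.__getitem__)
print(score)
```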

⏳

MATS

Multimodal Action Timing Similarity

Compares agent action timing patterns with human interactions to assess the naturalness of pacing.
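
How MATS is computed is not specified here. One simple way to compare timing patterns, offered as an assumption rather than the authors' method, is to histogram the gaps between consecutive actions for agents and for humans and take the histogram intersection. The bin edges and function names below are hypothetical.

```python
from typing import List, Sequence


def gap_histogram(timestamps: Sequence[float], bin_edges: Sequence[float]) -> List[float]:
    """Normalized histogram of gaps between consecutive action timestamps."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    counts = [0] * (len(bin_edges) - 1)
    for gap in gaps:
        for i in range(len(counts)):
            if bin_edges[i] <= gap < bin_edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def timing_similarity(
    agent_times: Sequence[float],
    human_times: Sequence[float],
    bin_edges: Sequence[float] = (0.0, 1.0, 2.0, 5.0, 60.0),
) -> float:
    """Histogram intersection of the two gap distributions (1.0 means identical)."""
    agent_hist = gap_histogram(agent_times, bin_edges)
    human_hist = gap_histogram(human_times, bin_edges)
    return sum(min(a, h) for a, h in zip(agent_hist, human_hist))


human = [0.0, 1.5, 3.0, 7.0]  # gaps: 1.5, 1.5, 4.0
agent = [0.0, 0.5, 2.0, 3.5]  # gaps: 0.5, 1.5, 1.5
similarity = timing_similarity(agent, human)
print(round(similarity, 3))
```

In this toy case the agent matches the human's frequent 1.5-second gaps but never pauses for long, so the intersection falls short of 1.0.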

Example Sessions

Target: ANGRY

Type: Human-Human

Key Guess: "Angry"

Feedback Given: 👍

Successful communication: Guesser guessed the correct emotion from the sketch and feedback.

Target: ANGRY

Type: Human-Human

Key Guess: afraid, mute

Feedback Given: ❓

Failed communication: Guesser failed to guess the correct emotion despite the sketch and feedback.

Target: DUSTBIN

Type: Human-Human

Key Guess: "Dustbin"

Feedback Given: No Feedback

Successful communication: Guesser guessed the correct target word from the sketch alone.

Target: DUSTBIN

Type: Human-Human

Key Guess: Failed guesses: face, etc.

Feedback Given: 👎

Failed communication: Guesser failed to guess the correct target word despite the sketch and feedback.

Target: WALK

Type: Human-Agent

Key Guess: "Walk"

Feedback Given: No Feedback

Successful communication: Guesser guessed the correct target word from the sketch alone.

Target: WALK

Type: Human-Agent

Key Guess: Failed guesses: man, run, etc.

Feedback Given: 👎, 👍

Failed communication: Guesser failed to guess the correct target word despite the sketch and feedback.


Authors

Mohd Hozaifa Khan

IIIT Hyderabad

Ravi Kiran Sarvadevabhatla

IIIT Hyderabad

Resources

Interactive Demo - Coming Soon!

Stay tuned for a live demo where you can experience Sketchtopia agents interacting.

In the meantime, explore the Dataset

Citation

@inproceedings{khan2025sketchtopia,
    author    = {Khan, Mohd Hozaifa and Sarvadevabhatla, Ravi Kiran},
    title     = {Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025},
    url       = {https://sketchtopia25.github.io/},
}

Intellectual Property Notice

This work is the subject of a patent application filed in [India / under PCT] and is protected under applicable intellectual property laws. All rights to the underlying technology, including the AI agents for drawing and guessing in a Pictionary-like setting, are reserved. The system is currently under active research and development. Any use, reproduction, or commercial exploitation of this work or its components without prior written consent is prohibited.

Patent Application Status: Patent Pending.

Mohd Hozaifa Khan, Ravi Kiran Sarvadevabhatla

IIIT Hyderabad|CVIT

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

[Paper] [CVPR Paper] [Dataset (WIP)] [Demo]
