Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback

 

Mohd Hozaifa Khan, Ravi Kiran Sarvadevabhatla

IIIT Hyderabad|CVIT

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

 

[Paper] [CVPR Paper] [Dataset (WIP)] [Demo]

Audio Overview

 

 

 

 

 

Abstract

 

We introduce Sketchtopia, a large-scale dataset and AI framework designed to explore goal-driven, multimodal communication through asynchronous interactions in a Pictionary-inspired setup. Sketchtopia captures natural human interactions, including freehand sketches, open-ended guesses, and iconic feedback gestures, showcasing the complex dynamics of cooperative communication under constraints. It features over 20K gameplay sessions from 916 players, capturing 263K sketches, 10K erases, 56K guesses and 19.4K iconic feedback signals.

We introduce multimodal foundational agents with capabilities for generative sketching, guess generation and asynchronous communication. Our dataset also includes 800 human-agent sessions for benchmarking the agents. We introduce novel metrics to characterize collaborative success, responsiveness to feedback and inter-agent asynchronous communication. Sketchtopia pushes the boundaries of multimodal AI, establishing a new benchmark for studying asynchronous, goal-oriented interactions between humans and AI agents.

 

Key Contributions

 

Rich Dataset

Large-scale, multimodal data capturing real-world asynchronous sketching dynamics.

Foundational Agents

DRAWBOT & GUESSBOT designed for asynchronous interaction.

New Metrics

New metrics (AAO, FRS and MATS) for evaluating asynchronous agent interaction.

 

 

Dataset Highlights: Multimodal & Asynchronous

20K+

Sessions

Rich collection capturing diverse human Pictionary gameplay.

263K+

Sketches

Massive corpus of iterative freehand drawings for visual communication.

56K+

Open-ended Guesses

Natural language guesses reflecting understanding of visual cues.

19K+

Iconic Feedback

Non-verbal cues (👍👎❓) guiding the collaborative process asynchronously.

916

Players

Data from a diverse participant group ensuring robust analysis.

800

Human-Agent Sessions

Valuable data from humans interacting with our agents.

 

Sketchtopia Agents

ACTIONDECIDER: The Asynchronous Controller

The ActionDecider is the core component that enables asynchronous communication. It acts as a lightweight controller, continuously monitoring the game state (sketches, guesses, feedback) and deciding when agents should act and what action they should take. This allows for fluid, human-like interaction without the constraints of turn-taking, mirroring real-world communication dynamics.
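The controller idea above can be illustrated with a minimal rule-based sketch. This is not the ActionDecider described in the paper: the `GameState` fields, the `Action` enum, the feedback labels and the timing thresholds are all hypothetical, chosen only to show how an agent can choose between acting and staying idle from the shared game state instead of waiting for a turn.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    IDLE = "idle"
    DRAW = "draw"
    GUESS = "guess"


@dataclass
class GameState:
    # Hypothetical state fields; times are seconds since the session started.
    now: float
    last_stroke_time: Optional[float]    # when the canvas last changed
    last_guess_time: Optional[float]     # when the guesser last guessed
    last_feedback: Optional[str]         # "up", "down", "question" or None
    last_feedback_time: Optional[float]


def decide_guesser_action(state: GameState, min_gap: float = 2.0) -> Action:
    """Guess only when the canvas changed since the last guess and a short gap has passed."""
    if state.last_stroke_time is None:
        return Action.IDLE  # nothing has been drawn yet
    canvas_changed = (
        state.last_guess_time is None
        or state.last_stroke_time > state.last_guess_time
    )
    rested = (
        state.last_guess_time is None
        or state.now - state.last_guess_time >= min_gap
    )
    return Action.GUESS if canvas_changed and rested else Action.IDLE


def decide_drawer_action(state: GameState, idle_limit: float = 3.0) -> Action:
    """Draw when unanswered corrective feedback exists or the canvas has stalled."""
    if state.last_stroke_time is None:
        return Action.DRAW  # start the sketch
    unanswered_feedback = (
        state.last_feedback in {"down", "question"}
        and state.last_feedback_time is not None
        and state.last_feedback_time > state.last_stroke_time
    )
    stalled = state.now - state.last_stroke_time >= idle_limit
    return Action.DRAW if unanswered_feedback or stalled else Action.IDLE
```

Because both functions read the same state independently, the drawer and the guesser may act at the same moment, which is the behavior that strict turn-taking rules out.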

DRAWBOT: The Sketcher

DRAWBOT visually communicates the target word through asynchronous sketching, leveraging state-of-the-art generative models fine-tuned for iterative refinement based on communication context.

  • Generates sketches from target concepts, adapting to canvas state.
  • Refines drawings using feedback signals.
  • Adapts to 👍 👎 ❓.
  • Operates asynchronously, deciding when to draw or stay idle.

GUESSBOT: The Guesser

GUESSBOT interprets sketches and generates intelligent guesses, using a retrieval-based framework informed by historical interaction data.

  • Analyzes sketch canvas content using vision models.
  • Generates relevant guesses using efficient retrieval and filtering.
  • Acts asynchronously, deciding when new information warrants a guess.
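The retrieve-then-filter step in the list above might look roughly like the following. This is a toy sketch under stated assumptions, not GUESSBOT's implementation: the embedding vectors, the `index` of (word, embedding) pairs and the function names are hypothetical, and cosine similarity stands in for whatever similarity the real system uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_guesses(sketch_vec, index, already_guessed, k=3):
    """Return the k candidate words most similar to the sketch embedding.

    index: list of (word, embedding) pairs, a hypothetical stand-in for a
    store built from past sessions.
    already_guessed: words to filter out so the agent does not repeat itself.
    """
    scored = [
        (cosine(sketch_vec, emb), word)
        for word, emb in index
        if word not in already_guessed
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:k]]


# Toy 2-D embeddings, for illustration only.
toy_index = [("cat", [1.0, 0.0]), ("dog", [0.9, 0.1]), ("car", [0.0, 1.0])]
print(retrieve_guesses([1.0, 0.0], toy_index, already_guessed={"cat"}, k=1))
```

Filtering out earlier guesses before ranking keeps each new guess informative, since repeating a rejected word tells the drawer nothing new.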

 

Evaluating Agent Performance

 

🔀
AAO

Asynchronous Action Overlap

Measures concurrent actions between agents. AAO values close to those observed for human players suggest more natural, human-like interaction dynamics.
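One simple way to quantify concurrent activity is the share of session time during which both players are acting. The sketch below is an illustrative proxy only; the paper's exact AAO definition is not given on this page, and the interval representation and the name `overlap_fraction` are assumptions.

```python
def overlap_fraction(drawer_intervals, guesser_intervals, session_length):
    """Fraction of the session during which both players are active at once.

    Each interval is a (start, end) pair in seconds. Intervals within one
    list are assumed not to overlap each other.
    """
    if session_length <= 0:
        raise ValueError("session_length must be positive")
    overlap = 0.0
    for d_start, d_end in drawer_intervals:
        for g_start, g_end in guesser_intervals:
            # Length of the intersection of the two intervals, or 0 if disjoint.
            overlap += max(0.0, min(d_end, g_end) - max(d_start, g_start))
    return overlap / session_length
```

Under this proxy, a strictly turn-based exchange scores 0, and the value for an agent pair can be compared against the value computed from human-human sessions.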

💬
FRS

Feedback Responsiveness Score

Quantifies how effectively agents adapt to feedback (👍👎) and move toward the goal.
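As a rough intuition for feedback responsiveness, one could count how often an agent's next action moves closer to the target after feedback arrives. The sketch below is a toy stand-in, not the FRS formula from the paper: the event tuples, the "closeness to target" scores and the scoring rule are all assumptions made for illustration.

```python
def responsiveness(events):
    """Fraction of feedback events followed by an appropriate change.

    events: list of (feedback, before, after) tuples, where feedback is
    "up" or "down" and before/after are hypothetical closeness-to-target
    scores for the agent's output before and after the feedback.
    """
    if not events:
        return 0.0
    good = 0
    for feedback, before, after in events:
        if feedback == "down" and after > before:
            good += 1  # negative feedback led to an improvement
        elif feedback == "up" and after >= before:
            good += 1  # positive feedback was not followed by a regression
    return good / len(events)
```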

MATS

Multimodal Action Timing Similarity

Compares agent action timing patterns with human interactions to assess the naturalness of pacing.
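A timing comparison of this kind can be approximated by comparing the distributions of gaps between consecutive actions. The following sketch uses histogram intersection over inter-action gaps; it is only one plausible proxy, not the MATS definition from the paper, and the bin width, bin count and function names are assumptions.

```python
def gap_histogram(timestamps, bin_width=1.0, n_bins=5):
    """Normalized histogram of gaps between consecutive action timestamps.

    timestamps must be sorted in ascending order (seconds). Gaps larger
    than the last bin edge are counted in the final bin.
    """
    hist = [0.0] * n_bins
    for earlier, later in zip(timestamps, timestamps[1:]):
        idx = min(int((later - earlier) // bin_width), n_bins - 1)
        hist[idx] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def timing_similarity(agent_times, human_times, bin_width=1.0, n_bins=5):
    """Histogram intersection in [0, 1]; 1 means identical gap distributions."""
    agent_hist = gap_histogram(agent_times, bin_width, n_bins)
    human_hist = gap_histogram(human_times, bin_width, n_bins)
    return sum(min(a, h) for a, h in zip(agent_hist, human_hist))
```

Histogram intersection is used here because it is bounded and easy to read; a distance between distributions could be substituted without changing the overall idea.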

 

Example Sessions

 

Authors

👤
Mohd Hozaifa Khan

IIIT Hyderabad

👤
Ravi Kiran Sarvadevabhatla

IIIT Hyderabad

Resources

Interactive Demo - Coming Soon!

Stay tuned for a live demo where you can experience Sketchtopia agents interacting.

In the meantime, explore the Dataset

 

Citation

@inproceedings{khan2025sketchtopia,
author = {Mohd Hozaifa Khan and Ravi Kiran Sarvadevabhatla},
title = {Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
series = {},
volume = {},
pages = {},
publisher = {},
year = {2025},
url = {https://sketchtopia25.github.io/},
}

 

Intellectual Property Notice

This work is the subject of a patent application filed in [India / under PCT] and is protected under applicable intellectual property laws. All rights to the underlying technology, including the AI agents for drawing and guessing in a Pictionary-like setting, are reserved. The system is currently under active research and development. Any use, reproduction, or commercial exploitation of this work or its components without prior written consent is prohibited.

Patent Application Status: Patent Pending.