
Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback

 

Mohd Hozaifa Khan, Ravi Kiran Sarvadevabhatla

IIIT Hyderabad|CVIT

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

 

[Paper] [CVPR Paper] [Dataset (WIP)] [Demo]

Audio Overview

 

Sketchtopia Teaser Visual 1 - Example Sketch Interaction
Sketchtopia Teaser Visual 2 - Diverse Sketches from Dataset
Sketchtopia Teaser Visual 3 - Asynchronous Communication Example

 


Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs

Sainithin Artham, Avijit Dasgupta, Shankar Gangisetty, C V Jawahar

 

 Figure. Illustration of a driving scenario where the ADAS vehicle predicts a left lane change (what) to avoid slower traffic ahead (why). Existing DIP models lacking reasoning may miss such cues, while our framework jointly learns and distills both maneuver and explanation, improving decision quality.

 

Abstract

 

Predicting a driver's intent (e.g., turns, lane changes) is a critical capability for modern Advanced Driver Assistance Systems (ADAS). While recent Multimodal Large Language Models (MLLMs) show promise in general vision-language tasks, we find that zero-shot MLLMs still lag behind domain-specific approaches for Driver Intention Prediction (DIP). To address this, we introduce DriveXplain, a zero-shot framework based on MLLMs that leverages rich visual cues such as optical flow and road semantics to automatically generate both the intended maneuver (what) and a rich natural language explanation (why). These maneuver-explanation pairs are then distilled into a compact MLLM, which jointly learns to predict intentions and corresponding explanations. We show that incorporating explanations during training leads to substantial gains over models trained solely on labels, as distilling explanations instills reasoning capabilities by enabling the model to understand not only what decisions to make but also why those decisions are made. Comprehensive experiments across structured (Brain4Cars, AIDE) and unstructured (DAAD) datasets demonstrate that our approach achieves state-of-the-art results on the DIP task, outperforming zero-shot and domain-specific baselines. We also present ablation studies to evaluate key design choices in our framework. This work sets a direction for more explainable and generalizable intention prediction in autonomous driving systems. We plan to release our codebase to support research.
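
The abstract describes folding visual cues such as optical flow and road semantics into the MLLM query so that the zero-shot model can name both the maneuver and the reason for it. The sketch below is a minimal illustration of that idea, not the paper's actual pipeline: optical flow is summarized with OpenCV and injected into a text prompt, and query_mllm is a hypothetical stand-in for whichever multimodal model is queried; the cue wording and thresholds are assumptions made for illustration.

import cv2
import numpy as np

def flow_cue(prev_frame: np.ndarray, frame: np.ndarray) -> str:
    """Summarize dominant lateral motion between two frames as a text cue."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mean_dx = float(flow[..., 0].mean())
    if mean_dx > 0.5:
        return "Optical flow: the scene shifts rightward, consistent with a leftward ego maneuver."
    if mean_dx < -0.5:
        return "Optical flow: the scene shifts leftward, consistent with a rightward ego maneuver."
    return "Optical flow: no strong lateral motion."

def build_prompt(motion_cue: str, road_cue: str) -> str:
    return (
        f"{motion_cue}\n{road_cue}\n"
        "Given the driving video, state the driver's intended maneuver (what) "
        "and explain the reason for it (why) in one sentence each."
    )

# Usage sketch (query_mllm is hypothetical and must be supplied by the caller):
# prompt = build_prompt(flow_cue(prev_frame, frame), "Road semantics: two lanes, dashed divider.")
# answer = query_mllm(video_frames, prompt)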

 

 Methodology

 


 

 Figure. Our proposed framework for the DIP task. DriveXplain generates natural language explanations alongside maneuvers, and Explanation Distillation distills these explanations into a single MLLM to enhance DIP performance at inference.

 

 

 

Key Highlights:

 

  • New Task: Understanding why a driver makes a decision is just as important as predicting what they’ll do next. 
  • DriveXplain Model: We introduce a zero-shot framework that enhances MLLMs for ADAS by embedding driving-specific context directly into their reasoning.
  • Knowledge Distillation: To enable real-time, deployable solutions, we distill reasoning and decision-making capabilities from large MLLMs into smaller, efficient models, paving the way for explainable driving intelligence (a minimal distillation sketch follows below).
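
The sketch below illustrates, under loose assumptions, how maneuver-explanation pairs produced by the large teacher could serve as supervised targets for a compact student. A text-only GPT-2 stands in for the compact MLLM, and the records are invented placeholders; the paper's actual student architecture, prompt format, and training recipe are not reproduced here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2")

teacher_outputs = [  # hypothetical maneuver-explanation pairs from the zero-shot teacher
    {"scene": "Ego vehicle approaches slower traffic; the adjacent left lane is clear.",
     "maneuver": "left lane change",
     "explanation": "a slower vehicle ahead and a free left lane make overtaking safe."},
]

optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
student.train()
for ex in teacher_outputs:
    prompt = f"Scene: {ex['scene']}\nPredict the maneuver and explain why.\n"
    target = f"Maneuver: {ex['maneuver']}. Why: {ex['explanation']}"
    enc = tokenizer(prompt + target, return_tensors="pt")
    labels = enc["input_ids"].clone()
    prompt_len = len(tokenizer(prompt)["input_ids"])
    labels[:, :prompt_len] = -100            # supervise only the maneuver+explanation span
    loss = student(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()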

 

 Results

 


 

 Table: DIP benchmark results. Performance comparison of Driving-specific VLMs, General VLMs, Action Anticipation models, and our framework (DriveXplain, ED). Accuracy (Acc.) and F1 (%) on the Brain4Cars, AIDE, and DAAD datasets. Finetune indicates whether the model was fine-tuned (✓) or evaluated in a zero-shot (✗) setting. Bold and underline indicate the best and second-best results.

 


 

 Figure: Qualitative comparison of the proposed framework, zero-shot Qwen2.5-VL, and Dolphins across the Brain4Cars, AIDE, and DAAD datasets. We show maneuver prediction (what) and explanation (why), with attention heatmaps highlighting key regions.

 

 

 

Citation

@inproceedings{vcbm2025daadx,
  author    = {Sainithin Artham and Avijit Dasgupta and Shankar Gangisetty and C. V. Jawahar},
  title     = {Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs},
  booktitle = {},
  series    = {},
  volume    = {},
  pages     = {},
  publisher = {},
  year      = {2025},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

 

IndicDLP : A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing 

ICDAR 2025 (Oral) — 🏆 Best Student Paper Runner-Up Award 

Oikantik Nath1, Sahithi Kukkala1, Mitesh Khapra1, Ravi Kiran Sarvadevabhatla2

IIT Madras1, IIIT Hyderabad2

[Paper]       [Code]     [Dataset]

 

 


(A) Acts & Rules  (B) Brochures   (C) Forms

 


(D) Magazines   (E) Manuals   (F) Newspapers

 

 


(G) Notices   (H) Novels   (I) Question Papers

 


(J) Research Papers   (K) Syllabi   (L) Textbooks

         

Samples from the IndicDLP dataset highlighting its diversity across document formats, domains, languages, and temporal span. For easier visual distinction, segmentation masks are used instead of bounding boxes to highlight regions.
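
The caption notes that regions are shown as segmentation-mask overlays rather than boxes. The sketch below shows one way such overlays can be rendered; the annotation schema (a JSON list of labeled polygons) is an assumption for illustration and not the released IndicDLP format.

import json
from PIL import Image, ImageDraw

PALETTE = {"text": (66, 135, 245, 90), "table": (245, 130, 48, 90),
           "figure": (60, 180, 75, 90)}  # RGBA fills; alpha < 255 keeps the page visible

def overlay_regions(page_path: str, annotation_path: str, out_path: str) -> None:
    """Draw semi-transparent region masks over a document page image."""
    page = Image.open(page_path).convert("RGBA")
    overlay = Image.new("RGBA", page.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    with open(annotation_path) as f:
        regions = json.load(f)  # e.g. [{"label": "text", "polygon": [[x, y], ...]}, ...]
    for region in regions:
        fill = PALETTE.get(region["label"], (128, 128, 128, 90))
        draw.polygon([tuple(p) for p in region["polygon"]], fill=fill)
    Image.alpha_composite(page, overlay).convert("RGB").save(out_path)

# overlay_regions("page.png", "page_regions.json", "page_overlay.png")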

 


 

 

 

The above figure illustrates the contributions of 12 languages (left) and 12 document domains (right) in the IndicDLP dataset. The distribution is fairly balanced across both categories, with no single language or domain overwhelmingly dominating the dataset. This ensures a diverse and well-represented collection.

 


Comparison of modern document layout parsing datasets.

 

Citation

Please cite our paper if you find this dataset or work useful:

@inproceedings{10.1007/978-3-032-04614-7_2,
  author    = {Oikantik Nath and Sahithi Kukkala and Mitesh Khapra and Ravi Kiran Sarvadevabhatla},
  editor    = {Xu-Cheng Yin and Dimosthenis Karatzas and Daniel Lopresti},
  title     = {IndicDLP: A Foundational Dataset for Multi-lingual and Multi-domain Document Layout Parsing},
  booktitle = {Document Analysis and Recognition -- ICDAR 2025},
  year      = {2026},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {23--39},
  isbn      = {978-3-032-04614-7},
  abstract  = {Document layout analysis is essential for downstream tasks such as information retrieval, extraction, OCR, and digitisation. However, existing large-scale datasets like PubLayNet and DocBank lack fine-grained region labels and multilingual diversity, making them insufficient for representing complex document layouts. Human-annotated datasets such as M$^6$Doc and D$^4$LA offer richer labels and greater domain diversity, but are too small to train robust models and lack adequate multilingual coverage. This gap is especially pronounced for Indic documents, which encompass diverse scripts yet remain underrepresented in current datasets, further limiting progress in this space. To address these shortcomings, we introduce IndicDLP, a large-scale foundational document layout dataset spanning 11 representative Indic languages alongside English and 12 common document domains. Additionally, we curate UED-mini, a dataset derived from DocLayNet and M$^6$Doc, to enhance pretraining and provide a solid foundation for Indic layout models. Our experiments demonstrate that fine-tuning existing English models on IndicDLP significantly boosts performance, validating its effectiveness. Moreover, models trained on IndicDLP generalise well beyond Indic layouts, making it a valuable resource for document digitisation. This work bridges gaps in scale, diversity, and annotation granularity, driving inclusive and efficient document understanding.}
}

Acknowledgments

Assamese
Yuvaraj - Superchecker
Rondeep Bordoloi - Reviewer

Ajit Kumar Sarma - Annotator
Anjali Steephan - Annotator
Madhutrishna Chetia - Annotator
Riya Chutia - Annotator
Ruh Ullah Khan - Annotator

Bengali
Praneeth Reddy - Superchecker
Rondeep Bordoloi - Reviewer

Gargi Mukherjee Kolley - Annotator
Madhumita Pal - Annotator
Priyanjana Banerjee - Annotator
Soupat Biswas - Annotator
Sushmita Pal - Annotator


 

 

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding

 

Varun Paturkar, Shankar Gangisetty, C V Jawahar

IIIT Hyderabad

 

[Paper] [Code] [Dataset] [Project Webpage]

 



 Fig: Comparison of traffic contexts and accident statistics across the Global North and South. Top row: four-wheelers dominate in the USA, while two-wheelers dominate in India. Bottom row: distribution of vehicles (two-wheeler vs. four-wheeler) and fatal accidents across the Global North and South.

Abstract

Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we present the MOtorized TwO-wheeler Rider (MOTOR) dataset, the first large-scale, multi-view, multimodal resource dedicated to two-wheelers in dense, unstructured traffic. MOTOR comprises 2,500 sequences (25+ hours) collected from 16 riders and integrates synchronized front, rear, and helmet videos, rider eye-gaze from wearable trackers, on-road audio, and telemetry (GPS, accelerometer, gyroscope). Rich annotations capture traffic context, rider state, 12 riding maneuvers spanning conventional and unconventional behaviors, and legality labels (Legal, Illegal, Unspecified). We benchmark rider behavior recognition and maneuver legality classification using state-of-the-art video action recognition backbones (CNN and Transformer-based), extended with multimodal fusion, and find that combining RGB, gaze, and telemetry consistently yields the best performance. MOTOR thus provides a unique foundation for advancing safety-critical understanding of two-wheeler riding. It offers the research community a benchmark to develop and evaluate models for behavior analysis, legality-aware prediction, and intelligent transportation systems. Dataset and code will be made publicly available.
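
The abstract reports that fusing RGB, gaze, and telemetry gives the best behavior-recognition results. The sketch below is one plausible late-fusion baseline under assumed input shapes (a 3D-CNN clip encoder, GRUs over the gaze and telemetry tracks, and a linear head over the 12 maneuver classes); it is not the benchmarked architecture from the paper, and the telemetry dimensionality is an assumption.

import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class LateFusionRiderClassifier(nn.Module):
    def __init__(self, num_classes: int = 12, gaze_dim: int = 2, telemetry_dim: int = 9):
        super().__init__()
        self.rgb_backbone = r3d_18(weights=None)   # 3D CNN over the clip
        self.rgb_backbone.fc = nn.Identity()       # expose the 512-d clip feature
        self.gaze_encoder = nn.GRU(gaze_dim, 64, batch_first=True)        # eye-gaze track
        self.telemetry_encoder = nn.GRU(telemetry_dim, 64, batch_first=True)  # GPS/IMU track
        self.head = nn.Linear(512 + 64 + 64, num_classes)

    def forward(self, clip, gaze, telemetry):
        rgb_feat = self.rgb_backbone(clip)         # (B, 512)
        _, gaze_h = self.gaze_encoder(gaze)        # (1, B, 64)
        _, tel_h = self.telemetry_encoder(telemetry)
        fused = torch.cat([rgb_feat, gaze_h[-1], tel_h[-1]], dim=1)
        return self.head(fused)

if __name__ == "__main__":
    model = LateFusionRiderClassifier()
    clip = torch.randn(2, 3, 16, 112, 112)         # batch of 16-frame RGB clips
    gaze = torch.randn(2, 16, 2)                   # per-frame (x, y) gaze points
    telemetry = torch.randn(2, 16, 9)              # assumed GPS + accelerometer + gyroscope channels
    print(model(clip, gaze, telemetry).shape)      # -> torch.Size([2, 12])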

 

 The MOTOR Dataset

 Table: Comparison of 4-wheeler and 2-wheeler behavior datasets. Our dataset is unique as it contains multi-modal, multi-view videos from ego-vehicle and helmet, eye gaze, as well as annotated conventional and unconventional behaviors, and legality-related riding scenarios. Note: CRB indicates conventional riding behaviors, and UCRB means unconventional riding behaviors.


 


 Fig: Data samples helmet-view. (a) Ego-rider weaves through dense, slow traffic, overtaking multiple vehicles across lanes. (b) Rider squeezes through a narrow gap between a bus and a car, narrowly avoiding the bus. (c) Rider rides in the wrong lane against dense oncoming traffic, disrupting flow. (d) Rider turns head fully toward a roadside building, diverting gaze from the road amid fast-moving traffic.

 

 Results

 Table: Rider Behavior Classification. Comparison of CNN and Transformer-based baselines on MOTOR dataset across different modality combinations.


 

 Table: Rider Legality Classification: CNN and Transformer-based baselines on MOTOR dataset across different modality combinations.


 

Citation

@inproceedings{motor2026icra,
  author    = {Varun Paturkar and Shankar Gangisetty and C. V. Jawahar},
  title     = {MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behaviour Understanding},
  booktitle = {},
  series    = {},
  volume    = {},
  pages     = {},
  publisher = {},
  year      = {2026},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

PedestrianQA : A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

 

Naman Mishra, Shankar Gangisetty, C V Jawahar

IIIT Hyderabad

 

[Paper] [Code] [Dataset]

 



 Fig: An illustration of an unstructured-traffic scenario where a pedestrian stands in the middle of the road in front of the ego-vehicle, attempting to cross the road. Unlike prior approaches that provide only predictions, our method predicts the intention and trajectory and generates supporting rationales.

Abstract

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision-language models (VLMs) offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as a question-answering task augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions, without needing specialized architectures tailored to each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling. The dataset and model will be made publicly available.
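
The abstract frames intention and trajectory prediction as question answering with structured rationales. The record below is a purely hypothetical illustration of what such a QA-plus-rationale sample could look like; the field names and wording are invented and do not reflect the released PedestrianQA schema.

import json

sample = {
    "video_id": "seq_000123",
    "question": "Will the pedestrian standing near the ego lane cross within the next "
                "2 seconds, and what trajectory will they follow?",
    "answer": {
        "intention": "crossing",
        "trajectory": [[412, 310], [398, 312], [381, 315], [362, 318]],  # future (x, y) pixels
    },
    "rationale": "The pedestrian faces the ego lane, has stepped off the curb, and traffic "
                 "on the near side has slowed, so a crossing toward the median is likely.",
}

print(json.dumps(sample, indent=2))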

 

 PedestrianQA Dataset

 


 Fig: Data generation pipeline. We first aggregate all ground-truth human annotations from the constituent datasets into a unified metadata schema. We then use VLM-generated captions to enrich motion semantics, using carefully designed pedestrian-motion prompts that target fine-grained cues. These captions are validated for format and appended to the metadata. Next, we construct a single instruction package containing: (i) a system prompt, (ii) task definitions for PIP and PTP, (iii) step-by-step guidance for producing structured, fine-grained rationales, (iv) a small set of in-context exemplars, (v) a compliance checklist for high-quality rationale generation, and (vi) the sequence-level metadata tables. This package is provided to the claude-sonnet-4-20250514 LLM API to generate triplets of questions, answers, and rationales.
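
The caption enumerates six components that are packed into a single instruction package and sent to claude-sonnet-4-20250514. The sketch below shows one way that assembly and API call could look using the Anthropic Python SDK; everything other than the model identifier (component contents, field names, token budget) is an assumption made for illustration.

import json
import anthropic

def build_user_package(task_defs, rationale_guide, exemplars, checklist, metadata_rows):
    """Concatenate components (ii)-(vi) from the figure caption into one user message."""
    return "\n\n".join([
        task_defs,                            # (ii) PIP and PTP task definitions
        rationale_guide,                      # (iii) step-by-step rationale guidance
        exemplars,                            # (iv) in-context exemplars
        checklist,                            # (v) compliance checklist
        json.dumps(metadata_rows, indent=2),  # (vi) sequence-level metadata tables
    ])

def generate_qa_triplets(system_prompt, user_package):
    """Send the package to the LLM; component (i) is supplied as the system prompt."""
    client = anthropic.Anthropic()            # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=system_prompt,
        messages=[{"role": "user", "content": user_package}],
    )
    return response.content[0].text           # expected question/answer/rationale triplets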

 

 Results


 

 Table: Rationale evaluation on the combined dataset, with Claude-Sonnet-4. Average scores (0–100) for Spatial Reasoning (SR), Temporal Reasoning (TR), Mathematical Reasoning (MR), Ego-Vehicle Reasoning (EVR), Scene-Context Reasoning (SCR), Final Destination Prediction (FDP), and Conclusion (C). ✓ indicates finetuned models, ✗ zero-shot. Bold shows best score per column; underline marks the second-best. Dolphins generates only a brief conclusion and does not generate category-specific rationales.   


Citation

@inproceedings{pedqa2026icra,
  author    = {Naman Mishra and Shankar Gangisetty and C. V. Jawahar},
  title     = {PedestrianQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction},
  booktitle = {},
  series    = {},
  volume    = {},
  pages     = {},
  publisher = {},
  year      = {2026},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

 

Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta,
C V Jawahar

IIIT Hyderabad

 

[Paper] [Code] [Dataset]

 



 Fig: Previous works in driving scenarios primarily address risk perception but fall short of offering actionable safety guidance. Similarly, general-purpose MLLMs (Qwen2.5-VL, LLaVA-NeXT, VideoLLaMA3) are still unreliable in this regard. In contrast, our approach, DriveSafe, integrates risk assessment with clear, human-understandable safety suggestions.

Abstract

Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. We plan to release our codebase to support future research.
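
The abstract mentions fine-tuning a lightweight adapter module on caption-risk pairings. The sketch below is a minimal illustration of adapter-style tuning on such pairs; the abstract does not name the adapter type, so LoRA via the peft library, the base checkpoint, the example pair, and the hyperparameters are assumptions made for illustration, not the authors' configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-0.5B"                        # hypothetical small base LLM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)                # only adapter weights are trainable

pairs = [  # invented caption-risk examples standing in for the real training data
    ("Caption: a cyclist merges from the right 8 m ahead while ego moves at 30 km/h.",
     "Risk: cyclist ahead-right, cutting in. Suggestion: slow down and yield."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for caption, risk in pairs:
    enc = tokenizer(caption + "\n" + risk, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"])    # causal LM loss over caption+risk text
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()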

 

 Methodology


 Fig: Our proposed DriveSafe framework for the caption generation and safety suggestion task in driving. We first derive contextual cues to guide caption generation, and then use the resulting captions for risk assessment and safety suggestion.

 

 Results


 Table: Performance comparison of Caption Generation and Risky Object Grounding across Existing Methods, General VLMs, and DriveSafe on the DRAMA dataset. 

 


 Fig: Qualitative comparison of DriveSafe-ZeroShot, DriveSafe-Finetuned, and Qwen2.5-VL on three driving scenarios from the DRAMA dataset. Risky object grounding is shown with bounding boxes matched to the respective models' text highlighting, while generated captions and safety suggestions are marked as correct (green) or incorrect (red).

 

Citation

@inproceedings{drivesafe2026icra,
  author    = {Sainithin Artham and Shankar Gangisetty and Avijit Dasgupta and C. V. Jawahar},
  title     = {DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios},
  booktitle = {},
  series    = {},
  volume    = {},
  pages     = {},
  publisher = {},
  year      = {2026},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 
