
RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives

 

Chirag Parikh, Deepti Rawat, Rakshitha R. T., Tathagata Ghosh, Ravi Kiran Sarvadevabhatla

iHub-Data | CVIT | IIIT Hyderabad

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

 

[Paper] [arXiv] [Dataset] [Code] [Checkpoints]

 

 


What is RoadSocial?

RoadSocial is a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives. It differentiates itself from existing datasets by capturing the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. RoadSocial highlights:

  • 14M frames, 414K social comments
  • 13.2K videos (7.9K minutes)
  • 674 unique video tags (total 100K+)
  • 260K high-quality socially-informed QA pairs
  • Scalable QA generation pipeline using social video narratives
  • 12 challenging video QA tasks for generic road event understanding
  • New tasks to test robustness of Video LLMs to hallucination
  • Improves generic road event understanding capability of Video LLMs
  • Critical insights into the zero-shot capabilities of 18 Video LLMs

 

Dataset Examples

 

 

Dataset Statistics

 

 

VideoQA Leaderboard

GPT-3.5 score (scale: 0-100) is reported for all tasks except Temporal Grounding (TG). Overall scores are reported for ALL QA tasks, Road-event Tasks (RT), Generic QAs, and Specific QAs.

Abbreviations: F (Factual), C (Complex), I (Imaginative), and H (Hallucination) denote the task groups; WR, KE, VP, DS, WY, CQ, TG, AD, IN, CF, AV, and IC denote the 12 QA tasks.
# | Model | Params | WR | KE | VP | DS | WY | CQ | TG | AD | IN | CF | AV | IC | Overall (ALL) | Overall (RT) | Overall (Generic) | Overall (Specific)
(The twelve task columns belong to the Factual, Complex, Imaginative, and Hallucination groups; the last four columns are overall scores.)
1 Dolphin 9B 61.3 34.5 67.8 35.8 25.2 37.2 0.01 49.8 39.1 45.5 71.8 21.3 40.8 42.5 29.8 46.5
2 GPT-4o - 77.0 66.6 84.3 70.2 70.8 72.1 7.8 77.7 76.4 77.0 90.0 67.6 69.8 70.0 69.5 74.4
3 Gemini-1.5-Pro - 77.7 56.7 85.4 61.9 61.4 60.1 18.6 72.1 70.2 75.7 72.3 48.7 63.4 64.7 60.1 68.3
4 InternVL2 76B 72.4 51.3 81.4 57.1 59.0 62.1 1.07 70.5 67.0 69.2 58.6 27.6 56.4 59.1 55.5 65.1
5 Qwen2-VL 72B 76.6 56.6 85.1 60.2 64.0 67.6 0.01 71.9 72.4 71.6 37.0 40.2 58.6 60.3 58.3 68.8
6 LLaVA-Video 72B 75.8 52.4 76.8 52.4 55.0 52.2 9.94 68.3 63.7 64.9 83.5 24.7 56.7 59.6 51.1 63.3
7 LLaVA-OV 72B 75.1 54.1 78.7 53.0 53.3 54.1 3.99 67.8 61.9 63.1 45.1 19.9 52.5 55.5 51.8 63.0
8 VITA 8x7B 66.6 52.1 71.6 48.1 55.6 56.3 2.27 66.7 66.0 62.4 56.3 22.0 52.2 54.9 49.8 60.4
9 Tarsier 34B 73.7 58.1 78.2 58.2 59.0 58.8 0.32 71.6 71.1 67.4 83.2 82.3 63.5 61.8 58.4 66.1
10 ARIA 25.3B 75.4 53.1 86.2 58.4 56.9 70.2 8.96 75.1 74.7 74.0 86.4 29.2 62.4 65.4 56.7 68.5
11 InternVL2 8B 67.7 51.7 78.0 55.7 59.3 60.9 0.77 66.7 66.8 70.0 68.1 26.1 56.0 58.7 53.7 64.0
12 Mini-CPM-V 2.6 8B 77.7 57.6 80.6 55.0 50.5 57.5 0.4 61.6 52.3 59.3 73.5 30.0 54.7 56.9 51.0 62.0
13 IXC-2.5 7B 78.5 58.7 85.4 61.7 65.3 68.5 0.69 73.9 75.6 75.7 85.8 29.2 63.3 66.4 60.7 70.3
14 Tarsier 7B 69.9 54.7 72.3 52.0 53.4 55.2 0.11 69.5 69.3 63.5 79.1 67.3 58.9 58.1 54.0 61.7
15 LongVU 7B 73.0 53.0 76.3 51.1 50.2 55.0 0.84 59.7 55.8 58.2 48.9 32.7 51.2 52.9 47.7 59.7
16 Qwen2-VL 7B 75.5 52.8 76.1 52.7 57.7 56.4 0.59 69.2 71.6 65.9 37.5 39.6 54.6 56.0 52.6 63.9
17 LLaVA-Video 7B 74.6 50.1 76.7 52.1 50.1 50.3 1.43 60.4 53.8 58.7 61.8 23.5 51.1 53.6 47.6 59.7
18 LLaVA-OV 7B 73.4 51.2 77.2 50.7 51.7 51.2 0.97 62.8 55.4 58.6 45.4 21.1 50.0 52.6 48.4 59.8
19 LLaVA-OV ft. 7B 80.9 64.1 85.7 64.1 68.7 65.1 4.49 74.2 70.9 71.7 95.4 87.6 69.4 67.8 65.1 69.7
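The table above reports GPT-3.5-judged scores on a 0-100 scale. As a rough illustration of how such an LLM-as-judge protocol can be implemented, the sketch below rates a predicted answer against the reference with a judge model; the prompt wording, the judge_answer helper, and the use of the OpenAI chat API are illustrative assumptions, not the authors' exact evaluation script.

# Minimal sketch of an LLM-as-judge scorer in the spirit of the protocol above.
# Assumes the `openai` Python package (>= 1.0) and an OPENAI_API_KEY in the environment.
import re
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, reference: str, prediction: str) -> float:
    """Ask a judge model to rate a predicted answer against the reference on a 0-100 scale."""
    prompt = (
        "You are grading answers to a question about a road-event video.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Rate the predicted answer for correctness and completeness on a scale of 0 to 100. "
        "Reply with only the number."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    match = re.search(r"\d+(\.\d+)?", resp.choices[0].message.content)
    return float(match.group()) if match else 0.0

# Per-task leaderboard scores would then be averages of judge_answer over that task's QA pairs.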

Submit your results: contact the authors by email.

 

Video Presentation

 
References: InternVL2 [4]; MM-AU [5]; VITA [6]; BDD-X [8]; LLaVA-OV [10]; ARIA [11]; Dolphin [13]; DRAMA [15]; LingoQA [16]; GPT-4o [18]; Rank2Tell [24]; LongVU [25]; DriveLM [26]; ROAD [27]; Gemini-1.5-Pro [28]; Tarsier [30]; Qwen2-VL [31]; SUTD-TrafficQA [32]; BDD-OIA [33]; Mini-CPM-V [35]; IXC-2.5 [36]; LLaVA-Video [37]

 

Citation

@misc{parikh2025roadsocialdiversevideoqadataset,
title = {RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives},
author = {Chirag Parikh and Deepti Rawat and Rakshitha R. T. and Tathagata Ghosh and Ravi Kiran Sarvadevabhatla},
year = {2025},
eprint = {2503.21459},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2503.21459},
}

 

 

Enhancing Driving Visibility via Semantic-Guided Knowledge Distillation Framework for Adverse Weather Removal

 

Hanvitha Saraswathi Mukkamala, Shankar Gangisetty, Ananya Kulkarni,

Veera Ganesh Yalla, C V Jawahar

IIIT Hyderabad

 

[Paper] [Code]

 


weatherremoval1

 Figure 1. Illustration of our framework on an ADAS vehicle in a rainy driving scenario for adverse weather removal.

Abstract

Adverse weather such as rain, haze, and low light severely degrades visual perception in Advanced Driver Assistance Systems (ADAS) and autonomous driving, leading to degraded scene understanding and increased safety risks. We propose a unified, semantic-guided knowledge distillation restoration framework that addresses multi-weather removal while preserving semantics. Our method employs a semantic-guided dual-decoder architecture trained via two-stage multi-teacher knowledge distillation, transferring expertise from multiple high-capacity models into a lightweight student model. Segmentation-aware contrastive learning further aligns low-level restoration with high-level semantic structure, enabling robust detection of roads, vehicles, and pedestrians under challenging conditions. Trained on a mix of synthetic and real-world data with segmentation-guided feature refinement, our framework generalises effectively to real-world unseen environments. Extensive experiments on multiple benchmarks show competitive or superior performance to state-of-the-art methods, with real-time inference suitable for edge deployment. This makes our approach well-suited for safety-critical perception in autonomous and semi-autonomous systems operating in adverse outdoor environments.

 Methodology

Our method builds a unified lightweight student model for adverse weather removal by distilling knowledge from multiple weather-specific teacher networks. The key idea is to combine image restoration with semantic guidance, so the model not only improves visibility but also preserves important scene structures such as roads, vehicles, and pedestrians.

weatherremoval2

 

 Figure 2. Overview of our semantic-guided knowledge distillation framework. Collaborative distillation from multiple weather-specific teachers (rain, haze) to a unified student proceeds in two stages: (1) Knowledge Collation, which softly aligns the student with teacher reconstructions and segmentation priors, and (2) Knowledge Examination, which enforces ground-truth consistency with hard constraints. Segmentation maps provide region-aware supervision, emphasizing critical structures like roads and vehicles.

At inference time, only the lightweight student model is used. This makes the framework efficient, memory-friendly, and suitable for real-time deployment without requiring any teacher model during testing.
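For a concrete picture of the two training stages, the minimal PyTorch sketch below combines soft alignment to weather-specific teacher outputs (Knowledge Collation) with hard ground-truth supervision (Knowledge Examination), weighted by a segmentation-derived mask. The function names, L1 distance, loss weights, and region-weighting scheme are assumptions for illustration, not the exact published formulation.

# Minimal PyTorch sketch of the two-stage, multi-teacher distillation idea (assumed details).
import torch

def collation_loss(student_out, teacher_outs, seg_weight):
    """Stage 1 (Knowledge Collation): softly align the student with each weather-specific teacher."""
    loss = 0.0
    for t_out in teacher_outs:                      # e.g. outputs of derain and dehaze teachers
        loss = loss + (seg_weight * (student_out - t_out.detach()).abs()).mean()
    return loss / len(teacher_outs)

def examination_loss(student_out, clean_gt, seg_weight):
    """Stage 2 (Knowledge Examination): enforce hard consistency with the clean ground truth."""
    return (seg_weight * (student_out - clean_gt).abs()).mean()

def total_loss(student_out, teacher_outs, clean_gt, seg_mask, alpha=0.5):
    # Segmentation maps provide region-aware supervision: pixels on roads, vehicles, and
    # pedestrians (seg_mask == 1) are weighted more heavily than the background.
    seg_weight = 1.0 + seg_mask
    return alpha * collation_loss(student_out, teacher_outs, seg_weight) \
        + (1.0 - alpha) * examination_loss(student_out, clean_gt, seg_weight)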

 

QUANTITATIVE RESULTS

  Our method achieves strong performance across both synthetic and real-world adverse weather datasets, showing consistent improvements in restoration quality while preserving semantic structure. 

weatherremoval3

 Table 1: Quantitative evaluation of image dehazing performance on synthetic and real datasets.

 

weatherremoval4

 Table 2: Quantitative evaluation of image deraining performance on synthetic and real datasets.

 

weatherremoval5

 Table 3: Quantitative comparison on real-world derained and dehazed images from IDD-AW using NIQE and BRISQUE.

 

QUALITATIVE RESULTS

weatherremoval6

 Figure 3: Visual comparison of the proposed and existing methods on real datasets (Raindrop, SPA, O-HAZE) for multi-weather restoration. The image can be zoomed in for improved visualization.

 

weatherremoval7

 Figure 4: Visual comparison of the proposed and existing methods on synthetic datasets (Outdoor-Rain, RESIDE) for multi-weather restoration. The image can be zoomed in for improved visualization.

 
weatherremoval8

 Figure 5: Qualitative comparison of real-world rain and haze images with low-light from IDD-AW dataset. 

 

Citation


@inproceedings{Enhancing2025icvgip,
author = {Hanvitha Saraswathi Mukkamala and Shankar Gangisetty and Ananya Kulkarni and Veera Ganesh Yalla and C. V. Jawahar},
title = {Enhancing Driving Visibility via Semantic-Guided Knowledge Distillation Framework for Adverse Weather Removal},
booktitle = {},
series = {},
volume = {},
pages = {},
publisher = {},
year = {2025},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback

 

Mohd Hozaifa Khan, Ravi Kiran Sarvadevabhatla

IIIT Hyderabad | CVIT

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

 

[Paper] [CVPR Paper] [Dataset (WIP)] [Demo]

Audio Overview

 

 

 

 


Download Audio

 

Abstract

 

We introduce Sketchtopia, a large-scale dataset and AI framework designed to explore goal-driven, multimodal communication through asynchronous interactions in a Pictionary-inspired setup. Sketchtopia captures natural human interactions, including freehand sketches, open-ended guesses, and iconic feedback gestures, showcasing the complex dynamics of cooperative communication under constraints. It features over 20K gameplay sessions from 916 players, capturing 263K sketches, 10K erases, 56K guesses, and 19.4K iconic feedback gestures.

We introduce multimodal foundational agents with capabilities for generative sketching, guess generation and asynchronous communication. Our dataset also includes 800 human-agent sessions for benchmarking the agents. We introduce novel metrics to characterize collaborative success, responsiveness to feedback and inter-agent asynchronous communication. Sketchtopia pushes the boundaries of multimodal AI, establishing a new benchmark for studying asynchronous, goal-oriented interactions between humans and AI agents.

 

Key Contributions

 

Rich Dataset

Large-scale, multimodal data capturing real-world asynchronous sketching dynamics.

Foundational Agents

DRAWBOT & GUESSBOT designed for asynchronous interaction.

New Metrics

Metrics like AAO, FRS, MATS for evaluation.

 

 

Dataset Highlights: Multimodal & Asynchronous

20K+

Sessions

Rich collection capturing diverse human Pictionary gameplay.

263K+

Sketches

Massive corpus of iterative freehand drawings for visual communication.

56K+

Open-ended Guesses

Natural language guesses reflecting understanding of visual cues.

19K+

Iconic Feedback

Non-verbal cues (👍👎❓) guiding the collaborative process asynchronously.

916

Players

Data from a diverse participant group ensuring robust analysis.

800

Human-Agent Sessions

Valuable data from humans interacting with our agents.

 

Sketchtopia Agents

ACTIONDECIDER: The Asynchronous Controller

The ActionDecider is the core component that enables asynchronous communication. It acts as a lightweight controller, continuously monitoring the game state (sketches, guesses, feedback) and deciding when agents should act and what action they should take. This allows for fluid, human-like interaction without the constraints of turn-taking, mirroring real-world communication dynamics.

ActionDecider: The Brains Behind Asynchronous Interaction

Multi-Modality Sketchtopia Agent Diagram
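A minimal sketch of what such an asynchronous controller policy might look like is given below; the GameState fields, action names, and decision rules are illustrative assumptions rather than the actual ActionDecider implementation.

# Illustrative sketch of an ActionDecider-style policy (assumed rules, not the released code).
from dataclasses import dataclass, field

@dataclass
class GameState:
    strokes: list = field(default_factory=list)     # sketch strokes currently on the canvas
    guesses: list = field(default_factory=list)     # natural-language guesses so far
    feedback: list = field(default_factory=list)    # iconic feedback events ("up", "down", "question")
    solved: bool = False

def decide_action(state: GameState, guesses_seen: int) -> str:
    """Monitor the shared game state and decide what this agent should do next, if anything."""
    if state.solved:
        return "stop"
    if len(state.guesses) > guesses_seen:
        return "give_feedback"                      # a new guess arrived: react to it
    if state.feedback and state.feedback[-1] in {"down", "question"}:
        return "refine_sketch"                      # negative/confused feedback: redraw
    if not state.strokes:
        return "draw"                               # empty canvas: start sketching
    return "idle"                                   # nothing new: do not act (no forced turn-taking)

# The controller polls decide_action continuously, so sketching, guessing, and feedback
# can interleave asynchronously instead of alternating in fixed turns.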

DRAWBOT: The Sketcher

DRAWBOT visually communicates the target word through asynchronous sketching, leveraging state-of-the-art generative models fine-tuned for iterative refinement based on communication context.

  • Generates sketches from target concepts, adapting to canvas state.
  • Refines drawings using feedback signals.
  • Adapts to 👍 👎 ❓.
  • Operates asynchronously, deciding when to draw or stay idle.

Drawbot Architecture

Multimodal Architecture

GUESSBOT: The Guesser

GUESSBOT interprets sketches and generates intelligent guesses, using a retrieval-based framework informed by historical interaction data.

  • Interprets sketch canvas content using vision models.
  • Generates relevant guesses using efficient retrieval and filtering.
  • Acts asynchronously, deciding when new information warrants a guess.

Guesserbot Architecture

Multimodal Architecture
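A rough sketch of a retrieval-based guesser in this spirit is shown below: a vision encoder embeds the current canvas, and the nearest past sketches vote on the guess. The embed placeholder, similarity threshold, and voting scheme are assumptions, not GUESSBOT's actual pipeline.

# Illustrative retrieval-based guessing sketch (assumed components, not the released GUESSBOT).
import numpy as np

def embed(image) -> np.ndarray:
    """Placeholder for a vision encoder (e.g. a CLIP-style sketch/image embedder)."""
    raise NotImplementedError

def guess_from_canvas(canvas, memory_embeddings: np.ndarray, memory_labels: list,
                      top_k: int = 5, min_sim: float = 0.25):
    """Retrieve the most similar past sketches and let their target words vote on a guess."""
    query = embed(canvas)
    query = query / np.linalg.norm(query)
    sims = memory_embeddings @ query                # memory rows assumed L2-normalized
    votes = {}
    for idx in np.argsort(-sims)[:top_k]:
        if sims[idx] < min_sim:
            continue                                # weak match: ignore
        label = memory_labels[idx]
        votes[label] = votes.get(label, 0.0) + float(sims[idx])
    # Returning None models the asynchronous choice to stay silent until the canvas is clearer.
    return max(votes, key=votes.get) if votes else None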

 

Evaluating Agent Performance

 

🔀
AAO

Asynchronous Action Overlap

Measures concurrent actions between agents. AAO values close to those of humans suggest more natural, human-like interaction dynamics (a simple overlap computation is sketched after these metric descriptions).

💬
FRS

Feedback Responsiveness Score

Quantifies how effectively agents adapt to feedback (👍👎) and move toward the goal.

⏳
MATS

Multimodal Action Timing Similarity

Compares agent action timing patterns with human interactions to assess the naturalness of pacing.
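For illustration only, the snippet below computes a simple action-overlap fraction between two agents' activity timelines; the paper's exact definitions of AAO, FRS, and MATS may differ.

# Illustrative computation in the spirit of AAO (assumed definition; see the paper for the exact metric).
def action_overlap(intervals_a, intervals_b):
    """Fraction of combined active time during which both agents act concurrently.
    Each argument is a list of (start, end) activity intervals in seconds."""
    overlap = 0.0
    for sa, ea in intervals_a:
        for sb, eb in intervals_b:
            overlap += max(0.0, min(ea, eb) - max(sa, sb))
    total_a = sum(e - s for s, e in intervals_a)
    total_b = sum(e - s for s, e in intervals_b)
    union = total_a + total_b - overlap             # union of the two agents' active time
    return overlap / union if union > 0 else 0.0

# Example: drawer active during 0-5s and 8-10s, guesser active during 4-9s -> 0.2
print(action_overlap([(0.0, 5.0), (8.0, 10.0)], [(4.0, 9.0)]))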

 

Example Sessions

 

Target: ANGRY

Type: Human-Human

Key Guess: "Angry"

Feedback Given: 👍

Successful communication: the guesser identified the correct emotion from the sketch and feedback.

Target: ANGRY

Type: Human-Human

Key Guess: afraid, mute

Feedback Given: ❓

Failed communication: Guesser failed to guess the correct emotion despite the sketch and feedback.

Target: DUSTBIN

Type: Human-Human

Key Guess: "Dustbin"

Feedback Given: No feedback

Successful communication: the guesser identified the correct target word from the sketch and feedback.

Target: DUSTBIN

Type: Human-Human

Key Guess: failed guesses ("face", etc.)

Feedback Given: 👎

Failed communication: Guesser failed to guess the correct target word despite the sketch and feedback.

Target: WALK

Type: Human-Agent

Key Guess: "Walk"

Feedback Given: No Feedback

Successful communication: the guesser identified the correct target word from the sketch and feedback.

Target: WALK

Type: Human-Agent

Key Guess: failed guesses ("man", "run", etc.)

Feedback Given: 👎,👍

Failed communication: Guesser failed to guess the correct target word despite the sketch and feedback.


Authors

👤
Mohd Hozaifa Khan

IIIT Hyderabad

👤
Ravi Kiran Sarvadevabhatla

IIIT Hyderabad

Resources

Interactive Demo - Coming Soon!

Stay tuned for a live demo where you can experience Sketchtopia agents interacting.

In the meantime, explore the Dataset

 

Citation

@inproceedings{khan2025sketchtopia,
author = {Mohd Hozaifa Khan and Ravi Kiran Sarvadevabhatla},
title = {Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
series = {},
volume = {},
pages = {},
publisher = {},
year = {2025},
url = {https://sketchtopia25.github.io/},
}
 

Intellectual Property Notice

This work is the subject of a patent application filed in [India / under PCT] and is protected under applicable intellectual property laws. All rights to the underlying technology, including the AI agents for drawing and guessing in a Pictionary-like setting, are reserved. The system is currently under active research and development. Any use, reproduction, or commercial exploitation of this work or its components without prior written consent is prohibited.

Patent Application Status: Patent Pending.

 

 

 

IndicDLP: A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing

ICDAR 2025 (Oral) — 🏆 Best Student Paper Runner-Up Award 

Oikantik Nath1, Sahithi Kukkala2, Mitesh Khapra1, Ravi Kiran Sarvadevabhatla2

IIIT Madras1, IIIT Hyderabad2

[Paper]       [Code]     [Dataset]

 

Indic1

 

Indic2

 

Indic3

 

Indic4

 

 

Samples from the IndicDLP dataset highlighting its diversity across document formats, domains, languages, and temporal span. For easier visual differentiation, segmentation masks are used instead of bounding boxes to highlight regions.

 

dataset1

 

 

 

The above figure illustrates the contributions of 12 languages (left) and 12 document domains (right) in the IndicDLP dataset. The distribution is fairly balanced across both categories, with no single language or domain overwhelmingly dominating the dataset. This ensures a diverse and well-represented collection.

 

dataset2

Comparison of modern document layout parsing datasets.

 

Citation

Please cite our paper if you find this dataset or work useful:

@inproceedings{10.1007/978-3-032-04614-7_2,
  author       = {Oikantik Nath and Sahithi Kukkala and Mitesh Khapra and Ravi Kiran Sarvadevabhatla},
  editor       = {Xu-Cheng Yin and Dimosthenis Karatzas and Daniel Lopresti},
  title        = {IndicDLP: A Foundational Dataset for Multi-lingual and Multi-domain Document Layout Parsing},
  booktitle    = {Document Analysis and Recognition -- ICDAR 2025},
  year         = {2026},
  publisher    = {Springer Nature Switzerland},
  address      = {Cham},
  pages        = {23--39},
  isbn         = {978-3-032-04614-7},
  abstract     = {Document layout analysis is essential for downstream tasks such as information retrieval,
extraction, OCR, and digitisation. However, existing large-scale datasets like PubLayNet and DocBank lack
fine-grained region labels and multilingual diversity, making them insufficient for representing complex document
layouts. Human-annotated datasets such as M6Doc and D4LA offer richer labels and greater domain diversity,
but are too small to train robust models and lack adequate multilingual coverage. This gap is especially
pronounced for Indic documents, which encompass diverse scripts yet remain underrepresented in current datasets,
further limiting progress in this space. To address these shortcomings, we introduce IndicDLP, a large-scale
foundational document layout dataset spanning 11 representative Indic languages alongside English and 12 common
document domains. Additionally, we curate UED-mini, a dataset derived from DocLayNet and M6Doc, to enhance
pretraining and provide a solid foundation for Indic layout models. Our experiments demonstrate that fine-tuning
existing English models on IndicDLP significantly boosts performance, validating its effectiveness. Moreover,
models trained on IndicDLP generalise well beyond Indic layouts, making it a valuable resource for document
digitisation. This work bridges gaps in scale, diversity, and annotation granularity, driving inclusive and
efficient document understanding.}
}

Acknowledgments

Assamese
Yuvaraj - Superchecker
Rondeep Bordoloi - Reviewer

Ajit Kumar Sarma - Annotator
Anjali Steephan - Annotator
Madhutrishna Chetia - Annotator
Riya Chutia - Annotator
Ruh Ullah Khan - Annotator

Bengali
Praneeth Reddy - Superchecker
Rondeep Bordoloi - Reviewer

Gargi Mukherjee Kolley - Annotator
Madhumita Pal - Annotator
Priyanjana Banerjee - Annotator
Soupat Biswas - Annotator
Sushmita Pal - Annotator

English
Hemavardhini R - Superchecker
Yuvaraj - Superchecker
Ragavan S - Reviewer

Ghiridharan M G - Annotator
Munish Mangla - Annotator
Rubeena - Annotator
Vidhya J G - Annotator


Gujarati
Praneeth Reddy - Superchecker
Kaniz Fatema - Reviewer

Bhargav Bhatt - Annotator
Kinjal Joshi - Annotator
Naman Mehta - Annotator
Parth B - Annotator
Parthiv Makwana - Annotator
Shreya Parmar - Annotator
Vama Soni - Annotator

Hindi
Hemavardhini R - Superchecker
Puru Koli - Reviewer

Adiba Khan - Annotator
Anima Chetry - Annotator
Arati Giri - Annotator
Ashish Kumar Jha - Annotator
Bhakti Rai - Annotator
Furtengi Sherpa - Annotator
Keshav Prasad Sapkota - Annotator
Nilesh lagade - Annotator
Rushaid Abbas - Annotator

 

Kannada
Hemavardhini R - Superchecker
Ragavan S - Reviewer
Ramya - Reviewer
Sreejanani Sanke - Reviewer

Charulatha S - Annotator
Nandini Vijay - Annotator
Rajeshwari Lakkannavar - Annotator
Suma Girish - Annotator
Vidya Kulkarni - Annotator
Virat Kumar Pandey - Annotator


Malayalam
Neha Bandekar - Superchecker
Ramya - Reviewer
Swetha - Reviewer

ABHINAV P M - Annotator
Amal I C - Annotator
Nadha rashada S V - Annotator
SANJAY.R - Annotator
Sreelekshmi S - Annotator

Marathi
Neha Bandekar - Superchecker
Nikita Digraskar - Reviewer

Manjunath Renake - Annotator
Nitin Paranjape - Annotator
Sachin Deepak Londhe - Annotator
Tejas Vishnupant Akhare - Annotator

Odia
Neha Bandekar - Superchecker
Harihara Barik - Reviewer

Lalatendu Bidyadhar Das - Annotator
Rajat Kumar patra - Annotator
Satyabrat Badajena - Annotator
Sradhanjali Pradhan - Annotator


Punjabi
Yuvaraj - Superchecker
Saranpal Singh - Reviewer

HarvinderSingh GurmeetSingh Ragi - Annotator
Inderpreet - Annotator
Jaydeep Singh Shahu - Annotator
Lovepreet Singh - Annotator
Niharika Khanna - Annotator
Sukhpreet Kaur - Annotator

Tamil
Hemavardhini R - Superchecker
Swetha - Reviewer

Bensha Joyson - Annotator
N. Gana Priyan - Annotator
N.Indupriya - Annotator

Telugu
Praneeth Reddy - Superchecker
Sreejanani Sanke - Reviewer

Deepika Senapathi - Annotator
Ediga Sivakumar Goud - Annotator
Naresh Nune - Annotator
Vakkapati Divyasri - Annotator
Vani Bhaskar - Annotator


 

 

We would like to acknowledge the support from the Indian Institute of Technology Madras, India, and the International Institute of Information Technology, Hyderabad, India.

 

 

MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding

 

Varun Paturkar, Shankar Gangisetty, C V Jawahar

IIIT Hyderabad

 

[Paper] [Code] [Dataset] [Project Webpage]

 


Motor IMG

 Fig: Comparison of traffic contexts and accident statistics across the Global North and South. Top Row: Four-wheelers dominating in the USA vs Two-wheelers in India. Bottom Row: Distribution of vehicles (two-wheeler vs four-wheeler) and fatal accidents across North and South.

Abstract

Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we present the MOtorized TwO-wheeler Rider (MOTOR) dataset, the first large-scale, multi-view, multimodal resource dedicated to two-wheelers in dense, unstructured traffic. MOTOR comprises 2,500 sequences (25+ hours) collected from 16 riders and integrates synchronized front, rear, and helmet videos, rider eye-gaze from wearable trackers, on-road audio, and telemetry (GPS, accelerometer, gyroscope). Rich annotations capture traffic context, rider state, 12 riding maneuvers spanning conventional and unconventional behaviors, and legality labels (Legal, Illegal, Unspecified). We benchmark rider behavior recognition and maneuver legality classification using state-of-the-art video action recognition backbones (CNN and Transformer-based), extended with multimodal fusion, and find that combining RGB, gaze, and telemetry consistently yields the best performance. MOTOR thus provides a unique foundation for advancing safety-critical understanding of two-wheeler riding. It offers the research community a benchmark to develop and evaluate models for behavior analysis, legality-aware prediction, and intelligent transportation systems. Dataset and code will be made publicly available.

 

 The MOTOR Dataset

 Table: Comparison of 4-wheeler and 2-wheeler behavior datasets. Our dataset is unique as it contains multi-modal, multi-view videos from ego-vehicle and helmet, eye gaze, as well as annotated conventional and unconventional behaviors, and legality-related riding scenarios. Note: CRB indicates conventional riding behaviors, and UCRB means unconventional riding behaviors.

Motor IMG2

 

Motor IMG3

 Fig: Data samples (helmet view). (a) Ego-rider weaves through dense, slow traffic, overtaking multiple vehicles across lanes. (b) Rider squeezes through a narrow gap between a bus and a car, narrowly avoiding the bus. (c) Rider rides in the wrong lane against dense oncoming traffic, disrupting flow. (d) Rider turns head fully toward a roadside building, diverting gaze from the road amid fast-moving traffic.

 

 Results

 Table: Rider Behavior Classification. Comparison of CNN and Transformer-based baselines on MOTOR dataset across different modality combinations.

Motor IMG4

 

 Table: Rider Legality Classification: CNN and Transformer-based baselines on MOTOR dataset across different modality combinations.

Motor IMG5
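Since the baselines extend CNN- and Transformer-based video backbones with multimodal fusion, a minimal late-fusion sketch is shown below: per-modality features (RGB clip, gaze, telemetry) are encoded separately, concatenated, and classified. The encoder choices, feature sizes, and fusion by concatenation are illustrative assumptions, not the exact benchmark models.

# Minimal PyTorch sketch of late fusion over RGB, gaze, and telemetry features (assumed architecture).
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, rgb_in=2048, gaze_in=128, telem_in=96, num_classes=12):
        super().__init__()
        # In practice the RGB branch would be a pretrained video backbone (CNN or
        # Transformer); the gaze and telemetry branches would be small MLPs or GRUs.
        self.rgb_enc = nn.Linear(rgb_in, 512)
        self.gaze_enc = nn.Linear(gaze_in, 64)
        self.telem_enc = nn.Linear(telem_in, 64)
        self.head = nn.Sequential(
            nn.Linear(512 + 64 + 64, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),            # e.g. the 12 riding maneuvers
        )

    def forward(self, rgb_feat, gaze_feat, telem_feat):
        fused = torch.cat([self.rgb_enc(rgb_feat),
                           self.gaze_enc(gaze_feat),
                           self.telem_enc(telem_feat)], dim=-1)
        return self.head(fused)

# Dummy pre-extracted features for a batch of 4 clips.
model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 128), torch.randn(4, 96))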

 

Citation

@inproceedings{motor2026icra,
author = {Varun Paturkar and Shankar Gangisetty and C. V. Jawahar},
title = {MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behaviour Understanding},
booktitle = {},
series = {},
volume = {},
pages = {},
publisher = {},
year = {2026},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 
