
PedestrianQA : A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

 

Naman Mishra, Shankar Gangisetty, C V Jawahar

IIIT Hyderabad

 

[Paper] [Code] [Dataset]

 



 Fig: An illustration of an unstructured-traffic scenario where a pedestrian stands in the middle of the road in front of the ego-vehicle, attempting to cross the road. Unlike prior approaches that provide only predictions, our method predicts the intention and trajectory and generates supporting rationales.

Abstract

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision–language models (VLMs) offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as a question–answering task augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions, without needing specialized architectures tailored to each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling. The dataset and model will be made publicly available.
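The QA-with-rationale formulation above can be sketched as a simple record type. The field names below are illustrative guesses, not the released PedestrianQA schema:

```python
# Hypothetical sketch of one PedestrianQA-style training sample: intention (PIP)
# or trajectory (PTP) prediction posed as question answering with a rationale.
# All field names here are illustrative assumptions.
from dataclasses import dataclass, asdict

@dataclass
class PedestrianQASample:
    video_id: str
    frame_range: tuple   # (start_frame, end_frame) of the observed clip
    question: str        # an intention (PIP) or trajectory (PTP) query
    answer: str          # e.g. "yes"/"no" or a textual list of future positions
    rationale: str       # concise natural-language justification for the answer

sample = PedestrianQASample(
    video_id="seq_0042",
    frame_range=(0, 15),
    question="Will the pedestrian in the red box cross in front of the ego-vehicle?",
    answer="yes",
    rationale="The pedestrian stands mid-road, faces the ego-lane, and leans forward.",
)
print(asdict(sample)["answer"])
```

A sample like this serializes directly into the question–answer–rationale triplets the dataset is built from.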

 

 PedestrianQA Dataset

 


 Fig: Data generation pipeline. We first aggregate all human-annotated ground-truth labels from the constituent datasets into a unified metadata schema. We then enrich motion semantics with VLM-generated captions, using carefully designed pedestrian-motion prompts that target fine-grained cues. These captions are validated for format and appended to the metadata. Finally, we construct a single instruction package containing: (i) a system prompt, (ii) task definitions for PIP and PTP, (iii) step-by-step guidance for producing structured, fine-grained rationales, (iv) a small set of in-context exemplars, (v) a compliance checklist for high-quality rationale generation, and (vi) the sequence-level metadata tables. This package is provided to the claude-sonnet-4-20250514 LLM API to generate triplets of questions, answers, and rationales.
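The six-part instruction package described in the caption could be assembled roughly as follows. Only the part names come from the text above; the section formatting and metadata layout are illustrative assumptions:

```python
# Hedged sketch of assembling the six-part instruction package (system prompt,
# task definitions, rationale guidance, exemplars, checklist, metadata) into a
# single text payload for an LLM call. The markup and wording are illustrative.
def build_instruction_package(system_prompt, task_defs, rationale_guide,
                              exemplars, checklist, metadata_rows):
    parts = [
        "## System\n" + system_prompt,
        "## Tasks (PIP / PTP)\n" + "\n".join(task_defs),
        "## Rationale guidance\n" + rationale_guide,
        "## In-context exemplars\n" + "\n---\n".join(exemplars),
        "## Compliance checklist\n" + "\n".join(f"- {c}" for c in checklist),
        "## Sequence metadata\n" + "\n".join(
            ", ".join(f"{k}={v}" for k, v in row.items()) for row in metadata_rows
        ),
    ]
    return "\n\n".join(parts)

pkg = build_instruction_package(
    system_prompt="You annotate pedestrian behavior.",
    task_defs=["PIP: classify crossing intention.", "PTP: forecast the trajectory."],
    rationale_guide="Justify each answer step by step, citing visual evidence.",
    exemplars=["Q: ... A: ... Rationale: ..."],
    checklist=["Rationale must cite visual evidence."],
    metadata_rows=[{"seq": "0042", "frames": "0-15", "caption": "pedestrian mid-road"}],
)
print(pkg.count("##"))  # six sections, one per package component
```

The returned string would then be sent as the prompt for question–answer–rationale generation.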

 

 Results


 Table: Rationale evaluation on the combined dataset, with Claude-Sonnet-4. Average scores (0–100) for Spatial Reasoning (SR), Temporal Reasoning (TR), Mathematical Reasoning (MR), Ego-Vehicle Reasoning (EVR), Scene-Context Reasoning (SCR), Final Destination Prediction (FDP), and Conclusion (C). ✓ indicates finetuned models, ✗ zero-shot. Bold shows best score per column; underline marks the second-best. Dolphins generates only a brief conclusion and does not generate category-specific rationales.   


Citation

@inproceedings{pedqa2026icra,
author = {Naman Mishra and Shankar Gangisetty and C. V. Jawahar},
title = {PedestrianQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction},
booktitle = {},
series = {},
volume = {},
pages = {},
publisher = {},
year = {2026},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

 

Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta,
C V Jawahar

IIIT Hyderabad

 

[Paper] [Code] [Dataset]

 



 Fig: Previous works in driving scenarios primarily address risk perception but fall short of offering actionable safety guidance. Similarly, general-purpose MLLMs (Qwen2.5-VL, LLaVA-NeXT, VideoLLaMA3) are still unreliable in this regard. In contrast, our approach, DriveSafe, integrates risk assessment with clear, human-understandable safety suggestions.

Abstract

Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision–language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption–risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. We plan to release our codebase to support future research.
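The "lightweight adapter" idea can be illustrated with a minimal low-rank update in the spirit of LoRA: a frozen base weight plus a small trainable correction. The shapes, rank, and zero initialization below are generic assumptions, not DriveSafe's actual configuration:

```python
# Sketch of a low-rank adapter on a frozen weight matrix: only A and B are
# trainable, so a small number of parameters carry the injected domain
# knowledge. Shapes and rank are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4

W_frozen = rng.standard_normal((d_out, d_in))   # base LLM weight, not trained
A = rng.standard_normal((rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                     # trainable up-projection (zero init)

def adapted_forward(x):
    # Base path plus low-rank correction: W x + B (A x)
    return W_frozen @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter starts as an exact no-op,
# so training begins from the unmodified base model.
print(np.allclose(adapted_forward(x), W_frozen @ x))
```

Training updates only A and B, which is what makes adapter fine-tuning cheap relative to full fine-tuning.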

 

 Methodology


 Fig: Our proposed DriveSafe framework for the caption generation and safety suggestion task in driving. We first derive contextual cues to guide caption generation, and then use the resulting captions for risk assessment and safety suggestion.
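The two-stage flow in the figure (contextual cues, then captioning, then risk assessment and suggestion) can be sketched as prompt chaining. Here `ask_mllm`, the cue fields, and the prompt wording are all hypothetical placeholders, not DriveSafe's actual prompts:

```python
# Hedged sketch of a two-stage caption -> risk-assessment pipeline.
# `ask_mllm` stands in for any multimodal LLM call; prompts are illustrative.
def caption_prompt(cues):
    return ("Describe the scene, grounding each object spatially. "
            f"Context cues: motion={cues['motion']}, depth={cues['depth']}.")

def risk_prompt(caption):
    return ("Given this scene description, identify hazardous objects, their "
            f"locations, the unsafe behavior, and a safety suggestion.\n{caption}")

def drivesafe_pipeline(cues, ask_mllm):
    # Stage 1: cue-guided caption; Stage 2: caption-conditioned risk assessment.
    caption = ask_mllm(caption_prompt(cues))
    return caption, ask_mllm(risk_prompt(caption))

# Stubbed model call just to show the data flow end to end.
caption, risk = drivesafe_pipeline(
    {"motion": "pedestrian moving left", "depth": "8 m ahead"},
    ask_mllm=lambda p: f"[model answer to: {p[:30]}...]",
)
print(risk.startswith("[model answer"))
```

Conditioning the second stage on the explicit caption text is what makes the risk output language-grounded rather than pixel-grounded.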

 

 Results


 Table: Performance comparison of Caption Generation and Risky Object Grounding across Existing Methods, General VLMs, and DriveSafe on the DRAMA dataset. 

 


 Fig: Qualitative comparison of DriveSafe-ZeroShot, DriveSafe-Finetuned, and Qwen2.5-VL on three driving scenarios from the DRAMA dataset. Risky object grounding is shown with bounding boxes matched to the respective models by text highlighting, while generated captions and safety suggestions are marked as correct (green) or incorrect (red).

 

Citation

@inproceedings{drivesafe2026icra,
author = {Sainithin Artham and Shankar Gangisetty and Avijit Dasgupta and C. V. Jawahar},
title = {DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios},
booktitle = {},
series = {},
volume = {},
pages = {},
publisher = {},
year = {2026},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs

 

Sainithin Artham, Avijit Dasgupta, Shankar Gangisetty, C V Jawahar

IIIT Hyderabad

 

[Paper] [Code]

 



 Figure. Illustration of a driving scenario where the ADAS vehicle predicts a left lane change (what) to avoid slower traffic ahead (why). Existing DIP models lacking reasoning may miss such cues, while our framework jointly learns and distills both maneuver and explanation, improving decision quality.

Abstract

Predicting a driver's intent (e.g., turns, lane changes) is a critical capability for modern Advanced Driver Assistance Systems (ADAS). While recent Multimodal Large Language Models (MLLMs) show promise in general vision-language tasks, we find that zero-shot MLLMs still lag behind domain-specific approaches for Driver Intention Prediction (DIP). To address this, we introduce DriveXplain, a zero-shot framework based on MLLMs that leverages rich visual cues such as optical flow and road semantics to automatically generate both the intention maneuver (what) and rich natural-language explanations (why). These maneuver–explanation pairs are then distilled into a compact MLLM, which jointly learns to predict intentions and corresponding explanations. We show that incorporating explanations during training leads to substantial gains over models trained solely on labels, as distilling explanations instills reasoning capabilities by enabling the model to understand not only what decisions to make but also why those decisions are made. Comprehensive experiments across structured (Brain4Cars, AIDE) and unstructured (DAAD) datasets demonstrate that our approach achieves state-of-the-art results on the DIP task, outperforming zero-shot and domain-specific baselines. We also present ablation studies to evaluate key design choices in our framework. This work sets a direction for more explainable and generalizable intention prediction in autonomous driving systems. We plan to release our codebase to support research.

 Methodology


 Figure. Our proposed framework for the DIP task. DriveXplain generates natural language explanations alongside maneuvers and Explanation Distillation distills these explanations into a single MLLM to enhance DIP performance at inference.
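One plausible way to realize the explanation-distillation step is to pack the teacher's maneuver and explanation into a single supervision string for the student, so the student learns the "what" and the "why" jointly. The template below is an illustrative assumption, not the paper's exact format:

```python
# Hedged sketch: converting DriveXplain (teacher) outputs into joint
# maneuver+explanation training targets for a compact student MLLM.
# The target template and example content are illustrative.
def make_distillation_target(maneuver, explanation):
    # Supervising on both parts is the key difference from label-only training.
    return f"Maneuver: {maneuver}\nBecause: {explanation}"

teacher_output = {
    "maneuver": "left lane change",
    "explanation": "slower traffic ahead in the ego lane; adjacent left lane is clear",
}
target = make_distillation_target(**teacher_output)
print(target.startswith("Maneuver: left lane change"))
```

A label-only baseline would supervise on the maneuver string alone; including the `Because:` clause is what transfers the teacher's reasoning.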

 

Key Highlights:

  • New Task: Understanding why a driver makes a decision is just as important as predicting what they’ll do next. 
  • DriveXplain Model: We introduce a zero-shot framework that enhances MLLMs for ADAS by embedding driving-specific context directly into their reasoning.
  • Knowledge Distillation: To enable real-time, deployable solutions, we distill reasoning and decision-making capabilities from large MLLMs into smaller, efficient models, paving the way for explainable driving intelligence.

 Results


 Table: DIP benchmark results. Performance comparison of driving-specific VLMs, general VLMs, action anticipation models, and our framework (DriveXplain, ED). Accuracy (Acc.) and F1 (%) on the Brain4Cars, AIDE, and DAAD datasets. Finetune indicates whether the model was fine-tuned (✓) or evaluated in a zero-shot (✗) setting. Bold and underline indicate the best and second-best results.


 Figure: Qualitative comparison of our proposed framework, zero-shot Qwen2.5-VL, and Dolphins across the Brain4Cars, AIDE, and DAAD datasets. We show maneuver prediction (what) and explanation (why), with attention heatmaps highlighting key regions.

 

Citation

@inproceedings{vcbm2025daadx,
author = {Sainithin Artham and Avijit Dasgupta and Shankar Gangisetty and C. V. Jawahar},
title = {Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs},
booktitle = {},
series = {},
volume = {},
pages = {},
publisher = {},
year = {2025},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

TexTAR – Textual Attribute Recognition in Multi-domain and Multi-lingual Document Images

 

Rohan Kumar, Jyothi Swaroopa Jinka, Ravi Kiran Sarvadevabhatla

International Institute of Information Technology Hyderabad

 

[Paper]  [Code & Dataset] 

 

 

Abstract

Recognising textual attributes such as bold, italic, underline, and strikeout is essential for understanding text semantics, structure and visual presentation. Existing methods struggle with computational efficiency or adaptability in noisy, multilingual settings. To address this, we introduce TexTAR, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR). Our data-selection pipeline enhances context awareness, and our architecture employs a 2-D RoPE mechanism to incorporate spatial context for more accurate predictions. We also present MMTAD, a diverse multilingual dataset annotated with text attributes across real-world documents. TexTAR achieves state-of-the-art performance in extensive evaluations.
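The 2-D RoPE mechanism mentioned in the abstract can be sketched by rotating one half of the channels with the x-coordinate and the other half with the y-coordinate. The dimensions and base frequency below are generic placeholders, not TexTAR's exact configuration:

```python
# Illustrative 2-D rotary position embedding (RoPE): standard 1-D RoPE applied
# per spatial axis to separate channel halves. Sizes here are placeholders.
import numpy as np

def rope_1d(feat, pos, base=10000.0):
    # feat: (..., d) with d even; rotate consecutive channel pairs by
    # position-dependent angles pos * base^(-2i/d).
    d = feat.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)          # (d/2,)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    f1, f2 = feat[..., 0::2], feat[..., 1::2]
    out = np.empty_like(feat)
    out[..., 0::2] = f1 * cos - f2 * sin
    out[..., 1::2] = f1 * sin + f2 * cos
    return out

def rope_2d(feat, x, y):
    # First half of the channels encodes the x-position, second half the y-position.
    half = feat.shape[-1] // 2
    return np.concatenate([rope_1d(feat[..., :half], x),
                           rope_1d(feat[..., half:], y)], axis=-1)

v = np.ones(8)
# Rotations preserve the vector norm, a defining property of RoPE.
print(np.allclose(np.linalg.norm(rope_2d(v, x=3, y=5)), np.linalg.norm(v)))
```

Because the encoding is a pure rotation, relative offsets in either axis translate into angle differences between query and key, which is how spatial context enters the attention scores.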

 

Textual Attributes in the Dataset

 


 

 

Data-selection Pipeline


 

Model Architecture

 


 

Comparison with State-of-the-Art Approaches

 


 

Visualization of results for a subset of baselines and variants in comparison with TexTAR

 


Download the Dataset and Weights

 Model weights and the MMTAD test set can be downloaded from the link. To get access to the full dataset, please contact the authors.

 

Citation

@inproceedings{Kumar2025TexTAR,
author = {Rohan Kumar and Jyothi Swaroopa Jinka and Ravi Kiran Sarvadevabhatla},
title = {TexTAR: Textual Attribute Recognition in Multi-domain and Multi-lingual Document Images},
booktitle = {International Conference on Document Analysis and Recognition, ICDAR},
year = {2025},
}

 

Acknowledgements

International Institute of Information Technology Hyderabad, India.


 

 

Towards Scalable Sign Production: Leveraging Co-Articulated Gloss Dictionary for Fluid Sign Synthesis

 

Check back; we will update soon.
