
Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs

 

Sainithin Artham1, Avijit Dasgupta1, Shankar Gangisetty1, C V Jawahar1

IIIT Hyderabad

 

[Paper] [Code] [Dataset]

 



 Figure. Illustration of a driving scenario where the ADAS vehicle predicts a left lane change (what) to avoid slower traffic ahead (why). Existing DIP models lacking reasoning may miss such cues, while our framework jointly learns and distills both maneuver and explanation, improving decision quality.

Abstract

Predicting a driver's intent (e.g., turns, lane changes) is a critical capability for modern Advanced Driver Assistance Systems (ADAS). While recent Multimodal Large Language Models (MLLMs) show promise in general vision-language tasks, we find that zero-shot MLLMs still lag behind domain-specific approaches for Driver Intention Prediction (DIP). To address this, we introduce DriveXplain, a zero-shot framework based on MLLMs that leverages rich visual cues such as optical flow and road semantics to automatically generate both the intended maneuver (what) and a rich natural language explanation (why). These maneuver–explanation pairs are then distilled into a compact MLLM, which jointly learns to predict intentions and the corresponding explanations. We show that incorporating explanations during training leads to substantial gains over models trained solely on labels, as distilling explanations instills reasoning capabilities by enabling the model to understand not only what decisions to make but also why those decisions are made. Comprehensive experiments across structured (Brain4Cars, AIDE) and unstructured (DAAD) datasets demonstrate that our approach achieves state-of-the-art results on the DIP task, outperforming zero-shot and domain-specific baselines. We also present ablation studies to evaluate key design choices in our framework. This work sets a direction for more explainable and generalizable intention prediction in autonomous driving systems. We plan to release our codebase to support research.

 Methodology


Figure. Our proposed framework for the DIP task. DriveXplain generates natural language explanations alongside maneuvers, and Explanation Distillation distills these explanations into a single compact MLLM to enhance DIP performance at inference.
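For intuition, a minimal sketch of how such a zero-shot what-and-why query could be assembled is given below. The prompt wording, the cue summaries, and the query_mllm helper are illustrative placeholders, not the released DriveXplain code.

# Hedged sketch of zero-shot "what + why" querying. query_mllm is a hypothetical
# callable wrapping any MLLM client; the optical-flow and road-semantics
# summaries would come from upstream vision models and are hard-coded here.
from typing import Tuple

MANEUVERS = ["left turn", "right turn", "left lane change", "right lane change", "go straight"]

def build_prompt(flow_summary: str, road_summary: str) -> str:
    # Ask for the maneuver (what) and a one-sentence reason (why) in one pass.
    return (
        "You are a driving assistant watching the given road frames.\n"
        f"Optical-flow cue: {flow_summary}\n"
        f"Road-semantics cue: {road_summary}\n"
        f"Choose the driver's intended maneuver from {MANEUVERS} and explain why "
        "in one sentence. Answer as: maneuver | explanation."
    )

def parse_answer(text: str) -> Tuple[str, str]:
    # Split "maneuver | explanation" into the (what, why) pair used later for distillation.
    what, _, why = text.partition("|")
    return what.strip().lower(), why.strip()

def label_clip(frames, query_mllm) -> Tuple[str, str]:
    prompt = build_prompt("ego view drifting toward the left lane marking",
                          "slower vehicle ahead, adjacent left lane free")
    return parse_answer(query_mllm(prompt, frames))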

 

Key Highlights:

  • New Task: Understanding why a driver makes a decision is just as important as predicting what they’ll do next. 
  • DriveXplain Model: We introduce a zero-shot framework that enhances MLLMs for ADAS by embedding driving-specific context directly into their reasoning.
  • Knowledge Distillation: To enable real-time, deployable solutions, we distill reasoning and decision-making capabilities from large MLLMs into smaller, efficient models, paving the way for explainable driving intelligence (a training sketch follows this list).
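As a rough illustration of the distillation step, the sketch below fine-tunes a small causal language model on teacher-generated (maneuver, explanation) targets, masking the prompt so that only the what-and-why tokens are supervised. Representing the clip by a text caption and using GPT-2 as a stand-in for the compact MLLM are assumptions made for the sketch; this is not our released training code.

# Minimal sketch of explanation distillation: the student learns to generate
# "Maneuver: ... Reason: ..." so the label and its explanation share one
# next-token objective. "gpt2" is only a placeholder for the compact MLLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2")

def distillation_loss(clip_caption: str, maneuver: str, explanation: str) -> torch.Tensor:
    prompt = f"Scene: {clip_caption}\nPredict the maneuver and the reason.\n"
    target = f"Maneuver: {maneuver}. Reason: {explanation}{tok.eos_token}"
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # supervise only the what + why tokens
    return student(input_ids=input_ids, labels=labels).loss

loss = distillation_loss(
    "slow truck ahead in the ego lane, adjacent left lane is clear",
    "left lane change",
    "the ego vehicle needs to overtake the slower truck ahead",
)
loss.backward()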

 Results


Table. DIP benchmark results. Performance comparison of driving-specific VLMs, general VLMs, action anticipation models, and our framework (DriveXplain, ED). Accuracy (Acc.) and F1 (%) on the Brain4Cars, AIDE, and DAAD datasets. Finetune indicates whether the model was fine-tuned (✓) or evaluated in a zero-shot (✗) setting. Bold and underline indicate the best and second-best results.


Figure. Qualitative comparison of our framework with zero-shot Qwen2.5-VL and Dolphins across the Brain4Cars, AIDE, and DAAD datasets. We show maneuver prediction (what) and explanation (why), with attention heatmaps highlighting key regions.

 

Citation

@inproceedings{vcbm2025daadx,
author = {Sainithin Artham and Avijit Dasgupta and Shankar Gangisetty and C. V. Jawahar},
title = {Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs},
booktitle = {},
series = {},
volume = {},
pages = {},
publisher = {},
year = {2025},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

TexTAR – Textual Attribute Recognition in Multi-domain and Multi-lingual Document Images

 

Rohan Kumar, Jyothi Swaroopa Jinka, Ravi Kiran Sarvadevabhatla

International Institute of Information Technology Hyderabad

 

[Paper]  [Code & Dataset] 

 

 

Abstract

Recognising textual attributes such as bold, italic, underline, and strikeout is essential for understanding text semantics, structure and visual presentation. Existing methods struggle with computational efficiency or adaptability in noisy, multilingual settings. To address this, we introduce TexTAR, a multi-task, context-aware Transformer for Textual Attribute Recognition (TAR). Our data-selection pipeline enhances context awareness, and our architecture employs a 2-D RoPE mechanism to incorporate spatial context for more accurate predictions. We also present MMTAD, a diverse multilingual dataset annotated with text attributes across real-world documents. TexTAR achieves state-of-the-art performance in extensive evaluations.
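For readers unfamiliar with 2-D RoPE, the sketch below shows one common way to rotate query/key features by the x- and y-coordinates of a text region so that relative spatial offsets appear in attention scores. It is a generic illustration under assumed tensor shapes; the released TexTAR implementation may differ in its details.

# A minimal sketch of 2-D rotary position embedding (RoPE): half of each
# feature vector is rotated by the x-coordinate, the other half by the
# y-coordinate. Shapes and the coordinate source are illustrative only.
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Rotate feature pairs of x (..., d) by angles derived from pos (...).
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = pos[..., None] * freqs            # (..., d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_2d(x: torch.Tensor, xy: torch.Tensor) -> torch.Tensor:
    # x: (batch, tokens, dim); xy: (batch, tokens, 2) spatial coordinates.
    d = x.shape[-1]
    return torch.cat(
        [rope_1d(x[..., : d // 2], xy[..., 0]),   # first half encodes x-position
         rope_1d(x[..., d // 2 :], xy[..., 1])],  # second half encodes y-position
        dim=-1,
    )

# Usage: rotate queries/keys before attention so relative spatial offsets
# between word crops are reflected in their dot products.
q = torch.randn(2, 16, 64)              # (batch, tokens, head_dim)
coords = torch.rand(2, 16, 2) * 100     # hypothetical (x, y) box centres
q_rot = rope_2d(q, coords)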

 

Textual Attributes in the Dataset

 


 

 

Data-selection Pipeline


 

Model Architecture

 


 

Comparison with State-of-the-Art Approaches

 


 

Visualization of results for a subset of baselines and variants in comparison with TexTAR

 


Download the Dataset and Weights

Model weights and the MMTAD test set can be downloaded from the link. To get access to the full dataset, please contact the authors.

 

Citation

@inproceedings{Kumar2025TexTAR,
author = {Rohan Kumar and Jyothi Swaroopa Jinka and Ravi Kiran Sarvadevabhatla},
title = {TexTAR: Textual Attribute Recognition in Multi-domain and Multi-lingual Document Images},
booktitle = {International Conference on Document Analysis and Recognition, ICDAR},
year = {2025},
}

 

Acknowledgements

International Institute of Information Technology Hyderabad, India.


 

 

Towards Scalable Sign Production: Leveraging Co-Articulated Gloss Dictionary for Fluid Sign Synthesis

 

Check back soon; we will update this page.

Towards Safer and Understandable Driver Intention Predictions

 

Mukilan Karuppasamy1, Shankar Gangisetty1, Shyam Nandan Rai2, Carlo Masone2, C V Jawahar1

1IIIT Hyderabad, India and 2Politecnico di Torino, Italy

 

[arXiv] [Paper] [Code] [Dataset] [GitHub Webpage]

 

 

Illustration of an AD scenario for the DIP task. An AD system may intend to take a left turn while encountering a parked or slow-moving vehicle at the turn. Existing DIP models, lacking HCI understanding, might fail to anticipate the obstacle, leading to a potential collision. In contrast, an interpretable model can assess the situation through explainable interactions, adjust its manoeuvre, and safely navigate the turn. Towards this, we propose the VCBM model for DIP, incorporating one or more ego-vehicle explanations to enhance decision-making transparency.

Abstract

Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, largely due to recent advances in deep learning and AI. As the interactions between autonomous systems and humans grow, the interpretability of driving system decision-making processes becomes crucial for safe driving. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretable maneuver prediction before the maneuver occurs, i.e., driver intent prediction (DIP), which plays a critical role in the safety of AD systems. To foster research in interpretable DIP, we curate the explainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset that provides hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye-gaze and the ego-vehicle's perspective. Next, we propose the Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability compared to conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations.
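As an illustration of the multilabel t-SNE idea, the following sketch projects concept embeddings to 2-D and shows, per explanation label, which samples carry that label. The array shapes, label names, and plotting style are assumptions for the sketch, not the exact code used in the paper.

# Generic multilabel t-SNE view: one panel per explanation label, with the
# samples carrying that label highlighted. Random data stands in for the
# VCBM concept embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def multilabel_tsne(embeddings: np.ndarray, labels: np.ndarray, names: list):
    # embeddings: (N, D) concept features; labels: (N, C) binary explanation matrix.
    xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
    fig, axes = plt.subplots(1, labels.shape[1], figsize=(4 * labels.shape[1], 4))
    for c, ax in enumerate(np.atleast_1d(axes)):
        mask = labels[:, c].astype(bool)
        ax.scatter(*xy[~mask].T, s=5, c="lightgray")
        ax.scatter(*xy[mask].T, s=5, c="tab:red")   # samples carrying explanation c
        ax.set_title(names[c])
        ax.set_xticks([]); ax.set_yticks([])
    return fig

emb = np.random.randn(300, 64)
lab = (np.random.rand(300, 3) > 0.7).astype(int)
multilabel_tsne(emb, lab, ["slow vehicle ahead", "pedestrian crossing", "traffic light"])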

The DAAD-X Dataset

Table. Comparison of driver intention prediction datasets. Our dataset is a subset of the DAAD dataset and includes additional categories of explanations for multi-modal videos, encompassing both in-cabin (Aria eye-gaze) and out-cabin (ego-vehicle) perspectives.

 

DAAD-X Data Annotation

 


Figure. Driving video annotation statistics of the DAAD-X dataset, illustrating the distribution of (left) ego-vehicle explanations and (right) eye-gaze explanations across different maneuver actions. Detailed explanation categories are provided in the supplementary material. Better viewed zoomed in for clarity.

 

Video Concept Bottleneck Model (VCBM)

 


Figure. Overall architecture of the proposed VCBM. The dual video encoder first generates spatio-temporal features (tubelet embeddings) for the paired ego-vehicle and gaze input sequences. These tubelets are concatenated along the channel dimension and fed into the proposed learnable token merging block, which produces K cluster centers based on composite distances. These clusters are then fed into a localised concept bottleneck to disentangle and predict the maneuver label and one or more explanations that justify the maneuver decision.
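The sketch below gives a simplified view of such a concept-bottleneck head: concept (explanation) logits are pooled from the merged cluster tokens, and the maneuver is predicted from the concepts alone, so the label decision must pass through the explanations. The module names, dimensions, and pooling choice are illustrative assumptions; the learnable token merging and localisation details of VCBM are not reproduced here.

# Simplified video concept-bottleneck head, assuming K merged cluster tokens
# are already available from the ego + gaze encoder.
import torch
import torch.nn as nn

class ConceptBottleneckHead(nn.Module):
    def __init__(self, dim: int, num_concepts: int, num_maneuvers: int):
        super().__init__()
        self.concept_scorer = nn.Linear(dim, num_concepts)           # per-cluster concept logits
        self.maneuver_head = nn.Linear(num_concepts, num_maneuvers)  # label predicted from concepts only

    def forward(self, cluster_tokens: torch.Tensor):
        # cluster_tokens: (batch, K, dim) merged tubelet clusters.
        concept_logits = self.concept_scorer(cluster_tokens).max(dim=1).values  # (batch, num_concepts)
        concepts = torch.sigmoid(concept_logits)          # multilabel explanations (why)
        maneuver_logits = self.maneuver_head(concepts)    # maneuver decision (what) via the bottleneck
        return concept_logits, maneuver_logits

head = ConceptBottleneckHead(dim=256, num_concepts=8, num_maneuvers=5)
tokens = torch.randn(2, 16, 256)                          # stand-in for K = 16 cluster centers
concept_logits, maneuver_logits = head(tokens)

# Joint objective: multilabel BCE on explanations + cross-entropy on the maneuver.
explanations = torch.randint(0, 2, (2, 8)).float()
maneuvers = torch.randint(0, 5, (2,))
loss = nn.functional.binary_cross_entropy_with_logits(concept_logits, explanations) \
     + nn.functional.cross_entropy(maneuver_logits, maneuvers)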

 

Results

Table. Evaluation of baselines on the DAAD-X dataset with (wB) and without (woB) the bottleneck. Here, LTM indicates Learnable Token Merging.


Table. Gaze modality input variants. Using gaze-cropped regions works better for the DIP task than the usual approach of overlaying gaze.



Figure. GradCAM visualization of the proposed method. At t = 1, the activations are scattered, but as time progresses to t = T, the CAM gradually refines and localises on important objects. This mirrors how human decision-making evolves over time.

 

Citation

@inproceedings{vcbm2025daadx,
author = {Mukilan Karuppasamy and Shankar Gangisetty and Shyam Nandan Rai and Carlo Masone and C. V. Jawahar},
title = {Towards Safer and Understandable Driver Intention Prediction},
booktitle = {ICCV},
year = {2025},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

Pedestrian Intention and Trajectory Prediction in Unstructured Traffic Using IDD-PeD

 

Ruthvik Bokkasam1, Shankar Gangisetty1, A. H. Abdul Hafez2,

C V Jawahar1

1IIIT Hyderabad, India and 2King Faisal University, Saudi Arabia

[Paper Link] [ Code & Dataset ] 


Fig. 1: Illustration of pedestrian intention and trajectory prediction under various challenges in our unstructured-traffic IDD-PeD dataset. The challenges include occlusions, signalized types, vehicle-pedestrian interactions, and illumination changes. Intent labels: C (Crossing, with trajectory) and NC (Not Crossing).

Abstract

With the rapid advancements in autonomous driving, accurately predicting pedestrian behavior has become essential for ensuring safety in complex and unpredictable traffic conditions. The growing interest in this challenge highlights the need for comprehensive datasets that capture unstructured environments, enabling the development of more robust prediction models to enhance pedestrian safety and vehicle navigation. In this paper, we introduce an Indian driving pedestrian dataset designed to address the complexities of modeling pedestrian behavior in unstructured environments, such as illumination changes, occlusion of pedestrians, unsignalized scene types, and vehicle-pedestrian interactions. The dataset provides comprehensive high-level and detailed low-level annotations focused on pedestrians requiring the ego-vehicle's attention. Evaluation of state-of-the-art intention prediction methods on our dataset shows a significant performance drop of up to 15%, while trajectory prediction methods underperform with an increase of up to 1208 MSE compared to standard pedestrian datasets. Additionally, we present exhaustive quantitative and qualitative analysis of intention and trajectory baselines. We believe that our dataset will open new challenges for the pedestrian behavior research community to build robust models.

 

The IDD-PeD dataset

Table 1: Comparison of datasets for pedestrian behavior understanding. On-board diagnostics (OBD) provides ego-vehicle speed, acceleration, and GPS information. Group annotation represents the number of pedestrians moving together; in our dataset, about 1,800 move individually, while the rest move in groups of 2 or more. Interaction annotation refers to a label between the ego-vehicle and a pedestrian, where both influence each other's movements and decisions. ✓ and ✗ indicate the presence or absence of annotated data.

 

Fig. 2: Annotation instances and data statistics of IDD-PeD. Distribution of (a) frame-level ego-vehicle speeds, (b) pedestrians at signalized types such as crosswalk (C), signal (S), crosswalk and signal (CS), and absence of crosswalk and signal (NA), (c) pedestrian track lengths at day and night, (d) frame-level behavior and traffic-object annotations, and (e) pedestrian occlusions.

Results

Pedestrian Intention Prediction (PIP) Baselines

Table 2: Evaluation of PIP baselines on JAAD, PIE, and our datasets.

Pedestrian Trajectory Prediction (PTP) Baselines

Table 3: Evaluation of PTP baselines on JAAD, PIE and our datasets. We report MSE results at 1.5s. “-” indicates no results as PIETraj needs explicit ego-vehicle speeds.


Fig.: Qualitative evaluation of the best and worst PTP models on our dataset. Red: SGNet, Blue: PIETraj, Green: Ground truth, White: Observation period. To better illustrate and highlight key factors in PIP and PTP methods, a qualitative analysis will be provided in the supplementary video.
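For reference, the MSE reported in Table 3 is conventionally computed over predicted versus ground-truth pedestrian bounding-box coordinates across the prediction horizon, as in PIE/JAAD-style evaluation. The sketch below illustrates that computation under assumed array shapes and frame rate; it is not the benchmark script.

# Minimal sketch of the trajectory MSE metric over pixel bounding boxes
# (x1, y1, x2, y2); exact protocol details may differ from the benchmark.
import numpy as np

def trajectory_mse(pred: np.ndarray, gt: np.ndarray) -> float:
    # pred, gt: (num_tracks, horizon, 4) arrays of bounding-box coordinates.
    return float(np.mean((pred - gt) ** 2))

# Toy example: a 1.5 s horizon at an assumed 30 fps is 45 future frames per track.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 1920, size=(8, 45, 4))
pred = gt + rng.normal(0, 10, size=gt.shape)   # ~10 px error gives MSE around 100
print(trajectory_mse(pred, gt))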

 

Citation

@inproceedings{idd2025ped,
author = {Ruthvik Bokkasam and Shankar Gangisetty and A. H. Abdul Hafez and C. V. Jawahar},
title = {Pedestrian Intention and Trajectory Prediction in Unstructured Traffic Using IDD-PeD},
booktitle = {ICRA},
publisher = {IEEE},
year = {2025},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.
