Towards Safer and Understandable Driver Intention Prediction
[Paper] [Code & Dataset]

Illustration of an AD scenario for the DIP task. An AD system may intend to take a left turn while encountering a parked or slow-moving vehicle at the turn. Existing DIP models, lacking HCI understanding, might fail to anticipate the obstacle, leading to a potential collision. In contrast, an interpretable model can assess the situation through explainable interactions, adjust its maneuver, and safely navigate the turn. Towards this, we propose the VCBM model for DIP, which incorporates one or more ego-vehicle explanations to enhance decision-making transparency.
Abstract
Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, largely due to recent advances in deep learning and AI. As the interactions between autonomous systems and humans grow, the interpretability of driving system decision-making processes becomes crucial for safe driving. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretable maneuver prediction before the maneuver occurs, i.e., driver intention prediction (DIP), which plays a critical role in AD systems. To foster research in interpretable DIP, we curate the explainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset that provides hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye-gaze and the ego-vehicle's perspective. Next, we propose the Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability compared to conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations.
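The multilabel t-SNE visualization can be pictured with the short sketch below: each sample's 2D point is colored by blending the colors of all explanation labels active for it, so entangled explanations show up as mixed-color clusters. The function name, input shapes, and color-blending rule are our illustrative assumptions, not the paper's released code.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def multilabel_tsne(features, labels, palette):
    """Project features to 2D; color each point by its active labels.

    features: (N, D) concept embeddings.
    labels:   (N, C) binary multilabel matrix, one column per explanation.
    palette:  (C, 3) RGB color in [0, 1] assigned to each explanation.
    """
    xy = TSNE(n_components=2, perplexity=30).fit_transform(features)
    # Blend: a point's color is the mean color of its active explanations.
    counts = labels.sum(axis=1, keepdims=True).clip(min=1)
    point_colors = (labels @ palette) / counts
    plt.scatter(xy[:, 0], xy[:, 1], c=point_colors, s=8)
    plt.title("Multilabel t-SNE of concept embeddings")
    plt.show()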
The DAAD-X Dataset
Table. Comparison of datasets for driver intention prediction. Our dataset is a subset of the DAAD dataset and includes additional categories of explanations for multi-modal videos, encompassing both in-cabin (Aria eye-gaze) and out-cabin (ego-vehicle) perspectives.
DAAD-X Data Annotation
Figure. Driving video annotation statistics of the DAAD-X dataset, illustrating the distribution of (left) ego-vehicle explanations and (right) eye-gaze explanations across different maneuver actions. Detailed explanation categories are provided in the supplementary material. Best viewed zoomed in for clarity.
Video Concept Bottleneck Model (VCBM)
Figure. Overall architecture of the proposed VCBM. The dual video encoder first generates spatio-temporal features (tubelet embeddings) for the paired ego-vehicle and gaze input sequences. These tubelets are concatenated along the channel dimension and fed into the proposed learnable token merging block, which produces K cluster centers based on composite distances. These clusters are then fed into the localized concept bottleneck to disentangle concepts and predict the maneuver label along with one or more explanations justifying the maneuver decision.
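For readers who prefer code, below is a minimal PyTorch sketch of the VCBM forward pass as we read it from the figure above. Module names (VCBMSketch, cluster_queries), the dimensions, and the use of cross-attention as a stand-in for composite-distance token merging are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class VCBMSketch(nn.Module):
    """Illustrative sketch of the VCBM pipeline (not the official code)."""

    def __init__(self, dim=256, k_clusters=16, num_concepts=22, num_maneuvers=5):
        super().__init__()
        # Stand-ins for the two tubelet-embedding video encoders.
        self.ego_encoder = nn.Linear(dim, dim)
        self.gaze_encoder = nn.Linear(dim, dim)
        # Learnable token merging: K learnable queries attend over the fused
        # tubelets and act as cluster centers (a simple proxy for the
        # composite-distance clustering described in the caption).
        self.cluster_queries = nn.Parameter(torch.randn(k_clusters, 2 * dim))
        self.merge_attn = nn.MultiheadAttention(2 * dim, num_heads=8, batch_first=True)
        # Localized concept bottleneck: clusters are scored against the
        # explanation vocabulary; the maneuver is predicted from concepts only.
        self.concept_head = nn.Linear(2 * dim, num_concepts)
        self.maneuver_head = nn.Linear(num_concepts, num_maneuvers)

    def forward(self, ego_tubelets, gaze_tubelets):
        # Both inputs: (B, N, dim) spatio-temporal tokens from each view.
        fused = torch.cat([self.ego_encoder(ego_tubelets),
                           self.gaze_encoder(gaze_tubelets)], dim=-1)   # (B, N, 2*dim)
        q = self.cluster_queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        clusters, _ = self.merge_attn(q, fused, fused)                  # (B, K, 2*dim)
        concept_logits = self.concept_head(clusters).amax(dim=1)        # (B, C)
        maneuver_logits = self.maneuver_head(concept_logits.sigmoid())  # (B, M)
        return maneuver_logits, concept_logits

Because the maneuver head sees only concept activations, every prediction is forced through the explanation layer, which is what makes the bottleneck interpretable by construction.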
Results
Table. Evaluation of baselines on the DAAD-X dataset with (wB) and without (woB) the bottleneck. Here, LTM indicates Learnable Token Merging.
Table. Gaze modality input variants. Using gaze-cropped regions performs better than the usual practice of overlaying the gaze point on the frame for the DIP task.
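A minimal sketch of the gaze-crop preprocessing compared here is shown below, assuming per-frame gaze fixations in pixel coordinates; the 224-pixel window size is an illustrative choice, not the dataset's specification.

import numpy as np

def gaze_crop(frame, gaze_xy, crop=224):
    """Crop a fixed-size window centered on the gaze point (illustrative).

    frame:   (H, W, 3) uint8 frame.
    gaze_xy: (x, y) gaze fixation in pixel coordinates.
    """
    h, w = frame.shape[:2]
    # Clamp the window so it stays fully inside the frame.
    x0 = int(np.clip(gaze_xy[0] - crop // 2, 0, max(w - crop, 0)))
    y0 = int(np.clip(gaze_xy[1] - crop // 2, 0, max(h - crop, 0)))
    return frame[y0:y0 + crop, x0:x0 + crop]

Unlike overlaying a gaze marker on the full frame, cropping discards pixels the driver never attended to, which is consistent with the stronger results reported above.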
Figure. Grad-CAM visualization of the proposed method. At t = 1, the activations are scattered, but as time progresses to t = T, the CAM gradually refines and localizes on important objects. This mirrors human decision-making, which evolves over time.
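For reference, a minimal hook-based Grad-CAM for a video classifier is sketched below; VCBM produces explanations inherently, so Grad-CAM serves only as a qualitative visualization. The hook mechanics and the 5D (B, C, T, H, W) activation layout are assumptions about a generic 3D backbone, not the paper's exact setup.

import torch

def gradcam_video(model, target_layer, clip, class_idx):
    """Grad-CAM over a video clip; returns one heatmap per time step."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    logits = model(clip)                            # clip: (1, C, T, H, W)
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()
    a, g = acts["a"], grads["g"]                    # (1, C', T', H', W')
    weights = g.mean(dim=(2, 3, 4), keepdim=True)   # pool gradients over space-time
    cam = torch.relu((weights * a).sum(dim=1))      # (1, T', H', W')
    return cam / (cam.amax() + 1e-8)                # normalize to [0, 1]

Inspecting cam[0, t] for increasing t reproduces the effect described above: diffuse activations at t = 1 that sharpen onto decision-relevant objects by t = T.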
Citation
@inproceedings{vcbm2025daadx,
  author    = {Mukilan Karuppasamy and Shankar Gangisetty and Shyam Nandan Rai and Carlo Masone and C. V. Jawahar},
  title     = {Towards Safer and Understandable Driver Intention Prediction},
  booktitle = {ICCV},
  year      = {2025},
}
Acknowledgements
This work is supported by iHub-Data and Mobility at IIIT Hyderabad.