
Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback

 

Mohd Hozaifa Khan, Ravi Kiran Sarvadevabhatla

IIIT Hyderabad|CVIT

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

 

[Paper] [CVPR Paper] [Dataset (WIP)] [Demo]

Audio Overview

 

Sketchtopia Teaser Visual 1 - Example Sketch Interaction
Sketchtopia Teaser Visual 2 - Diverse Sketches from Dataset
Sketchtopia Teaser Visual 3 - Asynchronous Communication Example

 


Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs

Sainithin Artham, Avijit Dasgupta, Shankar Gangisetty, C V Jawahar

 

 Figure. Illustration of a driving scenario where the ADAS vehicle predicts a left lane change (what) to avoid slower traffic ahead (why). Existing DIP models lacking reasoning may miss such cues, while our framework jointly learns and distills both maneuver and explanation, improving decision quality.

 

Abstract

 

Predicting a driver's intent (e.g., turns, lane changes) is a critical capability for modern Advanced Driver Assistance Systems (ADAS). While recent Multimodal Large Language Models (MLLMs) show promise in general vision-language tasks, we find that zero-shot MLLMs still lag behind domain-specific approaches for Driver Intention Prediction (DIP). To address this, we introduce DriveXplain, a zero-shot framework based on MLLMs that leverages rich visual cues such as optical flow and road semantics to automatically generate both the intended maneuver (what) and a rich natural language explanation (why). These maneuver-explanation pairs are then distilled into a compact MLLM, which jointly learns to predict intentions and corresponding explanations. We show that incorporating explanations during training leads to substantial gains over models trained solely on labels, as distilling explanations instills reasoning capabilities by enabling the model to understand not only what decisions to make but also why those decisions are made. Comprehensive experiments across structured (Brain4Cars, AIDE) and unstructured (DAAD) datasets demonstrate that our approach achieves state-of-the-art results on the DIP task, outperforming zero-shot and domain-specific baselines. We also present ablation studies to evaluate key design choices in our framework. This work sets a direction for more explainable and generalizable intention prediction in autonomous driving systems. We plan to release our codebase to support research.
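
The abstract describes folding visual cues such as optical flow and road semantics into the MLLM query so that the zero-shot model can name both the maneuver and the reason for it. The sketch below is a minimal illustration of that idea, not the paper's actual pipeline: optical flow is summarized with OpenCV and injected into a text prompt, and query_mllm is a hypothetical stand-in for whichever multimodal model is queried; the cue wording and thresholds are assumptions made for illustration.

import cv2
import numpy as np

def flow_cue(prev_frame: np.ndarray, frame: np.ndarray) -> str:
    """Summarize dominant lateral motion between two frames as a text cue."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mean_dx = float(flow[..., 0].mean())
    if mean_dx > 0.5:
        return "Optical flow: the scene shifts rightward, consistent with a leftward ego maneuver."
    if mean_dx < -0.5:
        return "Optical flow: the scene shifts leftward, consistent with a rightward ego maneuver."
    return "Optical flow: no strong lateral motion."

def build_prompt(motion_cue: str, road_cue: str) -> str:
    return (
        f"{motion_cue}\n{road_cue}\n"
        "Given the driving video, state the driver's intended maneuver (what) "
        "and explain the reason for it (why) in one sentence each."
    )

# Usage sketch (query_mllm is hypothetical and must be supplied by the caller):
# prompt = build_prompt(flow_cue(prev_frame, frame), "Road semantics: two lanes, dashed divider.")
# answer = query_mllm(video_frames, prompt)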

 

 Methodology

 


 

 Figure. Our proposed framework for the DIP task. DriveXplain generates natural language explanations alongside maneuvers, and Explanation Distillation distills these explanations into a single MLLM to enhance DIP performance at inference.

 

 

 

Key Highlights:

 

  • New Task: Understanding why a driver makes a decision is just as important as predicting what they’ll do next. 
  • DriveXplain Model: We introduce a zero-shot framework that enhances MLLMs for ADAS by embedding driving-specific context directly into their reasoning.
  • Knowledge Distillation: To enable real-time, deployable solutions, we distill reasoning and decision-making capabilities from large MLLMs into smaller, efficient models, paving the way for explainable driving intelligence (a minimal distillation sketch follows below).
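
The sketch below illustrates, under loose assumptions, how maneuver-explanation pairs produced by the large teacher could serve as supervised targets for a compact student. A text-only GPT-2 stands in for the compact MLLM, and the records are invented placeholders; the paper's actual student architecture, prompt format, and training recipe are not reproduced here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2")

teacher_outputs = [  # hypothetical maneuver-explanation pairs from the zero-shot teacher
    {"scene": "Ego vehicle approaches slower traffic; the adjacent left lane is clear.",
     "maneuver": "left lane change",
     "explanation": "a slower vehicle ahead and a free left lane make overtaking safe."},
]

optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
student.train()
for ex in teacher_outputs:
    prompt = f"Scene: {ex['scene']}\nPredict the maneuver and explain why.\n"
    target = f"Maneuver: {ex['maneuver']}. Why: {ex['explanation']}"
    enc = tokenizer(prompt + target, return_tensors="pt")
    labels = enc["input_ids"].clone()
    prompt_len = len(tokenizer(prompt)["input_ids"])
    labels[:, :prompt_len] = -100            # supervise only the maneuver+explanation span
    loss = student(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()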

 

 Results

 


 

 Table: DIP benchmark results. Performance comparison of Driving-specific VLMs, General VLMs, Action Anticipation models, and our framework (DriveXplain, ED). Accuracy (Acc.) and F1 (%) on the Brain4Cars, AIDE, and DAAD datasets. Finetune indicates whether the model was fine-tuned (✓) or evaluated in a zero-shot (✗) setting. Bold and underline indicate the best and second-best results.

 


 

 Figure: Qualitative comparison of the proposed framework, zero-shot Qwen2.5-VL, and Dolphins across the Brain4Cars, AIDE, and DAAD datasets. We show maneuver prediction (what) and explanation (why), with attention heatmaps highlighting key regions.

 

 

 

Citation

@inproceedings{vcbm2025daadx,
  author    = {Sainithin Artham and Avijit Dasgupta and Shankar Gangisetty and C. V. Jawahar},
  title     = {Distilling What and Why: Enhancing Driver Intention Prediction with MLLMs},
  booktitle = {},
  series    = {},
  volume    = {},
  pages     = {},
  publisher = {},
  year      = {2025},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

 

IndicDLP : A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing 

ICDAR 2025 (Oral) — 🏆 Best Student Paper Runner-Up Award 

Oikantik Nath1, Sahithi Kukkala1, Mitesh Khapra1, Ravi Kiran Sarvadevabhatla2

IIT Madras1, IIIT Hyderabad2

[Paper]       [Code]     [Dataset]

 

 


(A) Acts & Rules  (B) Brochures   (C) Forms

 


(D) Magazines   (E) Manuals   (F) Newspapers

 

 


(G) Notices   (H) Novels   (I) Question Papers

 


(J) Research Papers   (K) Syllabi   (L) Textbooks

         

Samples from the IndicDLP dataset highlighting its diversity across document formats, domains, languages, and temporal span. For easier visual distinction, segmentation masks are used instead of bounding boxes to highlight regions.
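
The caption notes that regions are shown as segmentation-mask overlays rather than boxes. The sketch below shows one way such overlays can be rendered; the annotation schema (a JSON list of labeled polygons) is an assumption for illustration and not the released IndicDLP format.

import json
from PIL import Image, ImageDraw

PALETTE = {"text": (66, 135, 245, 90), "table": (245, 130, 48, 90),
           "figure": (60, 180, 75, 90)}  # RGBA fills; alpha < 255 keeps the page visible

def overlay_regions(page_path: str, annotation_path: str, out_path: str) -> None:
    """Draw semi-transparent region masks over a document page image."""
    page = Image.open(page_path).convert("RGBA")
    overlay = Image.new("RGBA", page.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    with open(annotation_path) as f:
        regions = json.load(f)  # e.g. [{"label": "text", "polygon": [[x, y], ...]}, ...]
    for region in regions:
        fill = PALETTE.get(region["label"], (128, 128, 128, 90))
        draw.polygon([tuple(p) for p in region["polygon"]], fill=fill)
    Image.alpha_composite(page, overlay).convert("RGB").save(out_path)

# overlay_regions("page.png", "page_regions.json", "page_overlay.png")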

 


 

 

 

The above figure illustrates the contributions of 12 languages (left) and 12 document domains (right) in the IndicDLP dataset. The distribution is fairly balanced across both categories, with no single language or domain overwhelmingly dominating the dataset. This ensures a diverse and well-represented collection.

 


Comparison of modern document layout parsing datasets.

 

Citation

Please cite our paper if you find this dataset or work useful:

@inproceedings{10.1007/978-3-032-04614-7_2,
  author    = {Oikantik Nath and Sahithi Kukkala and Mitesh Khapra and Ravi Kiran Sarvadevabhatla},
  editor    = {Xu-Cheng Yin and Dimosthenis Karatzas and Daniel Lopresti},
  title     = {IndicDLP: A Foundational Dataset for Multi-lingual and Multi-domain Document Layout Parsing},
  booktitle = {Document Analysis and Recognition -- ICDAR 2025},
  year      = {2026},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {23--39},
  isbn      = {978-3-032-04614-7},
  abstract  = {Document layout analysis is essential for downstream tasks such as information retrieval, extraction, OCR, and digitisation. However, existing large-scale datasets like PubLayNet and DocBank lack fine-grained region labels and multilingual diversity, making them insufficient for representing complex document layouts. Human-annotated datasets such as M$^6$Doc and D$^4$LA offer richer labels and greater domain diversity, but are too small to train robust models and lack adequate multilingual coverage. This gap is especially pronounced for Indic documents, which encompass diverse scripts yet remain underrepresented in current datasets, further limiting progress in this space. To address these shortcomings, we introduce IndicDLP, a large-scale foundational document layout dataset spanning 11 representative Indic languages alongside English and 12 common document domains. Additionally, we curate UED-mini, a dataset derived from DocLayNet and M$^6$Doc, to enhance pretraining and provide a solid foundation for Indic layout models. Our experiments demonstrate that fine-tuning existing English models on IndicDLP significantly boosts performance, validating its effectiveness. Moreover, models trained on IndicDLP generalise well beyond Indic layouts, making it a valuable resource for document digitisation. This work bridges gaps in scale, diversity, and annotation granularity, driving inclusive and efficient document understanding.}
}

Acknowledgments

Assamese
Yuvaraj - Superchecker
Rondeep Bordoloi - Reviewer

Ajit Kumar Sarma - Annotator
Anjali Steephan - Annotator
Madhutrishna Chetia - Annotator
Riya Chutia - Annotator
Ruh Ullah Khan - Annotator

Bengali
Praneeth Reddy - Superchecker
Rondeep Bordoloi - Reviewer

Gargi Mukherjee Kolley - Annotator
Madhumita Pal - Annotator
Priyanjana Banerjee - Annotator
Soupat Biswas - Annotator
Sushmita Pal - Annotator


 

 

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding

 

Varun Paturkar, Shankar Gangisetty, C V Jawahar

IIIT Hyderabad

 

[Paper] [Code] [Dataset] [Project Webpage]

 



 Fig: Comparison of traffic contexts and accident statistics across the Global North and South. Top row: four-wheelers dominate in the USA, while two-wheelers dominate in India. Bottom row: distribution of vehicles (two-wheeler vs. four-wheeler) and fatal accidents across the Global North and South.

Abstract

Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we present the MOtorized TwO-wheeler Rider (MOTOR) dataset, the first large-scale, multi-view, multimodal resource dedicated to two-wheelers in dense, unstructured traffic. MOTOR comprises 2,500 sequences (25+ hours) collected from 16 riders and integrates synchronized front, rear, and helmet videos, rider eye-gaze from wearable trackers, on-road audio, and telemetry (GPS, accelerometer, gyroscope). Rich annotations capture traffic context, rider state, 12 riding maneuvers spanning conventional and unconventional behaviors, and legality labels (Legal, Illegal, Unspecified). We benchmark rider behavior recognition and maneuver legality classification using state-of-the-art video action recognition backbones (CNN and Transformer-based), extended with multimodal fusion, and find that combining RGB, gaze, and telemetry consistently yields the best performance. MOTOR thus provides a unique foundation for advancing safety-critical understanding of two-wheeler riding. It offers the research community a benchmark to develop and evaluate models for behavior analysis, legality-aware prediction, and intelligent transportation systems. Dataset and code will be made publicly available.
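
The abstract reports that fusing RGB, gaze, and telemetry gives the best behavior-recognition results. The sketch below is one plausible late-fusion baseline under assumed input shapes (a 3D-CNN clip encoder, GRUs over the gaze and telemetry tracks, and a linear head over the 12 maneuver classes); it is not the benchmarked architecture from the paper, and the telemetry dimensionality is an assumption.

import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class LateFusionRiderClassifier(nn.Module):
    def __init__(self, num_classes: int = 12, gaze_dim: int = 2, telemetry_dim: int = 9):
        super().__init__()
        self.rgb_backbone = r3d_18(weights=None)   # 3D CNN over the clip
        self.rgb_backbone.fc = nn.Identity()       # expose the 512-d clip feature
        self.gaze_encoder = nn.GRU(gaze_dim, 64, batch_first=True)        # eye-gaze track
        self.telemetry_encoder = nn.GRU(telemetry_dim, 64, batch_first=True)  # GPS/IMU track
        self.head = nn.Linear(512 + 64 + 64, num_classes)

    def forward(self, clip, gaze, telemetry):
        rgb_feat = self.rgb_backbone(clip)         # (B, 512)
        _, gaze_h = self.gaze_encoder(gaze)        # (1, B, 64)
        _, tel_h = self.telemetry_encoder(telemetry)
        fused = torch.cat([rgb_feat, gaze_h[-1], tel_h[-1]], dim=1)
        return self.head(fused)

if __name__ == "__main__":
    model = LateFusionRiderClassifier()
    clip = torch.randn(2, 3, 16, 112, 112)         # batch of 16-frame RGB clips
    gaze = torch.randn(2, 16, 2)                   # per-frame (x, y) gaze points
    telemetry = torch.randn(2, 16, 9)              # assumed GPS + accelerometer + gyroscope channels
    print(model(clip, gaze, telemetry).shape)      # -> torch.Size([2, 12])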

 

 The MOTOR Dataset

 Table: Comparison of 4-wheeler and 2-wheeler behavior datasets. Our dataset is unique as it contains multi-modal, multi-view videos from ego-vehicle and helmet, eye gaze, as well as annotated conventional and unconventional behaviors, and legality-related riding scenarios. Note: CRB indicates conventional riding behaviors, and UCRB means unconventional riding behaviors.


 


 Fig: Data samples helmet-view. (a) Ego-rider weaves through dense, slow traffic, overtaking multiple vehicles across lanes. (b) Rider squeezes through a narrow gap between a bus and a car, narrowly avoiding the bus. (c) Rider rides in the wrong lane against dense oncoming traffic, disrupting flow. (d) Rider turns head fully toward a roadside building, diverting gaze from the road amid fast-moving traffic.

 

 Results

 Table: Rider Behavior Classification. Comparison of CNN and Transformer-based baselines on MOTOR dataset across different modality combinations.


 

 Table: Rider Legality Classification: CNN and Transformer-based baselines on MOTOR dataset across different modality combinations.


 

Citation

@inproceedings{motor2026icra,
  author    = {Varun Paturkar and Shankar Gangisetty and C. V. Jawahar},
  title     = {MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behaviour Understanding},
  booktitle = {},
  series    = {},
  volume    = {},
  pages     = {},
  publisher = {},
  year      = {2026},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

PedestrianQA : A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

 

Naman Mishra, Shankar Gangisetty, C V Jawahar

IIIT Hyderabad

 

[Paper] [Code] [Dataset]

 



 Fig: An illustration of an unstructured-traffic scenario where a pedestrian stands in the middle of the road in front of the ego-vehicle, attempting to cross the road. Unlike prior approaches that provide only predictions, our method predicts the intention and trajectory and generates supporting rationales.

Abstract

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision-language models (VLMs) offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as a question-answering task augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions, without needing specialized architectures tailored to each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling. The dataset and model will be made publicly available.
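
The abstract frames intention and trajectory prediction as question answering with structured rationales. The record below is a purely hypothetical illustration of what such a QA-plus-rationale sample could look like; the field names and wording are invented and do not reflect the released PedestrianQA schema.

import json

sample = {
    "video_id": "seq_000123",
    "question": "Will the pedestrian standing near the ego lane cross within the next "
                "2 seconds, and what trajectory will they follow?",
    "answer": {
        "intention": "crossing",
        "trajectory": [[412, 310], [398, 312], [381, 315], [362, 318]],  # future (x, y) pixels
    },
    "rationale": "The pedestrian faces the ego lane, has stepped off the curb, and traffic "
                 "on the near side has slowed, so a crossing toward the median is likely.",
}

print(json.dumps(sample, indent=2))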

 

 PedestrianQA Dataset

 


 Fig: Data generation pipeline. We first aggregate all ground-truth human annotations from the constituent datasets into a unified metadata schema. We then use VLM-generated captions to enrich motion semantics, using carefully designed pedestrian-motion prompts that target fine-grained cues. These captions are validated for format and appended to the metadata. Next, we construct a single instruction package containing: (i) a system prompt, (ii) task definitions for PIP and PTP, (iii) step-by-step guidance for producing structured, fine-grained rationales, (iv) a small set of in-context exemplars, (v) a compliance checklist for high-quality rationale generation, and (vi) the sequence-level metadata tables. This package is provided to the claude-sonnet-4-20250514 LLM API to generate triplets of questions, answers, and rationales.
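
The caption enumerates six components that are packed into a single instruction package and sent to claude-sonnet-4-20250514. The sketch below shows one way that assembly and API call could look using the Anthropic Python SDK; everything other than the model identifier (component contents, field names, token budget) is an assumption made for illustration.

import json
import anthropic

def build_user_package(task_defs, rationale_guide, exemplars, checklist, metadata_rows):
    """Concatenate components (ii)-(vi) from the figure caption into one user message."""
    return "\n\n".join([
        task_defs,                            # (ii) PIP and PTP task definitions
        rationale_guide,                      # (iii) step-by-step rationale guidance
        exemplars,                            # (iv) in-context exemplars
        checklist,                            # (v) compliance checklist
        json.dumps(metadata_rows, indent=2),  # (vi) sequence-level metadata tables
    ])

def generate_qa_triplets(system_prompt, user_package):
    """Send the package to the LLM; component (i) is supplied as the system prompt."""
    client = anthropic.Anthropic()            # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=system_prompt,
        messages=[{"role": "user", "content": user_package}],
    )
    return response.content[0].text           # expected question/answer/rationale triplets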

 

 Results


 

 Table: Rationale evaluation on the combined dataset, with Claude-Sonnet-4. Average scores (0–100) for Spatial Reasoning (SR), Temporal Reasoning (TR), Mathematical Reasoning (MR), Ego-Vehicle Reasoning (EVR), Scene-Context Reasoning (SCR), Final Destination Prediction (FDP), and Conclusion (C). ✓ indicates finetuned models, ✗ zero-shot. Bold shows best score per column; underline marks the second-best. Dolphins generates only a brief conclusion and does not generate category-specific rationales.   


Citation

@inproceedings{pedqa2026icra,
  author    = {Naman Mishra and Shankar Gangisetty and C. V. Jawahar},
  title     = {PedestrianQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction},
  booktitle = {},
  series    = {},
  volume    = {},
  pages     = {},
  publisher = {},
  year      = {2026},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

 

Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta,
C V Jawahar

IIIT Hyderabad

 

[Paper] [Code] [Dataset]

 



 Fig: Previous works in driving scenarios primarily address risk perception but fall short of offering actionable safety guidance. Similarly, general-purpose MLLMs (Qwen2.5-VL, LLaVA-NeXT, VideoLLaMA3) are still unreliable in this regard. In contrast, our approach, DriveSafe, integrates risk assessment with clear, human-understandable safety suggestions.

Abstract

Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. We plan to release our codebase to support future research.
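
The abstract mentions fine-tuning a lightweight adapter module on caption-risk pairings. The sketch below is a minimal illustration of adapter-style tuning on such pairs; the abstract does not name the adapter type, so LoRA via the peft library, the base checkpoint, the example pair, and the hyperparameters are assumptions made for illustration, not the authors' configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-0.5B"                        # hypothetical small base LLM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)                # only adapter weights are trainable

pairs = [  # invented caption-risk examples standing in for the real training data
    ("Caption: a cyclist merges from the right 8 m ahead while ego moves at 30 km/h.",
     "Risk: cyclist ahead-right, cutting in. Suggestion: slow down and yield."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for caption, risk in pairs:
    enc = tokenizer(caption + "\n" + risk, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"])    # causal LM loss over caption+risk text
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()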

 

 Methodology


 Fig: Our proposed DriveSafe framework for the caption generation and safety suggestion task in driving. We first derive contextual cues to guide caption generation, and then use the resulting captions for risk assessment and safety suggestion.

 

 Results


 Table: Performance comparison of Caption Generation and Risky Object Grounding across Existing Methods, General VLMs, and DriveSafe on the DRAMA dataset. 

 


 Fig: Qualitative comparison of DriveSafe-ZeroShot, DriveSafe-Finetuned, and Qwen2.5-VL on three driving scenarios from the DRAMA dataset. Risky object grounding is shown with bounding boxes matched to the respective models' text highlighting, while generated captions and safety suggestions are marked as correct (green) or incorrect (red).

 

Citation

@inproceedings{drivesafe2026icra,
  author    = {Sainithin Artham and Shankar Gangisetty and Avijit Dasgupta and C. V. Jawahar},
  title     = {DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios},
  booktitle = {},
  series    = {},
  volume    = {},
  pages     = {},
  publisher = {},
  year      = {2026},
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 
