Visual Place Recognition in Unstructured Driving Environments

Utkarsh Rai1, Shankar Gangisetty1, A. H. Abdul Hafez1,

Anbumani Subramanian1, C V Jawahar1

1IIIT Hyderabad, India

 

[ Paper ]     [ Code ]     [ Dataset ] 


Fig. 1: Illustration of the challenges encountered for visual place recognition across the three routes of our unstructured driving VPR dataset, including occlusions, changes in traffic density, viewpoint changes, and variations in illumination.

Abstract

The problem of determining geolocation from visual inputs, known as Visual Place Recognition (VPR), has attracted significant attention in recent years owing to its potential applications in autonomous driving systems. This rising interest poses unique challenges, particularly the need for datasets covering unstructured environmental conditions so that robust VPR methods can be developed. In this paper, we address these challenges by proposing an Indian driving VPR dataset that captures the semantic diversity of unstructured driving environments, such as occlusions caused by dynamic surroundings, variations in traffic density, viewpoint changes, and varying illumination. In unstructured driving environments, GPS signals are unreliable, often preventing the vehicle from accurately determining its location. To address this, we develop an interactive image-to-image tagging annotation tool for producing ground-truth annotations for VPR training on large datasets. Evaluation of state-of-the-art methods on our dataset shows a significant performance drop of up to 15% compared to their performance on standard VPR datasets. We also provide an exhaustive quantitative and qualitative experimental analysis of frontal-view, multi-view, and sequence-matching methods. We believe that our dataset will open new challenges for the VPR research community to build robust models. The dataset, code, and tool will be released upon acceptance.

The IDD-VPR dataset

Data Capture and Collection

Table 1: Comparison of datasets for visual place recognition. Total length is the coverage multiplied by the number of times each route was traversed. Time span is from the first recording of a route to the last recording.


Fig. 2: Data collection map for the three routes. The map shows the actual routes taken (blue), superimposed with the maximum GPS drift due to signal loss (red dashed lines). This GPS inconsistency required manual correction.

Data Annotation

During data capture, obtaining consistent and error-free GPS readings for all three route traversals was challenging, as shown in Fig. 2. Through our image-to-image tagging annotation process, we ensured that each location is tagged with the appropriate GPS reading, maintaining a mean error of less than 10 meters. To support this, we developed an image-to-image matching annotation tool, presented in Fig. 3.
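To make the annotation workflow concrete, the following is a minimal Python sketch of the GPS tag propagation that image-to-image matching enables, together with a mean-error check against surveyed coordinates. The Frame record, the haversine_m helper, and the field names are illustrative assumptions, not part of the released tool.

from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in metres between two GPS coordinates.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

@dataclass
class Frame:
    image_path: str
    lat: float = None
    lon: float = None

def propagate_gps(query: Frame, matched_reference: Frame) -> Frame:
    # The query frame inherits the (corrected) GPS tag of the reference
    # frame chosen by the annotator in the tool.
    query.lat, query.lon = matched_reference.lat, matched_reference.lon
    return query

def mean_error_m(tagged, surveyed):
    # Mean haversine error (metres) between propagated and surveyed GPS;
    # the paper reports keeping this below 10 m.
    errs = [haversine_m(f.lat, f.lon, g.lat, g.lon) for f, g in zip(tagged, surveyed)]
    return sum(errs) / len(errs)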

 


 

Fig. 3: Image-to-image annotation tool used by annotators to match (query, reference) pairs and assign GPS tags.


Fig. 4: Data capture span. Left: distribution across months. Right: diversity of samples across weather conditions, including overcast (Sep’23, Oct’23), winter (Dec’23, Jan’24), and spring (Feb’24).

 

 

 

Results

Frontal-View Place Recognition

Table 2: Evaluation of baselines on frontal-view datasets, including IDD-VPR. We report overall recall@1, split by backbone, with a descriptor dimension of 4096-D.
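For context, recall@1 here is retrieval-based: a query is counted as correct if its nearest reference descriptor lies within a localization radius of the query's GPS position. Below is a minimal NumPy sketch of recall@N under that protocol; the 25 m radius is a common VPR convention assumed for illustration, not a value quoted from the table.

import numpy as np

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance in metres; accepts scalars or NumPy arrays.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000.0 * np.arcsin(np.sqrt(a))

def recall_at_n(query_desc, ref_desc, query_gps, ref_gps, n=1, radius_m=25.0):
    # query_desc: (Q, D), ref_desc: (R, D) global descriptors (e.g., 4096-D);
    # query_gps: (Q, 2), ref_gps: (R, 2) latitude/longitude pairs.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    r = ref_desc / np.linalg.norm(ref_desc, axis=1, keepdims=True)
    top_n = np.argsort(-q @ r.T, axis=1)[:, :n]    # indices of the n nearest references
    hits = 0
    for qi, cands in enumerate(top_n):
        d = haversine_m(query_gps[qi, 0], query_gps[qi, 1], ref_gps[cands, 0], ref_gps[cands, 1])
        hits += bool((d <= radius_m).any())
    return hits / len(query_desc)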


 

Multi-View Place Recognition

Table 3: Evaluation of baselines on multi-view datasets, including IDD-VPR. We report overall recall@1, split by backbone, with a descriptor dimension of 4096-D.


 


Fig. 5: Qualitative comparison of baselines on our dataset. The first column shows query images exhibiting unstructured driving challenges, and the subsequent columns show the images retrieved by each method. Green: true positive; Red: false positive.

Citation

@inproceedings{idd2024vpr,
  author       = {Utkarsh Rai and Shankar Gangisetty and A. H. Abdul Hafez and
                  Anbumani Subramanian and C. V. Jawahar},
  title        = {Visual Place Recognition in Unstructured Driving Environments},
  booktitle    = {IROS},
  pages        = {10724--10731},
  publisher    = {IEEE},
  year         = {2024}
}

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

Early Anticipation of Driving Maneuvers

Abdul Wasi1, Shankar Gangisetty1, Shyam Nandan Rai2, C V Jawahar1

1IIIT Hyderabad, India and 2Politecnico di Torino, Italy

[Paper]       [Code]     [Dataset]

 


Fig.: Overview of the DAAD dataset for the ADM task. Left: previous datasets contain maneuver videos only from their initiation to their execution (DM), whereas our DAAD dataset features longer video sequences that provide prior context (BM), which proves beneficial for early maneuver anticipation. Right: illustration of the multiple views and modalities (gaze through the egocentric view) in DAAD for ADM.

Abstract

Prior works have addressed the problem of driver intention prediction (DIP), identifying maneuvers after their onset. Early anticipation, however, is equally important in scenarios that demand a preemptive response before a maneuver begins. Yet no prior work addresses driver action anticipation before the onset of the maneuver, limiting the ability of advanced driver assistance systems (ADAS) to anticipate maneuvers early. In this work, we introduce Anticipating Driving Maneuvers (ADM), a new task that enables driver action anticipation before the onset of the maneuver. To initiate research on this task, we curate the Driving Action Anticipation Dataset (DAAD), which is multi-view (in-cabin and out-of-cabin views in dense, heterogeneous scenarios) and multimodal (egocentric view and gaze information). The dataset captures sequences both before the initiation and during the execution of a maneuver. During dataset collection, we also ensure wide diversity in traffic scenarios, weather and illumination, and driveway conditions. Next, we propose a strong baseline based on a transformer architecture to effectively model multiple views and modalities over longer video lengths. We benchmark existing DIP methods on DAAD and related datasets. Finally, we perform an ablation study showing the effectiveness of multiple views and modalities for maneuver anticipation.
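As a rough illustration of the kind of multi-view, multimodal fusion such a transformer baseline performs, here is a minimal PyTorch sketch: per-frame features from each view/modality receive view and time embeddings, are flattened into one token sequence, and a class token is read out for maneuver classification. The layer sizes, the number of maneuver classes, and the class name are assumptions for illustration; refer to the paper and released code for the actual M2MVT model.

import torch
import torch.nn as nn

class M2MVTSketch(nn.Module):
    def __init__(self, feat_dim=768, num_views=6, num_classes=5,
                 depth=4, heads=8, max_frames=64):
        super().__init__()
        # num_views=6 mirrors the released views (driver, front, gaze, left,
        # rear, right); num_classes=5 is a placeholder maneuver count.
        self.view_embed = nn.Parameter(torch.zeros(num_views, feat_dim))
        self.time_embed = nn.Parameter(torch.zeros(max_frames, feat_dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):
        # feats: (B, V, T, D) per-frame features from a per-view/modality backbone.
        b, v, t, d = feats.shape
        x = feats + self.view_embed[None, :, None, :] + self.time_embed[None, None, :t, :]
        x = x.reshape(b, v * t, d)                        # flatten views over time
        x = torch.cat([self.cls_token.expand(b, -1, -1), x], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])                         # maneuver logits from the class token

A per-view video backbone (e.g., a ViT) would produce the (B, V, T, D) features consumed here.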

The DAAD dataset

 


 Fig.: Data samples. DAAD in comparison to Brain4Cars, VIENA2 and HDD datasets. DAAD exhibits great diversity in various driving conditions (traffic density, day/night, weather, type of routes) across different driving maneuvers. 

Results

 


Fig.: Effect of time-to-maneuver. (a) Accuracy over time for different driving datasets on CEMFormer (with ViT encoder). We conducted three separate experiments for the DAAD dataset. (i) DAAD-DM: Training and testing only on the maneuver sequences (DM). (ii) DAAD-Full: Training and testing on the whole video. (iii) DAAD-BM: Training on a portion of the video captured before the onset of maneuver (BM) and testing on the whole video; (b) Accuracy over time for our dataset on ViT, MViT, MViTv2 encoders, and the proposed method (M2MVT). Here, t is the time of onset of the maneuver.
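A minimal sketch of how such an accuracy-versus-time-to-maneuver curve can be produced: at each offset relative to the onset t, the model only sees the frames observed up to that point. The video fields and the model.predict call are a hypothetical interface, not the released evaluation code.

def anticipation_curve(model, videos, offsets_s=(-5, -4, -3, -2, -1, 0), fps=30):
    # Accuracy at each time offset (in seconds) relative to the maneuver onset t.
    accs = {}
    for off in offsets_s:
        correct = 0
        for vid in videos:
            cutoff = int((vid.onset_s + off) * fps)            # frames visible so far
            pred = model.predict(vid.frames[:max(cutoff, 1)])  # hypothetical API
            correct += int(pred == vid.maneuver_label)
        accs[off] = correct / len(videos)
    return accs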

Dataset

  • Driver View
  • Front View
  • Gaze View
  • Left View
  • Rear View
  • Right View

Citation

@inproceedings{adm2024daad,
  author       = {Abdul Wasi and Shankar Gangisetty and Shyam Nandan Rai and C. V. Jawahar},
  title        = {Early Anticipation of Driving Maneuvers},
  booktitle    = {ECCV (70)},
  series       = {Lecture Notes in Computer Science},
  volume       = {15128},
  pages        = {152--169},
  publisher    = {Springer},
  year         = {2024}
}

 

Acknowledgements

This work is supported by iHub-Data and Mobility at IIIT Hyderabad.

 

 


IndicSTR12: A Dataset for Indic Scene Text Recognition

Harsh Lunia, Ajoy Mondal, C V Jawahar

[Paper]     [Code]     [Synthetic Dataset]     [Real Dataset]     [Poster]

ICDAR, 2023

 

 

Abstract


Samples from IndicSTR12 Dataset: Real word-images (left); Synthetic word-images (right)

 

We present a comprehensive dataset covering 12 major Indian languages: Assamese, Bengali, Odia, Marathi, Hindi, Kannada, Urdu, Telugu, Malayalam, Tamil, Gujarati, and Punjabi. The dataset consists of real word images, with a minimum of 1,000 images per language, accompanied by their corresponding labels in Unicode. It can serve various purposes such as script identification, scene text detection, and recognition.

We employed a web crawling approach to assemble this dataset, gathering images from Google Images through targeted keyword-based searches. Our methodology ensured coverage of diverse everyday scenarios where Indic-language text is commonly encountered, such as wall paintings, railway stations, signboards, shop nameplates, temples, mosques and gurudwaras, advertisements, banners, political protests, and house plates. Since the images were sourced from a search engine, they originate from varied contexts and capture conditions.

The curated dataset encompasses images with different characteristics, such as blurriness, non-iconic or iconic text, low resolution, occlusions, curved text, and perspective projections resulting from non-frontal viewpoints. This diversity in image attributes adds to the dataset's realism and utility for various research and application domains.

Additionally, we introduce a synthetic dataset specifically designed for 13 Indian languages (including Manipuri - Meitei Script). This dataset aims to advance the field of Scene Text Recognition (STR) by enabling research and development in the area of multi-lingual STR. In essence, this synthetic dataset serves a similar purpose as the well-known SynthText and MJSynth datasets, providing valuable resources for training and evaluating text recognition models.
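A minimal sketch of how such synthetic word images can be rendered with Pillow, assuming a vocabulary of Unicode words and .ttf fonts covering each script (the font path below is a placeholder). Note that correct shaping of conjuncts and matras generally requires Pillow built with libraqm; real pipelines also add varied backgrounds, colours, blur, and perspective distortion.

import random
from PIL import Image, ImageDraw, ImageFont

def render_word(word, font_path, out_path, font_size=48, pad=10):
    # Render a single Unicode word onto a plain light background.
    font = ImageFont.truetype(font_path, font_size)
    left, top, right, bottom = font.getbbox(word)   # tight bounds of the rendered text
    w, h = right - left + 2 * pad, bottom - top + 2 * pad
    img = Image.new("RGB", (w, h), color=(random.randint(180, 255),) * 3)
    ImageDraw.Draw(img).text((pad - left, pad - top), word, font=font, fill=(0, 0, 0))
    img.save(out_path)

# Example (placeholder font path):
# render_word("ગુજરાત", "fonts/NotoSansGujarati-Regular.ttf", "0001.jpg")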

Benchmarking Approach


PARSeq architecture. [B] and [P] denote the beginning-of-sequence and padding tokens, respectively. T = 30, i.e., 30 distinct position tokens. L_CE denotes the cross-entropy loss.

For the IndicSTR12 dataset, three models were selected for benchmarking the performance of Scene Text Recognition (STR) on 12 Indian languages. These models are as follows:

PARSeq: This model is the current state-of-the-art for Latin STR and it achieves high accuracy.

CRNN: Despite having lower accuracy compared to many current models, CRNN is widely adopted by the STR community for practical purposes due to its lightweight nature and fast processing speed.

STARNet: This model excels at extracting robust features from word-images and includes an initial distortion correction step on top of CRNN architecture. It has been chosen for benchmarking to maintain consistency with previous research on Indic STR.

These three models were specifically chosen to evaluate and compare their performance on the IndicSTR12 dataset, enabling researchers to assess the effectiveness of various STR approaches on the Indian languages included in the dataset.
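Performance in such comparisons is typically reported as Word Recognition Rate (WRR) and Character Recognition Rate (CRR, based on edit distance). A minimal sketch of both metrics follows, assuming predictions and labels are Unicode strings; the exact definitions used for IndicSTR12 should be checked against the paper.

def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def wrr(preds, labels):
    # Fraction of word images whose prediction matches the label exactly.
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def crr(preds, labels):
    # 1 minus the total character-level edit errors over the total label length.
    total = sum(len(l) for l in labels)
    errors = sum(edit_distance(p, l) for p, l in zip(preds, labels))
    return 1.0 - errors / total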


STARNet model (left) and CRNN model (right)

Results

Synthetic performance

Real performance

Dataset


IndicSTR12 Dataset: Font Variations for the same word - Gujarati or Gujarat
 


IndicSTR12 Dataset Variations, clockwise from Top-Left: Illumination variation, Low Resolution, Multi-Oriented - Irregular Text, Variation in Text Length, Perspective Text, and Occluded.
 

Citation

 @inproceedings{lunia2023indicstr12,
  title={IndicSTR12: A Dataset for Indic Scene Text Recognition},
  author={Lunia, Harsh and Mondal, Ajoy and Jawahar, CV},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={233--250},
  year={2023},
  organization={Springer}
}

Acknowledgements

This work is supported by MeitY, Government of India, through the NLTM-Bhashini project.

 

 

Reading Between the Lanes: Text VideoQA on the Road

George Tom, Minesh Mathew, Sergi Garcia, Dimosthenis Karatzas, C.V. Jawahar

Center for Visual Information Technology (CVIT), IIIT Hyderabad
Computer Vision Center (CVC), UAB, Spain
AllRead Machine Learning Technologies

ICDAR, 2023

[ Paper ] [ Dataset ]

 


Abstract

 

Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, as textual cues typically appear for only a short time span and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over time. To address this, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of 3,222 driving videos collected from multiple countries, annotated with 10,500 questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on RoadTextVQA, highlighting the significant room for improvement in this domain and the usefulness of the dataset in advancing research on in-vehicle support systems and text-aware multimodal question answering.
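Text-aware QA benchmarks of this kind are commonly scored with plain accuracy and with Average Normalized Levenshtein Similarity (ANLS); whether RoadTextVQA reports one or both should be checked against the paper. A minimal sketch of ANLS, with its own small Levenshtein helper:

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(pred, gt_answers, tau=0.5):
    # Best normalized similarity against any ground-truth answer,
    # zeroed out below the threshold tau (0.5 is the usual choice).
    best = 0.0
    for gt in gt_answers:
        p, g = pred.lower().strip(), gt.lower().strip()
        s = 1.0 - levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, s)
    return best if best >= tau else 0.0

def dataset_anls(preds, gt_lists, tau=0.5):
    # Mean ANLS over the dataset; gt_lists holds the list of accepted answers per question.
    return sum(anls(p, g, tau) for p, g in zip(preds, gt_lists)) / len(preds)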

 

 

