
Early Anticipation of Driving Maneuvers

Abdul Wasi1, Shankar Gangisetty1, Shyam Nandan Rai2, C V Jawahar1

IIIT Hyderabad1, India and Politecnico di Torino2, Italy

[Paper]     [Code]     [Dataset]

 


Fig.: Overview of the DAAD dataset for the ADM task. Left: previous datasets contain maneuver videos from their initiation to their execution (DM), whereas our DAAD dataset features longer video sequences that provide prior context (BM), which proves beneficial for early maneuver anticipation. Right: the multi-view and multimodal nature of DAAD for ADM (gaze is captured through the egocentric view).

Abstract

Prior works have addressed the problem of driver intention prediction (DIP) to identify maneuvers after their onset. Early anticipation, however, is equally important in scenarios that demand a preemptive response before a maneuver begins. No prior work addresses driver action anticipation before the onset of the maneuver, limiting the ability of advanced driver assistance systems (ADAS) to anticipate maneuvers early. In this work, we introduce Anticipating Driving Maneuvers (ADM), a new task that enables driver action anticipation before the onset of the maneuver. To initiate research on this new task, we curate the Driving Action Anticipation Dataset (DAAD), which is multi-view (in- and out-cabin views in dense and heterogeneous scenarios) and multimodal (egocentric view and gaze information). The dataset captures sequences both before the initiation and during the execution of a maneuver. During dataset collection, we also ensure wide diversity in traffic scenarios, weather and illumination, and driveway conditions. Next, we propose a strong baseline based on a transformer architecture to effectively model multiple views and modalities over longer video lengths. We benchmark the existing DIP methods on DAAD and related datasets. Finally, we perform an ablation study showing the effectiveness of multiple views and modalities in maneuver anticipation.
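The transformer baseline is only summarized above; as a rough illustration of how several camera views and modalities can be fused over time with a transformer, here is a minimal PyTorch sketch. The module name, the dimensions, and the token-level late-fusion design are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    # Fuses per-view clip features (e.g. from a ViT/MViT backbone) with a
    # transformer encoder and classifies the upcoming maneuver.
    def __init__(self, num_views=6, feat_dim=768, num_classes=5, depth=4):
        super().__init__()
        # One learned embedding per view/modality so the fusion
        # transformer can tell the token streams apart.
        self.view_embed = nn.Parameter(torch.zeros(num_views, 1, feat_dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, view_tokens):
        # view_tokens: list of per-view tensors, each (batch, time, feat_dim).
        tagged = [t + self.view_embed[i] for i, t in enumerate(view_tokens)]
        x = torch.cat(tagged, dim=1)                    # (B, views*T, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)  # (B, 1, D)
        x = self.fusion(torch.cat([cls, x], dim=1))
        return self.head(x[:, 0])                       # maneuver logits

# Six views of a 16-frame clip, batch of 2:
logits = MultiViewFusion()([torch.randn(2, 16, 768) for _ in range(6)])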

The DAAD dataset

 


 Fig.: Data samples. DAAD in comparison to Brain4Cars, VIENA2 and HDD datasets. DAAD exhibits great diversity in various driving conditions (traffic density, day/night, weather, type of routes) across different driving maneuvers. 

Results

 


Fig.: Effect of time-to-maneuver. (a) Accuracy over time for different driving datasets on CEMFormer (with ViT encoder). We conducted three separate experiments for the DAAD dataset: (i) DAAD-DM: training and testing only on the maneuver sequences (DM); (ii) DAAD-Full: training and testing on the whole video; (iii) DAAD-BM: training on the portion of the video captured before the onset of the maneuver (BM) and testing on the whole video. (b) Accuracy over time for our dataset with ViT, MViT, and MViTv2 encoders, and the proposed method (M2MVT). Here, t is the time of onset of the maneuver.
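The curves above plot anticipation accuracy as a function of time-to-maneuver. A minimal sketch of how such a curve can be computed follows; the record layout (seconds before onset t, prediction, ground truth) is a hypothetical convenience, not the paper's evaluation code.

from collections import defaultdict

def accuracy_vs_tta(records, bin_size=1.0):
    # records: iterable of (seconds_before_onset, predicted, ground_truth).
    hits, totals = defaultdict(int), defaultdict(int)
    for tta, pred, label in records:
        b = int(tta // bin_size)        # e.g. bin 3 = 3-4 s before onset
        totals[b] += 1
        hits[b] += int(pred == label)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

curve = accuracy_vs_tta([(3.2, "left_turn", "left_turn"),
                         (0.4, "left_turn", "lane_change")])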

Dataset

  • Driver View
  • Front View
  • Gaze View
  • Left View
  • Rear View
  • Right View

Citation

@inproceedings{adm2024daad,
  author       = {Abdul Wasi and Shankar Gangisetty and Shyam Nandan Rai and C. V. Jawahar},
  title        = {Early Anticipation of Driving Maneuvers},
  booktitle    = {ECCV (70)},
  series       = {Lecture Notes in Computer Science},
  volume       = {15128},
  pages        = {152--169},
  publisher    = {Springer},
  year         = {2024}
}

 

Acknowledgments

This work is supported by iHub-Data and mobility at IIIT Hyderabad.

 

 


IndicSTR12: A Dataset for Indic Scene Text Recognition

Harsh Lunia, Ajoy Mondal, C V Jawahar

[Paper]     [Code]     [Synthetic Dataset]     [Real Dataset]     [Poster]

ICDAR, 2023

 

 

Abstract


Samples from IndicSTR12 Dataset: Real word-images (left); Synthetic word-images (right)

 

We present a comprehensive dataset covering 12 major Indian languages: Assamese, Bengali, Odia, Marathi, Hindi, Kannada, Urdu, Telugu, Malayalam, Tamil, Gujarati, and Punjabi. The dataset consists of real word images, with a minimum of 1,000 images per language, accompanied by corresponding labels in Unicode. It can serve various purposes, such as script identification, scene text detection, and recognition.

We employed a web-crawling approach to assemble this dataset, gathering images from Google Images through targeted keyword-based searches. Our methodology ensured coverage of diverse everyday scenarios where Indic-language text is commonly encountered: wall paintings, railway stations, signboards, shop nameplates, temples, mosques and gurudwaras, advertisements, banners, political protests, and house plates. Since the images were sourced from a search engine, they originate from varied contexts and capture a wide range of acquisition conditions.

The curated dataset encompasses images with different characteristics, such as blurriness, iconic and non-iconic text, low resolution, occlusion, curved text, and perspective distortion resulting from non-frontal viewpoints. This diversity in image attributes adds to the dataset's realism and its utility across research and application domains.

Additionally, we introduce a synthetic dataset designed for 13 Indian languages (including Manipuri in the Meitei script). This dataset aims to advance scene text recognition (STR) by enabling research and development in multilingual STR. In essence, it serves a purpose similar to that of the well-known SynthText and MJSynth datasets, providing a valuable resource for training and evaluating text recognition models.

Benchmarking Approach


PARSeq architecture. [B] and [P] denote the beginning-of-sequence and padding tokens, T = 30 is the number of distinct position tokens, and L_CE denotes the cross-entropy loss.

For the IndicSTR12 dataset, three models were selected for benchmarking the performance of Scene Text Recognition (STR) on 12 Indian languages. These models are as follows:

PARSeq: This model is the current state of the art for Latin STR and achieves high accuracy.

CRNN: Despite its lower accuracy compared to more recent models, CRNN is widely adopted by the STR community for practical purposes due to its lightweight architecture and fast inference.

STARNet: This model excels at extracting robust features from word images and adds an initial distortion-correction step on top of the CRNN architecture. It was chosen for benchmarking to maintain consistency with previous research on Indic STR.

These three models were specifically chosen to evaluate and compare their performance on the IndicSTR12 dataset, enabling researchers to assess the effectiveness of various STR approaches on the Indian languages included in the dataset.
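Benchmarks of this kind are typically scored with word recognition rate (WRR, exact match) and character recognition rate (CRR, based on edit distance). The sketch below shows both metrics; whether IndicSTR12 applies Unicode normalization before matching is an assumption, with NFC shown as one sensible choice for Indic scripts.

import unicodedata

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def wrr_crr(preds, labels):
    word_hits, char_score = 0, 0.0
    for p, g in zip(preds, labels):
        p, g = (unicodedata.normalize("NFC", s) for s in (p, g))
        word_hits += int(p == g)
        char_score += 1 - edit_distance(p, g) / max(len(p), len(g), 1)
    n = len(labels)
    return word_hits / n, char_score / n  # (WRR, CRR)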


STARNet model (left) and CRNN model (right)

Results

Synthetic performance

Real performance

Dataset


IndicSTR12 Dataset: Font Variations for the same word - Gujarati or Gujarat
 


IndicSTR12 Dataset Variations, clockwise from Top-Left: Illumination variation, Low Resolution, Multi-Oriented - Irregular Text, Variation in Text Length, Perspective Text, and Occluded.
 

Citation

@inproceedings{lunia2023indicstr12,
  title        = {IndicSTR12: A Dataset for Indic Scene Text Recognition},
  author       = {Lunia, Harsh and Mondal, Ajoy and Jawahar, C. V.},
  booktitle    = {International Conference on Document Analysis and Recognition},
  pages        = {233--250},
  year         = {2023},
  organization = {Springer}
}

Acknowledgements

This work is supported by MeitY, Government of India, through the NLTM-Bhashini project.

 

 

Reading Between the Lanes: Text VideoQA on the Road

George Tom, Minesh Mathew, Sergi Garcia, Dimosthenis Karatzas, C.V. Jawahar

Center for Visual Information Technology (CVIT), IIIT Hyderabad
Computer Vision Center (CVC), UAB, Spain
AllRead Machine Learning Technologies

ICDAR, 2023

[Paper]     [Dataset]

 


Abstract

 

Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, as textual cues typically appear only for a short time span and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over them across time. To address this, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of 3,222 driving videos collected from multiple countries, annotated with 10,500 questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on RoadTextVQA, highlighting the significant room for improvement in this domain and the usefulness of the dataset in advancing research on in-vehicle support systems and text-aware multimodal question answering.

 

 


Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale

 

Aditya Agarwal*1, Bipasha Sen*1, Rudrabha Mukhopadhyay1, Vinay P Namboodiri2, C V Jawahar1

1IIIT Hyderabad, India
2University of Bath, UK

* indicates equal contribution

 

[Paper]      [Video]

 

Abstract


Many people with some form of hearing loss consider lipreading their primary mode of day-to-day communication. However, finding resources to learn or improve one's lipreading skills can be challenging, a difficulty further exacerbated during the COVID-19 pandemic by restrictions on direct interaction with peers and speech therapists. Today, online MOOC platforms like Coursera and Udemy have become the most effective form of training for many types of skill development. However, online lipreading resources are scarce, as creating them is an extensive process needing months of manual effort to record hired actors. Because of this manual pipeline, such platforms are also limited in vocabulary, supported languages, accents, and speakers, and have a high usage cost. In this work, we investigate the possibility of replacing real human talking videos with synthetically generated ones. Synthetic data can easily incorporate larger vocabularies, variations in accent, local languages, and many speakers. We propose an end-to-end automated pipeline to develop such a platform using state-of-the-art talking-head video generator networks, text-to-speech models, and computer vision techniques. We then perform an extensive human evaluation using carefully designed lipreading exercises to validate the quality of our platform against existing lipreading platforms. Our studies concretely point toward the potential of our approach in developing a large-scale lipreading MOOC platform that can impact millions of people with hearing loss.

 

Overview


Lipreading is a primary mode of communication for people with hearing loss. However, learning to lipread is not an easy task! For people without hearing disabilities, lipreading can be thought of as analogous to learning a new language. People needing this skill undergo formal education in special schools with medically trained speech therapists. Other resources, like daily interactions, also help them understand and decipher language solely from lip movements. However, these resources are highly constrained and inadequate for many people with hearing disabilities. We envision a MOOC platform for LipReading Training (LRT) for the hearing impaired, and propose a novel approach to automatically generate a large-scale database for developing it. We use SOTA text-to-speech (TTS) models and talking-head generators like Wav2Lip to generate training examples automatically. Wav2Lip requires a driving face video and a driving speech segment (generated from TTS in our case) to produce a talking-head video lip-synced to the driving speech. It preserves the head pose, background, identity, and distance of the person from the camera while modifying only the lip movements. Our approach can exponentially increase the amount of online content on LRT platforms in an automated and cost-effective manner, and can seamlessly grow the vocabulary and the number of speakers in the database.
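To give a concrete sense of the automation, here is a minimal sketch of such a generation loop: synthesize speech for each vocabulary word, then run Wav2Lip's public inference script to produce the lip-synced clip. synthesize_speech() is a hypothetical placeholder for whichever TTS model is plugged in, and the checkpoint and paths are illustrative.

import subprocess
from pathlib import Path

def synthesize_speech(text: str, wav_path: Path) -> None:
    # Placeholder: plug in any TTS model that writes `text` as speech
    # to wav_path.
    raise NotImplementedError

def generate_exercise_videos(words, face_video, ckpt, out_dir="exercises"):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for word in words:
        wav = out / f"{word}.wav"
        synthesize_speech(word, wav)      # TTS: text -> driving speech
        subprocess.run([                  # Wav2Lip: speech -> lip-synced video
            "python", "inference.py",
            "--checkpoint_path", ckpt,
            "--face", face_video,
            "--audio", str(wav),
            "--outfile", str(out / f"{word}.mp4"),
        ], check=True)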

 

Test Design


Lipreading is an involved process of recognizing speech from visual cues: the shapes formed by the lips, teeth, and tongue. A lipreader may also rely on several other factors, such as the context of the conversation, familiarity with the speaker, vocabulary, and accent. Thus, taking inspiration from lipreading.org and readourlips.ca, we define three lipreading protocols for a user study evaluating the viability of our platform: (1) lipreading isolated words (WL), (2) lipreading sentences with context (SL), and (3) lipreading missing words in sentences (MWIS). These protocols rely on a lipreader's vocabulary and on the role semantic context plays in a person's ability to lipread. In word-level (WL) lipreading, the user is presented with a video of an isolated word spoken by a talking head, along with multiple choices containing one correct answer. When a video plays on the screen, the user must respond by selecting a single option. Visually similar words (homophenes) are placed among the choices to increase the difficulty of the task, which can be raised further by testing harder words. In sentence-level (SL) lipreading, the user is presented with (1) videos of talking heads speaking entire sentences and (2) the context of each sentence. The context acts as an additional cue to the mouthing of the sentence and is meant to simulate practical conversations in a given setting. In lipreading missing words in sentences (MWIS), participants watch videos of sentences spoken by a talking head with one word masked. Unlike SL, no additional sentence context is provided. Because homophenes make lip movements an ambiguous source of information, this exercise requires using the context of the sentence itself to disambiguate between multiple possibilities and guess the correct answer. One possible representation of a quiz item under these protocols is sketched below.
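A minimal sketch of how one multiple-choice item under these protocols might be represented; the protocol tags and fields follow the description above, while the names and example values are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class LipreadingItem:
    protocol: str        # "WL", "SL", or "MWIS"
    video: str           # path to the talking-head clip
    answer: str          # the spoken (or masked) word or sentence
    distractors: list    # visually similar homophenes, e.g. "pat"/"bat"/"mat"
    context: str = ""    # conversation context, used only for SL items

item = LipreadingItem(protocol="WL", video="clips/word_0042.mp4",
                      answer="bat", distractors=["pat", "mat"])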

 

User Study


We conduct a statistical analysis to verify: (T1) whether the lipreading performance of users remains comparable across real videos and synthetic videos generated by our pipeline, which would validate the viability of our proposed pipeline as an alternative to existing online lipreading training platforms; and (T2) whether users are more comfortable lipreading in their native accent/language than in a foreign one, which would validate the need for bootstrapping lipreading training platforms in multiple languages and accents across the globe. The mean user performance on the three lipreading protocols is shown with standard errors of the mean. A sketch of the T1 comparison appears below.
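A minimal sketch of the T1 comparison: per-user accuracy summarized as mean and standard error of the mean for real versus synthetic videos, with an unpaired t-test. The specific test and the numbers are illustrative assumptions, not the study's reported statistics.

import statistics
from scipy import stats

def summarize(scores):
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5  # standard error
    return mean, sem

real = [0.62, 0.55, 0.71, 0.58, 0.66]       # illustrative per-user accuracies
synthetic = [0.60, 0.57, 0.69, 0.54, 0.65]
print(summarize(real), summarize(synthetic))
print(stats.ttest_ind(real, synthetic))     # T1: comparable performance?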

Citation

 

@misc{agarwal2023lrt,
  doi       = {10.48550/ARXIV.2208.09796},
  url       = {https://arxiv.org/abs/2208.09796},
  author    = {Agarwal, Aditya and Sen, Bipasha and Mukhopadhyay, Rudrabha and Namboodiri, Vinay and Jawahar, C. V.},
  keywords  = {Computer Vision and Pattern Recognition (cs.CV)},
  title     = {Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale},
  publisher = {IEEE/CVF Winter Conference on Applications of Computer Vision},
  year      = {2023}
}
