
INR-V: A Continuous Representation Space for Video-based Generative Tasks


Bipasha Sen*1, Aditya Agarwal*1, Vinay P Namboodiri2, and C.V. Jawahar1

1IIIT Hyderabad, India

2University of Bath, UK

* indicates equal contribution

TMLR, 2022

[ Paper ]   | [ Video ] | [ Inference Code ] | [ OpenReview ]

Abstract


Generating videos is a complex task that is accomplished by generating a set of temporally coherent images frame-by-frame. This limits the expressivity of videos to image-based operations on the individual video frames, requiring network designs that obtain temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using implicit neural representations (INRs): a multi-layered perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted using a meta-network, a hypernetwork trained on neural representations of multiple video instances. The meta-network can then be sampled to generate diverse novel videos, enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space, showcasing many interesting properties not possible with existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also in-paint missing portions in videos to recover temporally coherent full videos. In this work, we evaluate the space learned by INR-V on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting against existing baselines. INR-V significantly outperforms the baselines on several of these tasks, clearly showing the potential of the proposed representation space.
 

Overview

[Figure: INR-V architecture overview]

We parameterize videos as a function of space and time using implicit neural representations (INRs). Any point in a video V_{hwt} can be represented by a function f_Θ : {h, w, t} → RGB_{hwt}, where t denotes the t-th frame of the video, h, w denote the spatial location in the frame, and RGB_{hwt} denotes the color at the pixel position {h, w, t}. Consequently, the dynamic dimension of videos (a few million pixels) is reduced to the constant number of weights Θ (a few thousand) required for the parameterization. A network can then be used to learn a prior over videos in this parameterized space. This is obtained through a meta-network that learns a function mapping a latent space to the reduced parameter space, which in turn maps to a video. A complete video is thus represented as a single latent point. We use a meta-network, called a hypernetwork, that learns a continuous function over the INRs by training on multiple video instances with a distance loss. However, hypernetworks are notoriously unstable to train, especially when parameterizing highly expressive signals like videos. We therefore propose a key prior regularization and a progressive weight initialization scheme that stabilize hypernetwork training, allowing it to scale quickly to more than 30,000 videos. The learned prior enables several downstream tasks such as novel video generation, video inversion, future segment prediction, video inpainting, and smooth video interpolation directly at the video level.
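
To make the parameterization concrete, below is a minimal PyTorch sketch (layer sizes and latent dimension are illustrative assumptions, not the paper's configuration) of an INR f_Θ : (h, w, t) → RGB and a hypernetwork that predicts the INR's weights from a per-video latent code. In the actual method the predicted weights are applied functionally so that gradients flow back to the hypernetwork; here they are simply copied in for illustration.

import torch
import torch.nn as nn

class VideoINR(nn.Module):
    """f_Theta : (h, w, t) -> RGB, queried independently at every pixel location."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, coords):                    # coords: (N, 3) normalized (h, w, t)
        return self.net(coords)

class HyperNetwork(nn.Module):
    """Maps a latent video code z to a flat vector of INR weights."""
    def __init__(self, latent_dim, num_inr_params, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_inr_params),
        )

    def forward(self, z):
        return self.net(z)

inr = VideoINR()
num_params = sum(p.numel() for p in inr.parameters())
hyper = HyperNetwork(latent_dim=128, num_inr_params=num_params)

z = torch.randn(1, 128)                           # one latent point = one complete video
flat_weights = hyper(z).squeeze(0)                # predicted INR parameters
# For illustration only: copy the predicted weights into the INR.
torch.nn.utils.vector_to_parameters(flat_weights, inr.parameters())

coords = torch.rand(1024, 3)                      # sampled (h, w, t) locations
rgb = inr(coords)                                 # (1024, 3) colors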

 

 

Additional Interpolation Results

[Interpolation result grids]

 

Citation

@article{sen2022inrv,
   title={{INR}-V: A Continuous Representation Space for Video-based Generative Tasks},
   author={Bipasha Sen and Aditya Agarwal and Vinay P Namboodiri and C.V. Jawahar},
   journal={Transactions on Machine Learning Research},
   year={2022},
   url={https://openreview.net/forum?id=aIoEkwc2oB},
   note={}
}

 

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Its source code is borrowed from this website.

 


Watching the News: Towards VideoQA Models that can Read

Soumya Jahagirdar†, Minesh Mathew†, Dimosthenis Karatzas‡, C. V. Jawahar†

†CVIT, IIIT Hyderabad, India
‡Computer Vision Center, UAB, Spain

[Paper]       [Video]     [Code]

 

 

Abstract


We address the task of text-based Video Question Answering, incorporating VideoText (the textual content embedded in the videos) information (bottom right). We propose a new dataset of news videos along with QA annotations grounded on video text, and explore VQA models that jointly reason over temporal and text-based information.

 

Video Question Answering methods focus on commonsense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches ignore the textual information present in the video. Instead, we argue that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel VideoQA task that requires reading and understanding the text in the video. To explore this direction, we focus on news videos and require QA systems to comprehend and answer questions about the topics presented by combining visual and textual cues in the video. We introduce the “NewsVideoQA” dataset that comprises more than 8,600 QA pairs on 3,000+ news videos obtained from diverse news channels from around the world. We demonstrate the limitations of current Scene Text VQA and VideoQA methods and propose ways to incorporate scene text information into VideoQA methods.

 

UAV-based Visual Remote Sensing for Automated Building Inspection (UVRSABI)

 

Kushagra Srivastava, Dhruv Patel, Aditya Kumar Jha, Mohit Kumar Jha, Jaskirat Singh, Ravi Kiran Sarvadevabhatla, Harikumar Kandath, Pradeep Kumar Ramancharla, and K. Madhava Krishna

[Paper]     [Documentation]      [GitHub]

 


 

Architecture of automated building inspection using aerial images captured by a UAV. The UAV's odometry information is also used to quantify the different parameters involved in the inspection.

 


Overview

  • We automate building inspection through UAV-based image data collection and a post-processing module that infers and quantifies the relevant details, avoiding manual inspection and reducing time and cost.
  • We introduce a novel method to estimate the distance between adjacent buildings and structures.
  • We develop an architecture that segments rooftops in both orthogonal and non-orthogonal views using a state-of-the-art semantic segmentation model.
  • Considering the importance of civil inspection of buildings, we introduce a software library that helps estimate the distance between adjacent buildings, the plan shape of a building, the roof area, non-structural elements (NSE) on the rooftop, and the roof layout.

 

Modules

To estimate the seismic structural parameters of the buildings, the following modules have been introduced:

  • Distance between Adjacent Buildings
  • Plan Shape and Roof Area Estimation
  • Roof Layout Estimation

 

Distance between Adjacent Buildings

 


This module provides the distance between two adjacent buildings. We sample images from the videos captured by the UAV and perform panoptic segmentation using a state-of-the-art deep learning model, eliminating vegetation (such as trees) from the images. The masked images are then fed to a state-of-the-art image-based 3D reconstruction library, which outputs a dense 3D point cloud. We then apply RANSAC to fit planes to the segmented structural point cloud. Finally, points are sampled on these planes to calculate the distance between the adjacent buildings at different locations.
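
A minimal sketch of the plane-fitting and distance-measurement step, assuming Open3D and a dense point cloud produced by the reconstruction described above ("facades.ply" is a hypothetical file name); the vegetation removal and 3D reconstruction themselves are not shown.

import numpy as np
import open3d as o3d

# Hypothetical file holding the dense, vegetation-free point cloud.
pcd = o3d.io.read_point_cloud("facades.ply")

# Fit the first facade plane with RANSAC, then fit a second plane to the remainder.
plane1, inliers1 = pcd.segment_plane(distance_threshold=0.05, ransac_n=3,
                                     num_iterations=2000)
rest = pcd.select_by_index(inliers1, invert=True)
plane2, inliers2 = rest.segment_plane(distance_threshold=0.05, ransac_n=3,
                                      num_iterations=2000)

# Sample points on the second plane and measure their distance to the first one,
# giving the inter-building gap at several locations (units follow the
# reconstruction's scale; metres only if the reconstruction is metrically scaled).
a, b, c, d = plane1
pts = np.asarray(rest.select_by_index(inliers2).points)
gaps = np.abs(pts @ np.array([a, b, c]) + d) / np.linalg.norm([a, b, c])
print(f"gap: mean={gaps.mean():.2f}, min={gaps.min():.2f}, max={gaps.max():.2f}")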

 

Results: Distance between Adjacent Buildings


Sub-figures (a)-(c) and (d)-(f) show plane fitting using piecewise RANSAC from different views for two subject buildings.

 

Plan Shape and Roof Area Estimation


This module provides information about the shape and roof area of the building. We segment the roof using a state-of-the-art semantic segmentation model. The input images are first passed through a pre-processing module that removes distortion from the wide-angle images, and data augmentation is used to increase robustness and performance. The roof area is calculated from the focal length of the camera, the height of the drone above the roof, and the segmented mask area in pixels.
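
The roof-area computation reduces to a pinhole-camera scaling: for a nadir (top-down) view, each pixel spans (height / focal length) metres on the roof plane. The sketch below illustrates this with entirely hypothetical numbers.

import numpy as np

def roof_area_m2(mask: np.ndarray, drone_height_m: float, focal_length_px: float) -> float:
    """Area covered by a binary roof mask, assuming a nadir (top-down) view.

    Each pixel spans (height / focal_length) metres on the roof plane, so one
    pixel covers (height / focal_length) ** 2 square metres.
    """
    metres_per_pixel = drone_height_m / focal_length_px
    return float(mask.astype(bool).sum()) * metres_per_pixel ** 2

# Hypothetical example: a 600 x 800 pixel roof mask, drone 30 m above the roof,
# focal length of 2500 px.
mask = np.zeros((1000, 1000), dtype=np.uint8)
mask[200:800, 100:900] = 1
print(f"estimated roof area: {roof_area_m2(mask, 30.0, 2500.0):.1f} m^2")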

 

Results: Plan Shape and Roof Area Estimation


This figure shows the roof segmentation results for four subject buildings.

 

Roof Layout Estimation


This module provides information about the roof layout. Since it is not possible to capture the whole roof in a single frame, especially for large buildings, we perform large-scale image stitching of partially visible roofs, followed by NSE detection and roof segmentation.
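
As an illustration of the stitching step, the sketch below uses OpenCV's built-in stitcher on a hypothetical folder of partially overlapping roof frames; the actual pipeline may use a different large-scale stitching method.

import glob
import cv2

# "roof_frames/" is a hypothetical folder of partially overlapping roof images
# sampled from the UAV video.
images = [cv2.imread(p) for p in sorted(glob.glob("roof_frames/*.jpg"))]

# SCANS mode suits planar, top-down captures better than the default panorama mode.
stitcher = cv2.Stitcher_create(cv2.Stitcher_SCANS)
status, stitched_roof = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    # The stitched mosaic is then passed to the roof-segmentation and
    # NSE-detection models to produce the final roof layout.
    cv2.imwrite("stitched_roof.png", stitched_roof)
else:
    print(f"stitching failed with status {status}")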

 

Results: Roof Layout Estimation

[Figures: Stitched Image, Roof Mask, and Object Mask]

 


Contact

If you have any questions, please reach out to any of the above-mentioned authors.

Unsupervised Audio-Visual Lecture Segmentation

Darshan Singh S*, Anchit Gupta*, C.V. Jawahar and Makarand Tapaswi

 

CVIT,   IIIT Hyderabad

WACV, 2023

[ Code ]   | [Dataset ] | [ arXiv ] | [ Demo Video ]

 


We address the task of lecture segmentation in an unsupervised manner. We show an example of a lecture segmented using our method. Our method predicts segments close to the ground truth. Note that our method does not predict the segment labels; they are shown only so that the reader can appreciate the different topics.

Abstract

Over the last decade, online lecture videos have become increasingly popular and have experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing number of online lectures are lacking. Our first contribution is to facilitate research in the educational domain by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation that splits lectures into bite-sized topics. Lecture clip representations leverage visual, textual, and OCR cues and are trained on a pretext self-supervised task of matching the narration with the temporally aligned visual content. We formulate lecture segmentation as an unsupervised task and use these representations to generate segments using a temporally consistent 1-nearest neighbor algorithm, TW-FINCH. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.
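
The sketch below is a much-simplified illustration of temporally weighted 1-nearest-neighbour grouping on per-clip embeddings; it is not the TW-FINCH implementation used in the paper, and the random features are placeholders for the learned clip representations.

import numpy as np

def segment_clips(feats: np.ndarray) -> np.ndarray:
    """feats: (N, D) clip embeddings in temporal order. Returns a segment id per clip."""
    n = len(feats)
    idx = np.arange(n)

    # Feature distance scaled by temporal distance, so each clip prefers a
    # neighbour that is both similar and close in time.
    fdist = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    tdist = np.abs(idx[:, None] - idx[None, :]).astype(float)
    dist = fdist * tdist
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)                      # 1-nearest neighbour of each clip

    # Link every clip to its nearest neighbour; connected components = segments.
    parent = idx.copy()
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in enumerate(nn):
        parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels

print(segment_clips(np.random.rand(20, 512)))     # toy example with random features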

Paper

  • Paper
    Unsupervised Audio-Visual Lecture Segmentation

    Darshan Singh S, Anchit Gupta, C.V. Jawahar and Makarand Tapaswi
    Unsupervised Audio-Visual Lecture Segmentation, WACV, 2023.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Contact

  1. Darshan Singh S
  2. Anchit Gupta

PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition

Pose-based action recognition is predominantly tackled by approaches that treat the input skeleton in a monolithic fashion, i.e. joints in the pose tree are processed as a whole. However, such approaches ignore the fact that action categories are often characterized by localized action dynamics involving only small subsets of joints, such as the hands (e.g. ‘Thumbs up’) or legs (e.g. ‘Kicking’). Although part-grouping based approaches exist, each part group is not considered within the global pose frame, causing such methods to fall short. Further, conventional approaches employ independent modality streams (e.g. joint, bone, joint velocity, bone velocity) and train their network multiple times on these streams, which massively increases the number of training parameters.


The plot on the left shows accuracy against the number of parameters for our proposed architecture PSUMNet (⋆) and existing approaches on the large-scale NTU RGB+D 120 human actions dataset (cross-subject).


Comparison between the conventional training procedure used in most previous approaches (left) and our approach (right).

 

 

To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose a global-frame-based part stream approach as opposed to conventional modality-based streams. Within each part stream, the associated data from multiple modalities is unified and consumed by the processing pipeline. Experimentally, PSUMNet achieves state-of-the-art performance on the widely used NTU RGB+D 60/120 datasets and the dense joint skeleton datasets NTU 60-X/120-X. PSUMNet is highly efficient and outperforms competing methods which use 100%-400% more parameters. PSUMNet also generalizes to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet's scalability, performance, and efficiency make it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices.
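
As a rough illustration of the part-stream idea, the sketch below splits a skeleton tensor into overlapping part groups while keeping every joint in the global pose frame; the joint indices and group definitions are hypothetical, not the NTU RGB+D layout or the paper's exact grouping.

import torch

# Illustrative part groups over a 25-joint skeleton (hypothetical indices).
PART_GROUPS = {
    "body":  [0, 1, 2, 3, 4, 8, 12, 16, 20],
    "hands": [4, 5, 6, 7, 8, 9, 10, 11, 21, 22, 23, 24],
    "legs":  [0, 12, 13, 14, 15, 16, 17, 18, 19],
}

def make_part_streams(skeleton: torch.Tensor) -> dict:
    """skeleton: (C, T, V, M) = channels, frames, joints, persons.

    Each part stream keeps its joints in the global pose frame and is later
    processed by its own (smaller) spatio-temporal network."""
    return {name: skeleton[:, :, idx, :] for name, idx in PART_GROUPS.items()}

streams = make_part_streams(torch.randn(3, 64, 25, 2))   # toy 25-joint input
for name, s in streams.items():
    print(name, tuple(s.shape))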

[ Supplementary ] | [ PDF ] | [ GitHub ]

[PSUMNet architecture diagram]

  1. Overall architecture of one stream of the proposed network. The input skeleton is passed through the Multi-Modality Data Generator (MMDG), which generates joint, bone, joint-velocity, and bone-velocity data from the input and concatenates each modality along the channel dimension, as shown in (b) (see the sketch after this list).
  2. This multi-modal data is processed by the Spatio-Temporal Relational Module (STRM), followed by global average pooling and a fully connected (FC) layer.
  3. Spatio-Temporal Relational Block (STRB), where the input data is passed through the Spatial Attention Map Generator (SAMG) for spatial relation modeling, followed by the Temporal Relational Module. As shown in (a), multiple STRBs stacked together make up the STRM.
  4. The Spatial Attention Map Generator (SAMG) dynamically models an adjacency matrix (A_hyb) to capture spatial relations between joints. A predefined adjacency matrix (A) is used for regularization.
  5. The Temporal Relational Module (TRM) consists of multiple temporal convolution blocks in parallel. The outputs of the temporal convolution blocks are concatenated to generate the final features.
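
The sketch below illustrates the modality unification described in step 1: joint, bone, joint-velocity, and bone-velocity data are derived from the input skeleton and concatenated along the channel dimension. The bone pairs and tensor layout are illustrative assumptions, not the released implementation.

import torch

def mmdg(joints: torch.Tensor, bone_pairs) -> torch.Tensor:
    """joints: (N, C, T, V, M). Returns (N, 4*C, T, V, M) with joint, bone,
    joint-velocity, and bone-velocity stacked along the channel dimension."""
    bones = torch.zeros_like(joints)
    for child, parent in bone_pairs:
        bones[:, :, :, child] = joints[:, :, :, child] - joints[:, :, :, parent]
    joint_vel = torch.zeros_like(joints)
    joint_vel[:, :, 1:] = joints[:, :, 1:] - joints[:, :, :-1]   # frame-to-frame motion
    bone_vel = torch.zeros_like(bones)
    bone_vel[:, :, 1:] = bones[:, :, 1:] - bones[:, :, :-1]
    return torch.cat([joints, bones, joint_vel, bone_vel], dim=1)

x = torch.randn(8, 3, 64, 25, 2)                  # batch, xyz, frames, joints, persons
pairs = [(i, max(i - 1, 0)) for i in range(25)]   # toy kinematic tree, not NTU's
print(mmdg(x, pairs).shape)                       # torch.Size([8, 12, 64, 25, 2])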

 
