
FaceOff: A Video-to-Video Face Swapping System

Aditya Agarwal*1, Bipasha Sen*1, Rudrabha Mukhopadhyay1, Vinay P Namboodiri2, C V Jawahar1

1IIIT Hyderabad, India
2University of Bath, UK

* indicates equal contribution

[Paper]       [Video]     [Code]

 

 

Abstract

[Teaser figure]

Doubles play an indispensable role in the movie industry. They take the place of the actors in dangerous stunt scenes or in scenes where the same actor plays multiple characters. The double's face is later replaced with the actor's face and expressions manually using expensive CGI technology, costing millions of dollars and taking months to complete. An automated, inexpensive, and fast alternative is to use face-swapping techniques, which aim to swap an identity from a source face video (or an image) onto a target face video. However, such methods cannot preserve the actor's source expressions, which are important for the scene's context. To tackle this challenge, we introduce video-to-video (V2V) face-swapping, a novel face-swapping task that preserves (1) the identity and expressions of the source (actor) face video and (2) the background and pose of the target (double) video. We propose FaceOff, a V2V face-swapping system that operates by learning a robust blending operation to merge two face videos following the constraints above. It reduces the videos to a quantized latent space and then blends them in the reduced space. FaceOff is trained in a self-supervised manner and robustly tackles the non-trivial challenges of V2V face-swapping. As shown in the experiments, FaceOff significantly outperforms alternative approaches both qualitatively and quantitatively.

 

Overview

[Figure: FaceOff architecture]

Swapping faces across videos is non-trivial because it involves merging two different motions: the actor's face motion and the double's head motion. This requires a network that can take two different motions as input and produce a third, coherent motion. FaceOff is a video-to-video face-swapping system that reduces the face videos to a quantized latent space and blends them in the reduced space. A fundamental challenge in training such a network is the absence of ground truth. FaceOff therefore uses a self-supervised training strategy: a single video serves as both the source and the target, pseudo motion errors are introduced on the source video, and the network is trained to fix these pseudo errors and regenerate the source video. To do this, we learn to blend the foreground of the source video with the background and pose of the target face video such that the blended output is coherent and meaningful.

We use a temporal autoencoding module that merges the motion of the source and the target video in a quantized latent space. We propose a modified vector-quantized encoder with temporal modules made of non-linear 3D convolution operations to encode the video into this quantized latent space. The input to the encoder is a single video made by concatenating the source foreground and target background frames channel-wise. The encoder first encodes the concatenated video frame-wise into 32x32 and 64x64 dimensional top and bottom hierarchies, respectively. Before the quantization step at each hierarchy, the temporal modules process the reduced video frames; this allows the network to backpropagate with temporal connections between the frames. The decoder then decodes the reduced frames, supervised by a distance loss against the ground-truth video. The output is a temporally and spatially coherent blended video of the source foreground and the target background.
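To make the blending operation concrete, below is a minimal PyTorch sketch of the idea: per-frame encoding of the channel-wise concatenated input, a 3D-convolutional temporal module before quantization, and a decoder that reconstructs blended frames. It is an illustration under simplifying assumptions (a single quantization hierarchy, toy channel sizes, and a hypothetical TemporalVQBlender class), not the FaceOff implementation.

import torch
import torch.nn as nn

class TemporalVQBlender(nn.Module):
    def __init__(self, codebook_size=512, latent_dim=64):
        super().__init__()
        # Per-frame 2D encoder; the input has 6 channels because the source
        # foreground and target background frames are concatenated channel-wise.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 4, stride=2, padding=1),
        )
        # Temporal module: non-linear 3D convolutions applied before
        # quantization so gradients flow across frames.
        self.temporal = nn.Sequential(
            nn.Conv3d(latent_dim, latent_dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(latent_dim, latent_dim, 3, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def quantize(self, z):
        # Nearest-codebook lookup with a straight-through gradient estimator.
        n, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view(n, h, w, c).permute(0, 3, 1, 2)
        return z + (q - z).detach()

    def forward(self, src_fg, tgt_bg):
        # src_fg, tgt_bg: (B, T, 3, H, W) aligned source-foreground and
        # target-background videos.
        b, t, _, h, w = src_fg.shape
        x = torch.cat([src_fg, tgt_bg], dim=2).reshape(b * t, 6, h, w)
        z = self.encoder(x)                                 # (B*T, C, h', w')
        c, hh, ww = z.shape[1:]
        z = z.view(b, t, c, hh, ww).permute(0, 2, 1, 3, 4)  # (B, C, T, h', w')
        z = self.temporal(z)                                # mix the two motions over time
        z = z.permute(0, 2, 1, 3, 4).reshape(b * t, c, hh, ww)
        out = self.decoder(self.quantize(z))                # blended frames
        return out.view(b, t, 3, h, w)

# Self-supervised training (sketch): use one video as both source and target,
# perturb the source with pseudo motion errors, and minimize a distance loss
# between the blended output and the original video.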

 

FaceOff Video-to-Video Face Swapping

[Video-to-video face swapping examples]

 

Training Pipeline

[Figure: training pipeline]

Inference Pipeline

[Figure: inference pipeline]

Results on Unseen Identities

[Result videos on unseen identities]

Comparisons

[Comparison videos]

 

Results on Same Identity

[Result videos on the same identity]

 

Some More Results

[Additional result videos]

 

Citation

@misc{agarwal2023faceoff,
  doi = {10.48550/ARXIV.2208.09788},
  url = {https://arxiv.org/abs/2208.09788},
  author = {Agarwal, Aditya and Sen, Bipasha and Mukhopadhyay, Rudrabha and Namboodiri, Vinay and Jawahar, C. V.},
  keywords = {Computer Vision and Pattern Recognition (cs.CV)},
  title = {FaceOff: A Video-to-Video Face Swapping System},
  publisher = {IEEE/CVF Winter Conference on Applications of Computer Vision},
  year = {2023},
}

 

INR-V: A Continuous Representation Space for Video-based Generative Tasks


Bipasha Sen*1, Aditya Agarwal*1, Vinay P Namboodiri2 and C.V. Jawahar

1IIIT Hyderabad, India

2University of Bath, UK

* indicates equal contribution

TMLR, 2022

[ Paper ]   | [ Video ] | [ Inference Code ] | [ OpenReview ]

Abstract

[Banner video]

Generating videos is a complex task that is typically accomplished by generating a set of temporally coherent images frame-by-frame. This limits the expressivity of videos to image-based operations on the individual frames and requires network designs that obtain temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using implicit neural representations (INRs): a multi-layer perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted using a meta-network, a hypernetwork trained on neural representations of multiple video instances. The meta-network can later be sampled to generate diverse novel videos, enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space and exhibits many interesting properties not possible with existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also in-paint missing portions in videos to recover temporally coherent full videos. In this work, we evaluate the space learned by INR-V on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting against existing baselines. INR-V significantly outperforms the baselines on several of these tasks, clearly showing the potential of the proposed representation space.
 

Overview

[Figure: INR-V architecture overview]

We parameterize videos as a function of space and time using implicit neural representations (INRs). Any point V(h, w, t) in a video can be represented by a function fΘ : (h, w, t) → RGB(h, w, t), where t denotes the t-th frame of the video, (h, w) denotes the spatial location within the frame, and RGB denotes the color at the pixel position (h, w, t). Subsequently, the dynamic dimension of videos (a few million pixels) is reduced to a constant number of weights Θ (a few thousand) required for the parameterization. A network can then learn a prior over videos in this parameterized space. This is obtained through a meta-network that learns a function mapping a latent space to the reduced parameter space, which in turn maps to a video; a complete video is thus represented as a single latent point. We use a meta-network called a hypernetwork that learns a continuous function over the INRs by being trained on multiple video instances with a distance loss. However, hypernetworks are notoriously unstable to train, especially when parameterizing highly expressive signals such as videos. We therefore propose key prior regularization and a progressive weight initialization scheme that stabilize hypernetwork training, allowing it to scale quickly to more than 30,000 videos. The learned prior enables several downstream tasks such as novel video generation, video inversion, future segment prediction, video inpainting, and smooth video interpolation directly at the video level.
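To make the parameterization concrete, here is a minimal sketch of a coordinate MLP whose weights are produced by a hypernetwork from a per-video latent code. The layer sizes, latent dimension, and plain-ReLU INR are illustrative assumptions (the class and helper names are hypothetical), and the prior regularization and progressive weight initialization used by INR-V are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

INR_DIMS = [3, 64, 64, 3]   # (h, w, t) -> (R, G, B)

def inr_param_count(dims):
    # Total number of weights and biases in the coordinate MLP.
    return sum(dims[i] * dims[i + 1] + dims[i + 1] for i in range(len(dims) - 1))

class HyperINR(nn.Module):
    """Maps a per-video latent code to the weights of a small coordinate MLP."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.hyper = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, inr_param_count(INR_DIMS)),
        )

    def forward(self, z, coords):
        # z: (latent_dim,) one latent point = one video
        # coords: (N, 3) normalized (h, w, t) locations to render
        theta = self.hyper(z)
        x, offset = coords, 0
        for i in range(len(INR_DIMS) - 1):
            d_in, d_out = INR_DIMS[i], INR_DIMS[i + 1]
            W = theta[offset:offset + d_in * d_out].view(d_out, d_in)
            offset += d_in * d_out
            b = theta[offset:offset + d_out]
            offset += d_out
            x = F.linear(x, W, b)
            if i < len(INR_DIMS) - 2:
                x = torch.relu(x)
        return torch.sigmoid(x)   # RGB in [0, 1]

# Usage: render a 16-frame 32x32 video from a sampled latent point.
model = HyperINR()
z = torch.randn(128)
t, h, w = torch.meshgrid(
    torch.linspace(-1, 1, 16), torch.linspace(-1, 1, 32),
    torch.linspace(-1, 1, 32), indexing="ij")
coords = torch.stack([h, w, t], dim=-1).view(-1, 3)
video = model(z, coords).view(16, 32, 32, 3)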

 

[Result figures: comparisons with baselines, video inversion, video inpainting, super-resolution, and additional results]

 

Additional Interpolation Results

[Interpolation grid videos]

 

Citation

@article{sen2022inrv,
  title={{INR}-V: A Continuous Representation Space for Video-based Generative Tasks},
  author={Bipasha Sen and Aditya Agarwal and Vinay P Namboodiri and C.V. Jawahar},
  journal={Transactions on Machine Learning Research},
  year={2022},
  url={https://openreview.net/forum?id=aIoEkwc2oB},
}

 


 


Watching the News: Towards VideoQA Models that can Read

Soumya Jahagirdar†, Minesh Mathew†, Dimosthenis Karatzas‡, C. V. Jawahar†

†CVIT, IIIT Hyderabad, India
‡Computer Vision Center, UAB, Spain

[Paper]       [Video]     [Code]     [Dataset]

 

 

Abstract

[Teaser figure]

We address the task of text-based Video Question Answering, incorporating VideoText (the textual content embedded in the videos). We propose a new dataset of news videos with QA annotations grounded on video text, and explore VQA models that jointly reason over temporal and text-based information.

 

Video Question Answering methods focus on commonsense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches, however, ignore the textual information present in the video. We argue instead that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel VideoQA task that requires reading and understanding the text in the video. To explore this direction, we focus on news videos and require QA systems to comprehend and answer questions about the topics presented by combining visual and textual cues in the video. We introduce the “NewsVideoQA” dataset that comprises more than 8,600 QA pairs on 3,000+ news videos obtained from diverse news channels from around the world. We demonstrate the limitations of current Scene Text VQA and VideoQA methods and propose ways to incorporate scene text information into VideoQA methods.

 

UAV-based Visual Remote Sensing for Automated Building Inspection (UVRSABI)

 

Kushagra Srivastava, Dhruv Patel, Aditya Kumar Jha, Mohit Kumar Jha, Jaskirat Singh, Ravi Kiran Sarvadevabhatla, Harikumar Kandath, Pradeep Kumar Ramancharla, K. Madhava Krishna

[Paper]     [Documentation]      [GitHub]

 

[Figure: system overview]

 

Architecture of automated building inspection using aerial images captured by a UAV. The UAV's odometry information is also used to quantify the different parameters involved in the inspection.

 


Overview

  • We automate the inspection of buildings through UAV-based image data collection and a post-processing module that infers and quantifies structural details, avoiding manual inspection and reducing time and cost.
  • We introduce a novel method to estimate the distance between adjacent buildings and structures.
  • We develop an architecture that segments rooftops in both orthogonal and non-orthogonal views using a state-of-the-art semantic segmentation model.
  • Considering the importance of civil inspection of buildings, we introduce a software library that helps estimate the distance between adjacent buildings, the plan shape of a building, the roof area, non-structural elements (NSE) on the rooftop, and the roof layout.

 

Modules

To estimate the seismic structural parameters of buildings, the following modules have been introduced:

  • Distance between Adjacent Buildings
  • Plan Shape and Roof Area Estimation
  • Roof Layout Estimation

 

Distance between Adjacent Buildings

 

[Figure: distance estimation module]

This module provides the distance between two adjacent buildings. We sample images from the videos captured by the UAV and perform panoptic segmentation using a state-of-the-art deep learning model, eliminating vegetation (such as trees) from the images. The masked images are then fed to a state-of-the-art image-based 3D reconstruction library, which outputs a dense 3D point cloud. We then apply RANSAC to fit planes to the segmented structural point cloud. Finally, points are sampled on these planes to calculate the distance between the adjacent buildings at different locations.
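For illustration, a rough sketch of the plane-fitting and distance computation is given below, assuming Open3D is available and that the two facing facades have already been segmented into separate point clouds. The function and variable names are hypothetical; this is not the UVRSABI implementation.

import numpy as np
import open3d as o3d

def fit_plane(points_xyz, dist_thresh=0.05):
    """RANSAC plane fit; returns (a, b, c, d) with ax + by + cz + d = 0 and the inlier points."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points_xyz)
    model, inliers = pcd.segment_plane(distance_threshold=dist_thresh,
                                       ransac_n=3, num_iterations=1000)
    return np.asarray(model), points_xyz[inliers]

def facade_distances(points_a, points_b, n_samples=200):
    """Distance between two adjacent facades, measured at several sampled locations."""
    plane_b, _ = fit_plane(points_b)
    a, b, c, d = plane_b
    normal = np.array([a, b, c])
    # Sample points on facade A and measure point-to-plane distance to facade B.
    idx = np.random.choice(len(points_a), size=min(n_samples, len(points_a)),
                           replace=False)
    sampled = points_a[idx]
    dists = np.abs(sampled @ normal + d) / np.linalg.norm(normal)
    return dists  # per-location gaps; mean/min give summary statistics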

 

Results: Distance between Adjacent Buildings

[Figure: plane-fitting results]

Sub-figures (a)-(c) and (d)-(f) show plane fitting using piecewise RANSAC from different views for two subject buildings.

 

Plan Shape and Roof Area Estimation

[Figure: plan shape and roof area estimation module]

This module provides information about the shape and roof area of the building. We segment the roof using a state-of-the-art semantic segmentation model. The input images are first passed through a pre-processing module that removes distortions from the wide-angle images, and data augmentation is used to improve robustness and performance. The roof area is calculated from the camera's focal length, the height of the drone above the roof, and the segmented mask area in pixels.
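As a back-of-the-envelope illustration of the last step: under a pinhole camera model with the UAV looking straight down, each pixel covers roughly height / focal-length metres on the roof plane, so the roof area follows directly from the mask area in pixels. The exact formulation used in UVRSABI may differ; the helper below is a hypothetical sketch.

def roof_area_m2(mask_pixel_count: int, height_m: float, focal_px: float) -> float:
    """Roof area from a binary roof mask.

    mask_pixel_count: number of pixels labelled 'roof' in the segmentation mask
    height_m:         UAV height above the roof plane, in metres
    focal_px:         camera focal length expressed in pixels
    """
    ground_sampling_distance = height_m / focal_px   # metres covered by one pixel
    return mask_pixel_count * ground_sampling_distance ** 2

# Example: 250,000 roof pixels, drone 20 m above the roof, focal length 2000 px
# -> each pixel covers 0.01 m, so the roof area is 250000 * 0.0001 = 25 m^2.
print(roof_area_m2(250_000, height_m=20.0, focal_px=2000.0))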

 

Results: Plan Shape and Roof Area Estimation

[Figure: roof segmentation results]

This figure represents the roof segmentation results for 4 subject buildings.

 

Roof Layout Estimation

[Figure: roof layout estimation module]

This module provides information about the roof layout. Since it is not possible to capture the whole roof in a single frame, especially for large buildings, we perform large-scale image stitching of the partially visible roofs, followed by NSE detection and roof segmentation.
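As a minimal illustration of the stitching step, OpenCV's high-level Stitcher can compose overlapping nadir frames into a single roof image. The UVRSABI pipeline may use a different or custom large-scale stitcher, and the input path below is hypothetical.

import cv2
import glob

# Load the sampled roof frames (hypothetical directory).
frames = [cv2.imread(p) for p in sorted(glob.glob("roof_frames/*.jpg"))]

# SCANS mode suits roughly planar, top-down imagery.
stitcher = cv2.Stitcher_create(cv2.Stitcher_SCANS)
status, panorama = stitcher.stitch(frames)
if status == cv2.Stitcher_OK:
    cv2.imwrite("stitched_roof.jpg", panorama)
    # The stitched roof image is then passed to NSE detection and roof segmentation.
else:
    print(f"Stitching failed with status {status}")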

 

Results: Roof Layout Estimation

[Figures: stitched image, roof mask, and object mask]

 


Contact

If you have any questions, please reach out to any of the above-mentioned authors.

Unsupervised Audio-Visual Lecture Segmentation

Darshan Singh S*, Anchit Gupta*, C.V. Jawahar and Makarand Tapaswi

 

CVIT,   IIIT Hyderabad

WACV, 2023

[ Code ] | [ Dataset ] | [ arXiv ] | [ Demo Video ]

 

[Figure: architecture]

We address the task of lecture segmentation in an unsupervised manner. We show an example of a lecture segmented using our method; the predicted segments are close to the ground truth. Note that our method does not predict the segment labels; they are shown only so that the reader can appreciate the different topics.

Abstract

Over the last decade, online lecture videos have become increasingly popular and have experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing body of online lectures are lacking. Our first contribution is to facilitate research in the educational domain by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation, which splits lectures into bite-sized topics. Lecture clip representations leverage visual, textual, and OCR cues and are trained on a pretext self-supervised task of matching the narration with the temporally aligned visual content. We formulate lecture segmentation as an unsupervised task and use these representations to generate segments using a temporally consistent 1-nearest neighbor algorithm, TW-FINCH. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.
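For intuition, the sketch below shows a simplified version of the temporally weighted first-neighbor grouping idea behind TW-FINCH: each clip links to its nearest neighbor under a distance that blends feature dissimilarity with temporal distance, and connected components of these links form segments. The actual algorithm applies this linking recursively and with its own temporal weighting, so treat this only as an illustration; the function name is hypothetical.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def first_neighbor_segments(clip_feats: np.ndarray, temporal_weight: float = 1.0):
    """clip_feats: (N, D) L2-normalized clip embeddings in temporal order."""
    n = len(clip_feats)
    t = np.arange(n, dtype=np.float64) / n
    # Distance = cosine distance scaled up by normalized temporal distance.
    feat_dist = 1.0 - clip_feats @ clip_feats.T
    time_dist = np.abs(t[:, None] - t[None, :])
    dist = feat_dist * (1.0 + temporal_weight * time_dist)
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)                  # first (temporally weighted) neighbor
    # Clips connected through first-neighbor links form one segment.
    adj = csr_matrix((np.ones(n), (np.arange(n), nn)), shape=(n, n))
    _, labels = connected_components(adj, directed=False)
    return labels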

Paper

  • Paper
    Unsupervised Audio-Visual Lecture Segmentation

    Darshan Singh S, Anchit Gupta, C.V. Jawahar and Makarand Tapaswi
    Unsupervised Audio-Visual Lecture Segmentation, WACV, 2023.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Contact

  1. Darshan Singh S
  2. Anchit Gupta
