
INR-V: A Continuous Representation Space for Video-based Generative Tasks


Bipasha Sen*1, Aditya Agarwal*1, Vinay P Namboodiri2, and C.V. Jawahar1

1IIIT Hyderabad, India

2University of Bath, UK

* indicates equal contribution

TMLR, 2022

[ Paper ]   | [ Video ] | [ Inference Code ] | [ OpenReview ]

Abstract


Generating videos is a complex task that is accomplished by generating a set of temporally coherent images frame-by-frame. This limits the expressivity of videos to image-based operations on the individual video frames, requiring network designs that obtain temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using implicit neural representations (INRs): a multi-layered perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted using a meta-network, a hypernetwork trained on neural representations of multiple video instances. The meta-network can then be sampled to generate diverse novel videos, enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space, showcasing many interesting properties not possible with existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also in-paint missing portions in videos to recover temporally coherent full videos. In this work, we evaluate the space learned by INR-V on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting against existing baselines. INR-V significantly outperforms the baselines on several of these tasks, clearly showing the potential of the proposed representation space.
 

Overview

[Figure: INR-V architecture overview]

We parameterize videos as a function of space and time using implicit neural representations (INRs). Any point in a video V_{hwt} can be represented by a function f_Θ : {h, w, t} → RGB_{hwt}, where t denotes the t-th frame of the video, h, w denote the spatial location in the frame, and RGB_{hwt} denotes the color at the pixel position {h, w, t}. Consequently, the dynamic dimension of videos (a few million pixels) is reduced to the constant number of weights Θ (a few thousand) required for the parameterization. A network can then be used to learn a prior over videos in this parameterized space. This is obtained through a meta-network that learns a function mapping a latent space to the reduced parameter space, which in turn maps to a video. A complete video is thus represented as a single latent point. We use a meta-network, called a hypernetwork, that learns a continuous function over the INRs by training on multiple video instances with a distance loss. However, hypernetworks are notoriously unstable to train, especially when parameterizing highly expressive signals like videos. We therefore propose a key prior regularization and a progressive weight initialization scheme that stabilize hypernetwork training, allowing it to scale quickly to more than 30,000 videos. The learned prior enables several downstream tasks such as novel video generation, video inversion, future segment prediction, video inpainting, and smooth video interpolation directly at the video level.
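
To make the parameterization concrete, below is a minimal PyTorch sketch (layer sizes and latent dimension are illustrative assumptions, not the paper's configuration) of an INR f_Θ : (h, w, t) → RGB and a hypernetwork that predicts the INR's weights from a per-video latent code. In the actual method the predicted weights are applied functionally so that gradients flow back to the hypernetwork; here they are simply copied in for illustration.

import torch
import torch.nn as nn

class VideoINR(nn.Module):
    """f_Theta : (h, w, t) -> RGB, queried independently at every pixel location."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, coords):                    # coords: (N, 3) normalized (h, w, t)
        return self.net(coords)

class HyperNetwork(nn.Module):
    """Maps a latent video code z to a flat vector of INR weights."""
    def __init__(self, latent_dim, num_inr_params, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_inr_params),
        )

    def forward(self, z):
        return self.net(z)

inr = VideoINR()
num_params = sum(p.numel() for p in inr.parameters())
hyper = HyperNetwork(latent_dim=128, num_inr_params=num_params)

z = torch.randn(1, 128)                           # one latent point = one complete video
flat_weights = hyper(z).squeeze(0)                # predicted INR parameters
# For illustration only: copy the predicted weights into the INR.
torch.nn.utils.vector_to_parameters(flat_weights, inr.parameters())

coords = torch.rand(1024, 3)                      # sampled (h, w, t) locations
rgb = inr(coords)                                 # (1024, 3) colors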

 

 

Additional Interpolation Results

[Interpolation result grids]

 

Citation

@article{sen2022inrv,
   title={{INR}-V: A Continuous Representation Space for Video-based Generative Tasks},
   author={Bipasha Sen and Aditya Agarwal and Vinay P Namboodiri and C.V. Jawahar},
   journal={Transactions on Machine Learning Research},
   year={2022},
   url={https://openreview.net/forum?id=aIoEkwc2oB},
   note={}
}

 

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Its source code is borrowed from this website.

 


Watching the News: Towards VideoQA Models that can Read

Soumya Jahagirdar†, Minesh Mathew†, Dimosthenis Karatzas‡, C. V. Jawahar†

†CVIT, IIIT Hyderabad, India
‡Computer Vision Center, UAB, Spain

[Paper]       [Video]     [Code]

 

 

Abstract


We address the task of text-based Video Question Answering, incorporating VideoText (the textual content embedded in the videos) information (bottom right). We propose a new dataset of news videos along with QA annotations grounded on video text, and explore VQA models that jointly reason over temporal and text-based information.

 

Video Question Answering methods focus on commonsense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches ignore the textual information present in the video. Instead, we argue that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel VideoQA task that requires reading and understanding the text in the video. To explore this direction, we focus on news videos and require QA systems to comprehend and answer questions about the topics presented by combining visual and textual cues in the video. We introduce the “NewsVideoQA” dataset that comprises more than 8,600 QA pairs on 3,000+ news videos obtained from diverse news channels from around the world. We demonstrate the limitations of current Scene Text VQA and VideoQA methods and propose ways to incorporate scene text information into VideoQA methods.

 

UAV-based Visual Remote Sensing for Automated Building Inspection (UVRSABI)

 

Kushagra Srivastava, Dhruv Patel, Aditya Kumar Jha, Mohit Kumar Jha, Jaskirat Singh, Ravi Kiran Sarvadevabhatla, Harikumar Kandath, Pradeep Kumar Ramancharla, and K. Madhava Krishna

[Paper]     [Documentation]      [GitHub]

 


 

Architecture of automated building inspection using aerial images captured by a UAV. The UAV's odometry information is also used to quantify the different parameters involved in the inspection.

 


Overview

  • We automate building inspection through UAV-based image data collection and a post-processing module that infers and quantifies the relevant details, avoiding manual inspection and reducing time and cost.
  • We introduce a novel method to estimate the distance between adjacent buildings and structures.
  • We develop an architecture that segments rooftops in both orthogonal and non-orthogonal views using a state-of-the-art semantic segmentation model.
  • Considering the importance of civil inspection of buildings, we introduce a software library that helps estimate the distance between adjacent buildings, the plan shape of a building, the roof area, non-structural elements (NSE) on the rooftop, and the roof layout.

 

Modules

To estimate the seismic structural parameters of the buildings, the following modules have been introduced:

  • Distance between Adjacent Buildings
  • Plan Shape and Roof Area Estimation
  • Roof Layout Estimation

 

Distance between Adjacent Buildings

 


This module provides the distance between two adjacent buildings. We sample images from the videos captured by the UAV and perform panoptic segmentation using a state-of-the-art deep learning model, eliminating vegetation (such as trees) from the images. The masked images are then fed to a state-of-the-art image-based 3D reconstruction library, which outputs a dense 3D point cloud. We then apply RANSAC to fit planes to the segmented structural point cloud. Finally, points are sampled on these planes to calculate the distance between the adjacent buildings at different locations.
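
A minimal sketch of the plane-fitting and distance-measurement step, assuming Open3D and a dense point cloud produced by the reconstruction described above ("facades.ply" is a hypothetical file name); the vegetation removal and 3D reconstruction themselves are not shown.

import numpy as np
import open3d as o3d

# Hypothetical file holding the dense, vegetation-free point cloud.
pcd = o3d.io.read_point_cloud("facades.ply")

# Fit the first facade plane with RANSAC, then fit a second plane to the remainder.
plane1, inliers1 = pcd.segment_plane(distance_threshold=0.05, ransac_n=3,
                                     num_iterations=2000)
rest = pcd.select_by_index(inliers1, invert=True)
plane2, inliers2 = rest.segment_plane(distance_threshold=0.05, ransac_n=3,
                                      num_iterations=2000)

# Sample points on the second plane and measure their distance to the first one,
# giving the inter-building gap at several locations (units follow the
# reconstruction's scale; metres only if the reconstruction is metrically scaled).
a, b, c, d = plane1
pts = np.asarray(rest.select_by_index(inliers2).points)
gaps = np.abs(pts @ np.array([a, b, c]) + d) / np.linalg.norm([a, b, c])
print(f"gap: mean={gaps.mean():.2f}, min={gaps.min():.2f}, max={gaps.max():.2f}")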

 

Results: Distance between Adjacent Buildings


Sub-figures (a)-(c) and (d)-(f) show plane fitting using piecewise RANSAC from different views for two subject buildings.

 

Plan Shape and Roof Area Estimation


This module provides information about the shape and roof area of the building. We segment the roof using a state-of-the-art semantic segmentation model. The input images are first passed through a pre-processing module that removes distortion from the wide-angle images, and data augmentation is used to increase robustness and performance. The roof area is calculated from the focal length of the camera, the height of the drone above the roof, and the segmented mask area in pixels.
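
The roof-area computation reduces to a pinhole-camera scaling: for a nadir (top-down) view, each pixel spans (height / focal length) metres on the roof plane. The sketch below illustrates this with entirely hypothetical numbers.

import numpy as np

def roof_area_m2(mask: np.ndarray, drone_height_m: float, focal_length_px: float) -> float:
    """Area covered by a binary roof mask, assuming a nadir (top-down) view.

    Each pixel spans (height / focal_length) metres on the roof plane, so one
    pixel covers (height / focal_length) ** 2 square metres.
    """
    metres_per_pixel = drone_height_m / focal_length_px
    return float(mask.astype(bool).sum()) * metres_per_pixel ** 2

# Hypothetical example: a 600 x 800 pixel roof mask, drone 30 m above the roof,
# focal length of 2500 px.
mask = np.zeros((1000, 1000), dtype=np.uint8)
mask[200:800, 100:900] = 1
print(f"estimated roof area: {roof_area_m2(mask, 30.0, 2500.0):.1f} m^2")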

 

Results: Plan Shape and Roof Area Estimation


This figure shows the roof segmentation results for four subject buildings.

 

Roof Layout Estimation


This module provides information about the roof layout. Since it is not possible to capture the whole roof in a single frame, especially for large buildings, we perform large-scale image stitching of partially visible roofs, followed by NSE detection and roof segmentation.
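
As an illustration of the stitching step, the sketch below uses OpenCV's built-in stitcher on a hypothetical folder of partially overlapping roof frames; the actual pipeline may use a different large-scale stitching method.

import glob
import cv2

# "roof_frames/" is a hypothetical folder of partially overlapping roof images
# sampled from the UAV video.
images = [cv2.imread(p) for p in sorted(glob.glob("roof_frames/*.jpg"))]

# SCANS mode suits planar, top-down captures better than the default panorama mode.
stitcher = cv2.Stitcher_create(cv2.Stitcher_SCANS)
status, stitched_roof = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    # The stitched mosaic is then passed to the roof-segmentation and
    # NSE-detection models to produce the final roof layout.
    cv2.imwrite("stitched_roof.png", stitched_roof)
else:
    print(f"stitching failed with status {status}")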

 

Results: Roof Layout Estimation

[Figures: Stitched Image, Roof Mask, and Object Mask]

 


Contact

If you have any questions, please reach out to any of the above-mentioned authors.

Unsupervised Audio-Visual Lecture Segmentation

Darshan Singh S*, Anchit Gupta*, C.V. Jawahar and Makarand Tapaswi

 

CVIT,   IIIT Hyderabad

WACV, 2023

[ Code ]   | [Dataset ] | [ arXiv ] | [ Demo Video ]

 


We address the task of lecture segmentation in an unsupervised manner. We show an example of a lecture segmented using our method. Our method predicts segments close to the ground truth. Note that our method does not predict the segment labels; they are shown only so that the reader can appreciate the different topics.

Abstract

Over the last decade, online lecture videos have become increasingly popular and have experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing number of online lectures are lacking. Our first contribution is to facilitate research in the educational domain by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation that splits lectures into bite-sized topics. Lecture clip representations leverage visual, textual, and OCR cues and are trained on a pretext self-supervised task of matching the narration with the temporally aligned visual content. We formulate lecture segmentation as an unsupervised task and use these representations to generate segments using a temporally consistent 1-nearest neighbor algorithm, TW-FINCH. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.
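
The sketch below is a much-simplified illustration of temporally weighted 1-nearest-neighbour grouping on per-clip embeddings; it is not the TW-FINCH implementation used in the paper, and the random features are placeholders for the learned clip representations.

import numpy as np

def segment_clips(feats: np.ndarray) -> np.ndarray:
    """feats: (N, D) clip embeddings in temporal order. Returns a segment id per clip."""
    n = len(feats)
    idx = np.arange(n)

    # Feature distance scaled by temporal distance, so each clip prefers a
    # neighbour that is both similar and close in time.
    fdist = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
    tdist = np.abs(idx[:, None] - idx[None, :]).astype(float)
    dist = fdist * tdist
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)                      # 1-nearest neighbour of each clip

    # Link every clip to its nearest neighbour; connected components = segments.
    parent = idx.copy()
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in enumerate(nn):
        parent[find(i)] = find(j)
    roots = np.array([find(i) for i in range(n)])
    _, labels = np.unique(roots, return_inverse=True)
    return labels

print(segment_clips(np.random.rand(20, 512)))     # toy example with random features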

Paper

  • Paper
    Unsupervised Audio-Visual Lecture Segmentation

    Darshan Singh S, Anchit Gupta, C.V. Jawahar and Makarand Tapaswi
    Unsupervised Audio-Visual Lecture Segmentation, WACV, 2023.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Contact

  1. Darshan Singh S
  2. Anchit Gupta

PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition

Pose-based action recognition is predominantly tackled by approaches that treat the input skeleton in a monolithic fashion, i.e. joints in the pose tree are processed as a whole. However, such approaches ignore the fact that action categories are often characterized by localized action dynamics involving only small subsets of joints, such as the hands (e.g. ‘Thumbs up’) or legs (e.g. ‘Kicking’). Although part-grouping based approaches exist, each part group is not considered within the global pose frame, causing such methods to fall short. Further, conventional approaches employ independent modality streams (e.g. joint, bone, joint velocity, bone velocity) and train their network multiple times on these streams, which massively increases the number of training parameters.


The plot on the left shows accuracy against the number of parameters for our proposed architecture PSUMNet (⋆) and existing approaches on the large-scale NTU RGB+D 120 human actions dataset (cross-subject).


Comparison between the conventional training procedure used in most previous approaches (left) and our approach (right).

 

 

To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose a global-frame-based part stream approach as opposed to conventional modality-based streams. Within each part stream, the associated data from multiple modalities is unified and consumed by the processing pipeline. Experimentally, PSUMNet achieves state-of-the-art performance on the widely used NTU RGB+D 60/120 datasets and the dense joint skeleton datasets NTU 60-X/120-X. PSUMNet is highly efficient and outperforms competing methods which use 100%-400% more parameters. PSUMNet also generalizes to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet's scalability, performance, and efficiency make it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices.
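
As a rough illustration of the part-stream idea, the sketch below splits a skeleton tensor into overlapping part groups while keeping every joint in the global pose frame; the joint indices and group definitions are hypothetical, not the NTU RGB+D layout or the paper's exact grouping.

import torch

# Illustrative part groups over a 25-joint skeleton (hypothetical indices).
PART_GROUPS = {
    "body":  [0, 1, 2, 3, 4, 8, 12, 16, 20],
    "hands": [4, 5, 6, 7, 8, 9, 10, 11, 21, 22, 23, 24],
    "legs":  [0, 12, 13, 14, 15, 16, 17, 18, 19],
}

def make_part_streams(skeleton: torch.Tensor) -> dict:
    """skeleton: (C, T, V, M) = channels, frames, joints, persons.

    Each part stream keeps its joints in the global pose frame and is later
    processed by its own (smaller) spatio-temporal network."""
    return {name: skeleton[:, :, idx, :] for name, idx in PART_GROUPS.items()}

streams = make_part_streams(torch.randn(3, 64, 25, 2))   # toy 25-joint input
for name, s in streams.items():
    print(name, tuple(s.shape))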

[ Supplementary ] | [ PDF ] | [ GitHub ]

[PSUMNet architecture diagram]

  1. Overall architecture of one stream of the proposed network. The input skeleton is passed through the Multi-Modality Data Generator (MMDG), which generates joint, bone, joint-velocity, and bone-velocity data from the input and concatenates each modality along the channel dimension, as shown in (b) (see the sketch after this list).
  2. This multi-modal data is processed by the Spatio-Temporal Relational Module (STRM), followed by global average pooling and a fully connected (FC) layer.
  3. Spatio-Temporal Relational Block (STRB), where the input data is passed through the Spatial Attention Map Generator (SAMG) for spatial relation modeling, followed by the Temporal Relational Module. As shown in (a), multiple STRBs stacked together make up the STRM.
  4. The Spatial Attention Map Generator (SAMG) dynamically models an adjacency matrix (A_hyb) to capture spatial relations between joints. A predefined adjacency matrix (A) is used for regularization.
  5. The Temporal Relational Module (TRM) consists of multiple temporal convolution blocks in parallel. The outputs of the temporal convolution blocks are concatenated to generate the final features.
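
The sketch below illustrates the modality unification described in step 1: joint, bone, joint-velocity, and bone-velocity data are derived from the input skeleton and concatenated along the channel dimension. The bone pairs and tensor layout are illustrative assumptions, not the released implementation.

import torch

def mmdg(joints: torch.Tensor, bone_pairs) -> torch.Tensor:
    """joints: (N, C, T, V, M). Returns (N, 4*C, T, V, M) with joint, bone,
    joint-velocity, and bone-velocity stacked along the channel dimension."""
    bones = torch.zeros_like(joints)
    for child, parent in bone_pairs:
        bones[:, :, :, child] = joints[:, :, :, child] - joints[:, :, :, parent]
    joint_vel = torch.zeros_like(joints)
    joint_vel[:, :, 1:] = joints[:, :, 1:] - joints[:, :, :-1]   # frame-to-frame motion
    bone_vel = torch.zeros_like(bones)
    bone_vel[:, :, 1:] = bones[:, :, 1:] - bones[:, :, :-1]
    return torch.cat([joints, bones, joint_vel, bone_vel], dim=1)

x = torch.randn(8, 3, 64, 25, 2)                  # batch, xyz, frames, joints, persons
pairs = [(i, max(i - 1, 0)) for i in range(25)]   # toy kinematic tree, not NTU's
print(mmdg(x, pairs).shape)                       # torch.Size([8, 12, 64, 25, 2])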

 
