
FaceOff: A Video-to-Video Face Swapping System

Aditya Agarwal*1, Bipasha Sen*1, Rudrabha Mukhopadhyay1, Vinay P Namboodiri2, C V Jawahar1

1IIIT Hyderabad, India
2University of Bath, UK

* indicates equal contribution

[Paper]       [Video]     [Code]

 

 

Abstract

[Teaser figure]

Doubles play an indispensable role in the movie industry. They take the place of the actors in dangerous stunt scenes or in scenes where the same actor plays multiple characters. The double's face is later replaced with the actor's face and expressions manually using expensive CGI technology, costing millions of dollars and taking months to complete. An automated, inexpensive, and fast alternative is to use face-swapping techniques, which aim to swap an identity from a source face video (or an image) onto a target face video. However, such methods cannot preserve the actor's source expressions, which are important for the scene's context. To tackle this challenge, we introduce video-to-video (V2V) face-swapping, a novel face-swapping task that preserves (1) the identity and expressions of the source (actor) face video and (2) the background and pose of the target (double) video. We propose FaceOff, a V2V face-swapping system that operates by learning a robust blending operation to merge two face videos following the constraints above. It reduces the videos to a quantized latent space and then blends them in the reduced space. FaceOff is trained in a self-supervised manner and robustly tackles the non-trivial challenges of V2V face-swapping. As shown in the experiments, FaceOff significantly outperforms alternative approaches both qualitatively and quantitatively.

 

Overview

[Figure: FaceOff architecture]

Swapping faces across videos is non-trivial because it involves merging two different motions: the actor's face motion and the double's head motion. This requires a network that can take two different motions as input and produce a third, coherent motion. FaceOff is a video-to-video face-swapping system that reduces the face videos to a quantized latent space and blends them in the reduced space. A fundamental challenge in training such a network is the absence of ground truth. FaceOff therefore uses a self-supervised training strategy: a single video serves as both the source and the target, pseudo motion errors are introduced on the source video, and the network is trained to fix these pseudo errors and regenerate the source video. To do this, we learn to blend the foreground of the source video with the background and pose of the target face video such that the blended output is coherent and meaningful.

We use a temporal autoencoding module that merges the motion of the source and the target video in a quantized latent space. We propose a modified vector-quantized encoder with temporal modules made of non-linear 3D convolution operations to encode the video into this quantized latent space. The input to the encoder is a single video made by concatenating the source foreground and target background frames channel-wise. The encoder first encodes the concatenated video frame-wise into 32x32 and 64x64 dimensional top and bottom hierarchies, respectively. Before the quantization step at each hierarchy, the temporal modules process the reduced video frames; this allows the network to backpropagate with temporal connections between the frames. The decoder then decodes the reduced frames, supervised by a distance loss against the ground-truth video. The output is a temporally and spatially coherent blended video of the source foreground and the target background.
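To make the blending operation concrete, below is a minimal PyTorch sketch of the idea: per-frame encoding of the channel-wise concatenated input, a 3D-convolutional temporal module before quantization, and a decoder that reconstructs blended frames. It is an illustration under simplifying assumptions (a single quantization hierarchy, toy channel sizes, and a hypothetical TemporalVQBlender class), not the FaceOff implementation.

import torch
import torch.nn as nn

class TemporalVQBlender(nn.Module):
    def __init__(self, codebook_size=512, latent_dim=64):
        super().__init__()
        # Per-frame 2D encoder; the input has 6 channels because the source
        # foreground and target background frames are concatenated channel-wise.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 4, stride=2, padding=1),
        )
        # Temporal module: non-linear 3D convolutions applied before
        # quantization so gradients flow across frames.
        self.temporal = nn.Sequential(
            nn.Conv3d(latent_dim, latent_dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(latent_dim, latent_dim, 3, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def quantize(self, z):
        # Nearest-codebook lookup with a straight-through gradient estimator.
        n, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view(n, h, w, c).permute(0, 3, 1, 2)
        return z + (q - z).detach()

    def forward(self, src_fg, tgt_bg):
        # src_fg, tgt_bg: (B, T, 3, H, W) aligned source-foreground and
        # target-background videos.
        b, t, _, h, w = src_fg.shape
        x = torch.cat([src_fg, tgt_bg], dim=2).reshape(b * t, 6, h, w)
        z = self.encoder(x)                                 # (B*T, C, h', w')
        c, hh, ww = z.shape[1:]
        z = z.view(b, t, c, hh, ww).permute(0, 2, 1, 3, 4)  # (B, C, T, h', w')
        z = self.temporal(z)                                # mix the two motions over time
        z = z.permute(0, 2, 1, 3, 4).reshape(b * t, c, hh, ww)
        out = self.decoder(self.quantize(z))                # blended frames
        return out.view(b, t, 3, h, w)

# Self-supervised training (sketch): use one video as both source and target,
# perturb the source with pseudo motion errors, and minimize a distance loss
# between the blended output and the original video.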

 

FaceOff Video-to-Video Face Swapping

[Video-to-video face swapping examples]

 

Training Pipeline

[Figure: training pipeline]

Inference Pipeline

[Figure: inference pipeline]

Results on Unseen Identities

[Result videos on unseen identities]

Comparisons

[Comparison videos]

 

Results on Same Identity

[Result videos on the same identity]

 

Some More Results

[Additional result videos]

 

Citation

@misc{agarwal2023faceoff,
  doi = {10.48550/ARXIV.2208.09788},
  url = {https://arxiv.org/abs/2208.09788},
  author = {Agarwal, Aditya and Sen, Bipasha and Mukhopadhyay, Rudrabha and Namboodiri, Vinay and Jawahar, C. V.},
  keywords = {Computer Vision and Pattern Recognition (cs.CV)},
  title = {FaceOff: A Video-to-Video Face Swapping System},
  publisher = {IEEE/CVF Winter Conference on Applications of Computer Vision},
  year = {2023},
}

 

INR-V: A Continuous Representation Space for Video-based Generative Tasks


Bipasha Sen*1, Aditya Agarwal*1, Vinay P Namboodiri2 and C.V. Jawahar

1IIIT Hyderabad, India

2University of Bath, UK

* indicates equal contribution

TMLR, 2022

[ Paper ]   | [ Video ] | [ Inference Code ] | [ OpenReview ]

Abstract

[Banner video]

Generating videos is a complex task that is typically accomplished by generating a set of temporally coherent images frame-by-frame. This limits the expressivity of videos to image-based operations on the individual frames and requires network designs that obtain temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using implicit neural representations (INRs): a multi-layer perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted using a meta-network, a hypernetwork trained on neural representations of multiple video instances. The meta-network can later be sampled to generate diverse novel videos, enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space and exhibits many interesting properties not possible with existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also in-paint missing portions in videos to recover temporally coherent full videos. In this work, we evaluate the space learned by INR-V on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting against existing baselines. INR-V significantly outperforms the baselines on several of these tasks, clearly showing the potential of the proposed representation space.
 

Overview

[Figure: INR-V architecture overview]

We parameterize videos as a function of space and time using implicit neural representations (INRs). Any point V(h, w, t) in a video can be represented by a function fΘ : (h, w, t) → RGB(h, w, t), where t denotes the t-th frame of the video, (h, w) denotes the spatial location within the frame, and RGB denotes the color at the pixel position (h, w, t). Subsequently, the dynamic dimension of videos (a few million pixels) is reduced to a constant number of weights Θ (a few thousand) required for the parameterization. A network can then learn a prior over videos in this parameterized space. This is obtained through a meta-network that learns a function mapping a latent space to the reduced parameter space, which in turn maps to a video; a complete video is thus represented as a single latent point. We use a meta-network called a hypernetwork that learns a continuous function over the INRs by being trained on multiple video instances with a distance loss. However, hypernetworks are notoriously unstable to train, especially when parameterizing highly expressive signals such as videos. We therefore propose key prior regularization and a progressive weight initialization scheme that stabilize hypernetwork training, allowing it to scale quickly to more than 30,000 videos. The learned prior enables several downstream tasks such as novel video generation, video inversion, future segment prediction, video inpainting, and smooth video interpolation directly at the video level.
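To make the parameterization concrete, here is a minimal sketch of a coordinate MLP whose weights are produced by a hypernetwork from a per-video latent code. The layer sizes, latent dimension, and plain-ReLU INR are illustrative assumptions (the class and helper names are hypothetical), and the prior regularization and progressive weight initialization used by INR-V are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

INR_DIMS = [3, 64, 64, 3]   # (h, w, t) -> (R, G, B)

def inr_param_count(dims):
    # Total number of weights and biases in the coordinate MLP.
    return sum(dims[i] * dims[i + 1] + dims[i + 1] for i in range(len(dims) - 1))

class HyperINR(nn.Module):
    """Maps a per-video latent code to the weights of a small coordinate MLP."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.hyper = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, inr_param_count(INR_DIMS)),
        )

    def forward(self, z, coords):
        # z: (latent_dim,) one latent point = one video
        # coords: (N, 3) normalized (h, w, t) locations to render
        theta = self.hyper(z)
        x, offset = coords, 0
        for i in range(len(INR_DIMS) - 1):
            d_in, d_out = INR_DIMS[i], INR_DIMS[i + 1]
            W = theta[offset:offset + d_in * d_out].view(d_out, d_in)
            offset += d_in * d_out
            b = theta[offset:offset + d_out]
            offset += d_out
            x = F.linear(x, W, b)
            if i < len(INR_DIMS) - 2:
                x = torch.relu(x)
        return torch.sigmoid(x)   # RGB in [0, 1]

# Usage: render a 16-frame 32x32 video from a sampled latent point.
model = HyperINR()
z = torch.randn(128)
t, h, w = torch.meshgrid(
    torch.linspace(-1, 1, 16), torch.linspace(-1, 1, 32),
    torch.linspace(-1, 1, 32), indexing="ij")
coords = torch.stack([h, w, t], dim=-1).view(-1, 3)
video = model(z, coords).view(16, 32, 32, 3)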

 

[Result figures: comparisons with baselines, video inversion, video inpainting, super-resolution, and additional results]

 

Additional Interpolation Results

[Interpolation grid videos]

 

Citation

@article{sen2022inrv,
  title={{INR}-V: A Continuous Representation Space for Video-based Generative Tasks},
  author={Bipasha Sen and Aditya Agarwal and Vinay P Namboodiri and C.V. Jawahar},
  journal={Transactions on Machine Learning Research},
  year={2022},
  url={https://openreview.net/forum?id=aIoEkwc2oB},
}

 


 


Watching the News: Towards VideoQA Models that can Read

Soumya Jahagirdar†, Minesh Mathew†, Dimosthenis Karatzas‡, C. V. Jawahar†

†CVIT, IIIT Hyderabad, India
‡Computer Vision Center, UAB, Spain

[Paper]       [Video]     [Code]     [Dataset]

 

 

Abstract

[Teaser figure]

We address the task of text-based Video Question Answering, incorporating VideoText (the textual content embedded in the videos). We propose a new dataset of news videos with QA annotations grounded on video text, and explore VQA models that jointly reason over temporal and text-based information.

 

Video Question Answering methods focus on commonsense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches, however, ignore the textual information present in the video. We argue instead that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel VideoQA task that requires reading and understanding the text in the video. To explore this direction, we focus on news videos and require QA systems to comprehend and answer questions about the topics presented by combining visual and textual cues in the video. We introduce the “NewsVideoQA” dataset that comprises more than 8,600 QA pairs on 3,000+ news videos obtained from diverse news channels from around the world. We demonstrate the limitations of current Scene Text VQA and VideoQA methods and propose ways to incorporate scene text information into VideoQA methods.

 

UAV-based Visual Remote Sensing for Automated Building Inspection (UVRSABI)

 

Kushagra Srivastava, Dhruv Patel, Aditya Kumar Jha, Mohit Kumar Jha, Jaskirat Singh, Ravi Kiran Sarvadevabhatla, Harikumar Kandath, Pradeep Kumar Ramancharla, K. Madhava Krishna

[Paper]     [Documentation]      [GitHub]

 

[Figure: system overview]

 

Architecture of automated building inspection using aerial images captured by a UAV. The UAV's odometry information is also used to quantify the different parameters involved in the inspection.

 


Overview

  • We automate the inspection of buildings through UAV-based image data collection and a post-processing module that infers and quantifies structural details, avoiding manual inspection and reducing time and cost.
  • We introduce a novel method to estimate the distance between adjacent buildings and structures.
  • We develop an architecture that segments rooftops in both orthogonal and non-orthogonal views using a state-of-the-art semantic segmentation model.
  • Considering the importance of civil inspection of buildings, we introduce a software library that helps estimate the distance between adjacent buildings, the plan shape of a building, the roof area, non-structural elements (NSE) on the rooftop, and the roof layout.

 

Modules

To estimate the seismic structural parameters of buildings, the following modules have been introduced:

  • Distance between Adjacent Buildings
  • Plan Shape and Roof Area Estimation
  • Roof Layout Estimation

 

Distance between Adjacent Buildings

 

[Figure: distance estimation module]

This module provides the distance between two adjacent buildings. We sample images from the videos captured by the UAV and perform panoptic segmentation using a state-of-the-art deep learning model, eliminating vegetation (such as trees) from the images. The masked images are then fed to a state-of-the-art image-based 3D reconstruction library, which outputs a dense 3D point cloud. We then apply RANSAC to fit planes to the segmented structural point cloud. Finally, points are sampled on these planes to calculate the distance between the adjacent buildings at different locations.
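For illustration, a rough sketch of the plane-fitting and distance computation is given below, assuming Open3D is available and that the two facing facades have already been segmented into separate point clouds. The function and variable names are hypothetical; this is not the UVRSABI implementation.

import numpy as np
import open3d as o3d

def fit_plane(points_xyz, dist_thresh=0.05):
    """RANSAC plane fit; returns (a, b, c, d) with ax + by + cz + d = 0 and the inlier points."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points_xyz)
    model, inliers = pcd.segment_plane(distance_threshold=dist_thresh,
                                       ransac_n=3, num_iterations=1000)
    return np.asarray(model), points_xyz[inliers]

def facade_distances(points_a, points_b, n_samples=200):
    """Distance between two adjacent facades, measured at several sampled locations."""
    plane_b, _ = fit_plane(points_b)
    a, b, c, d = plane_b
    normal = np.array([a, b, c])
    # Sample points on facade A and measure point-to-plane distance to facade B.
    idx = np.random.choice(len(points_a), size=min(n_samples, len(points_a)),
                           replace=False)
    sampled = points_a[idx]
    dists = np.abs(sampled @ normal + d) / np.linalg.norm(normal)
    return dists  # per-location gaps; mean/min give summary statistics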

 

Results: Distance between Adjacent Buildings

[Figure: plane-fitting results]

Sub-figures (a)-(c) and (d)-(f) show plane fitting using piecewise RANSAC from different views for two subject buildings.

 

Plan Shape and Roof Area Estimation

[Figure: plan shape and roof area estimation module]

This module provides information about the shape and roof area of the building. We segment the roof using a state-of-the-art semantic segmentation model. The input images are first passed through a pre-processing module that removes distortions from the wide-angle images, and data augmentation is used to improve robustness and performance. The roof area is calculated from the camera's focal length, the height of the drone above the roof, and the segmented mask area in pixels.
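As a back-of-the-envelope illustration of the last step: under a pinhole camera model with the UAV looking straight down, each pixel covers roughly height / focal-length metres on the roof plane, so the roof area follows directly from the mask area in pixels. The exact formulation used in UVRSABI may differ; the helper below is a hypothetical sketch.

def roof_area_m2(mask_pixel_count: int, height_m: float, focal_px: float) -> float:
    """Roof area from a binary roof mask.

    mask_pixel_count: number of pixels labelled 'roof' in the segmentation mask
    height_m:         UAV height above the roof plane, in metres
    focal_px:         camera focal length expressed in pixels
    """
    ground_sampling_distance = height_m / focal_px   # metres covered by one pixel
    return mask_pixel_count * ground_sampling_distance ** 2

# Example: 250,000 roof pixels, drone 20 m above the roof, focal length 2000 px
# -> each pixel covers 0.01 m, so the roof area is 250000 * 0.0001 = 25 m^2.
print(roof_area_m2(250_000, height_m=20.0, focal_px=2000.0))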

 

Results: Plan Shape and Roof Area Estimation

[Figure: roof segmentation results]

This figure represents the roof segmentation results for 4 subject buildings.

 

Roof Layout Estimation

[Figure: roof layout estimation module]

This module provides information about the roof layout. Since it is not possible to capture the whole roof in a single frame, especially for large buildings, we perform large-scale image stitching of the partially visible roofs, followed by NSE detection and roof segmentation.
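As a minimal illustration of the stitching step, OpenCV's high-level Stitcher can compose overlapping nadir frames into a single roof image. The UVRSABI pipeline may use a different or custom large-scale stitcher, and the input path below is hypothetical.

import cv2
import glob

# Load the sampled roof frames (hypothetical directory).
frames = [cv2.imread(p) for p in sorted(glob.glob("roof_frames/*.jpg"))]

# SCANS mode suits roughly planar, top-down imagery.
stitcher = cv2.Stitcher_create(cv2.Stitcher_SCANS)
status, panorama = stitcher.stitch(frames)
if status == cv2.Stitcher_OK:
    cv2.imwrite("stitched_roof.jpg", panorama)
    # The stitched roof image is then passed to NSE detection and roof segmentation.
else:
    print(f"Stitching failed with status {status}")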

 

Results: Roof Layout Estimation

[Figures: stitched image, roof mask, and object mask]

 


Contact

If you have any questions, please reach out to any of the above-mentioned authors.

Unsupervised Audio-Visual Lecture Segmentation

Darshan Singh S*, Anchit Gupta*, C.V. Jawahar and Makarand Tapaswi

 

CVIT,   IIIT Hyderabad

WACV, 2023

[ Code ] | [ Dataset ] | [ arXiv ] | [ Demo Video ]

 

[Figure: architecture]

We address the task of lecture segmentation in an unsupervised manner. We show an example of a lecture segmented using our method; the predicted segments are close to the ground truth. Note that our method does not predict the segment labels; they are shown only so that the reader can appreciate the different topics.

Abstract

Over the last decade, online lecture videos have become increasingly popular and have experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing body of online lectures are lacking. Our first contribution is to facilitate research in the educational domain by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation, which splits lectures into bite-sized topics. Lecture clip representations leverage visual, textual, and OCR cues and are trained on a pretext self-supervised task of matching the narration with the temporally aligned visual content. We formulate lecture segmentation as an unsupervised task and use these representations to generate segments using a temporally consistent 1-nearest neighbor algorithm, TW-FINCH. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.
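For intuition, the sketch below shows a simplified version of the temporally weighted first-neighbor grouping idea behind TW-FINCH: each clip links to its nearest neighbor under a distance that blends feature dissimilarity with temporal distance, and connected components of these links form segments. The actual algorithm applies this linking recursively and with its own temporal weighting, so treat this only as an illustration; the function name is hypothetical.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def first_neighbor_segments(clip_feats: np.ndarray, temporal_weight: float = 1.0):
    """clip_feats: (N, D) L2-normalized clip embeddings in temporal order."""
    n = len(clip_feats)
    t = np.arange(n, dtype=np.float64) / n
    # Distance = cosine distance scaled up by normalized temporal distance.
    feat_dist = 1.0 - clip_feats @ clip_feats.T
    time_dist = np.abs(t[:, None] - t[None, :])
    dist = feat_dist * (1.0 + temporal_weight * time_dist)
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)                  # first (temporally weighted) neighbor
    # Clips connected through first-neighbor links form one segment.
    adj = csr_matrix((np.ones(n), (np.arange(n), nn)), shape=(n, n))
    _, labels = connected_components(adj, directed=False)
    return labels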

Paper

  • Paper
    Unsupervised Audio-Visual Lecture Segmentation

    Darshan Singh S, Anchit Gupta, C.V. Jawahar and Makarand Tapaswi
    Unsupervised Audio-Visual Lecture Segmentation, WACV, 2023.
    [PDF ] | [BibTeX]

    Updated Soon

Demo

--- COMING SOON ---


Contact

  1. Darshan Singh S
  2. Anchit Gupta
