
Towards Generalization in Multi-View Pedestrian Detection


Jeet Vora

Abstract

Detecting humans in images and videos has emerged as an essential component of intelligent video systems that address pedestrian detection, tracking, crowd counting, etc. It has many real-life applications ranging from visual surveillance and sports to autonomous driving. Despite achieving high performance, single-camera detection methods are susceptible to occlusions caused by humans, which drastically degrades performance when crowd density is very high. A multi-camera setup therefore becomes necessary: it combines multiple camera views to compute precise 3D locations that can be visualized in a Top View, also termed Bird's Eye View (BEV), representation, permitting better occlusion reasoning in crowded scenes. This thesis accordingly presents a multi-camera approach that globally aggregates multi-view cues for detection and alleviates the impact of occlusions in crowded environments. However, it remained largely unknown how well multi-view detectors generalize to unseen data and camera setups. This is critical in practice, because a usable multi-view detector should handle scenarios such as i) a model trained with a few camera views being deployed when one of the cameras fails during testing/inference, or when more camera views are added to the existing setup, ii) camera positions changing within the same environment, and iii) deployment in an unseen environment; an ideal multi-camera system should adapt to such changing conditions. While recent deep-learning works have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. We formalize three critical forms of generalization and outline experiments to evaluate them: generalization i) to a varying number of cameras, ii) to varying camera positions, and iii) to new scenes. We discover that existing state-of-the-art models generalize poorly, overfitting to a single scene and camera configuration. To address these concerns, (a) we generate a novel Generalized MVD (GMVD) dataset assimilating diverse scenes with changing times of day, camera configurations, and varying numbers of cameras, and (b) we discuss the properties essential for generalization in MVD and develop a barebones model that incorporates them. We perform a series of experiments on the WildTrack, MultiViewX, and GMVD datasets to motivate the need to evaluate the generalization abilities of MVD methods and to demonstrate the efficacy of the developed approach.
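To make the aggregation idea concrete, below is a minimal, hypothetical sketch (not the thesis implementation) of the kind of camera-agnostic BEV pooling the abstract argues for: per-view CNN features are warped onto a shared ground-plane grid with calibration-derived homographies and averaged, so the detector does not depend on a fixed number or ordering of cameras. The kornia dependency, tensor shapes, and layer choices are illustrative assumptions.

# A minimal sketch of camera-agnostic multi-view BEV aggregation (assumptions only).
import torch
import torch.nn as nn
from kornia.geometry.transform import warp_perspective


class MultiViewBEVAggregator(nn.Module):
    def __init__(self, feat_dim=128, bev_size=(120, 360)):
        super().__init__()
        self.bev_size = bev_size
        # Head applied after aggregation, so weights are shared across any camera set.
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=3, padding=1)

    def forward(self, view_feats, homographies):
        # view_feats: (num_views, C, H, W) image-plane features from a shared CNN
        # homographies: (num_views, 3, 3) projective transforms relating each image
        # plane to the BEV grid (convention follows the warping library)
        bev = warp_perspective(view_feats, homographies, dsize=self.bev_size)
        # Average (rather than concatenate) across views, so any number of cameras
        # can be used at test time.
        bev = bev.mean(dim=0, keepdim=True)       # (1, C, Hb, Wb)
        return self.head(bev)                     # (1, 1, Hb, Wb) occupancy scores


feats = torch.randn(4, 128, 90, 160)             # 4 views, hypothetical feature sizes
H = torch.eye(3).repeat(4, 1, 1)                 # placeholder homographies
scores = MultiViewBEVAggregator()(feats, H)

Averaging the warped views, instead of concatenating them along the channel dimension, is the design choice that lets the same weights be reused when cameras are added, dropped, or repositioned.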

Year of completion: April 2023
Advisor: Vineet Gandhi

Related Publications


    Downloads

    thesis

Does Audio help in deep Audio-Visual Saliency prediction models?


Ritvik Agrawal

Abstract

The task of saliency prediction focuses on understanding and modeling human visual attention (HVA), i.e., where and what people pay attention to given visual stimuli. Audio has ideal properties to aid gaze fixation while viewing a scene. There exists substantial evidence of audio-visual interplay in human perception, and it is agreed that the two jointly guide our visual attention. Learning computational models for saliency estimation is an effort to inch machines/robots closer to human cognitive abilities. Saliency prediction is helpful in many digital content-based applications such as automated editing, perceptual video coding, and human-robot interaction. The field has progressed from hand-crafted features to deep learning-based solutions. Static image saliency prediction is led by convolutional architectures; these ideas were extended to videos by integrating temporal information using 3D convolutions or LSTMs. Many sophisticated multimodal, multi-stream architectures have been proposed to process multimodal information for saliency prediction. Although existing Audio-Visual Saliency Prediction (AVSP) models claim to achieve promising results by fusing the audio modality on top of visual-only models, most models effectively consider only visual cues and fail to leverage the auditory information that is ubiquitous in dynamic scenes. In this thesis, we investigate the relevance of audio cues in conjunction with visual ones and conduct extensive analysis of why AVSP models appear superior, employing well-established audio modules and fusion techniques from diverse, correlated audio-visual tasks. Our analysis on ten diverse saliency datasets suggests that none of these methods succeeds in incorporating audio: augmenting audio features ends up learning a predictive model that is agnostic to audio. Furthermore, we bring to light why AVSP models show a gain in performance over visual-only models even though the audio branch is agnostic at inference. Our experiments clearly indicate that the visual modality dominates learning and that current models largely ignore the audio information. The observation is consistent across three different audio backbones and four different fusion techniques, and contrasts with previous methods, which claim audio to be a significant contributing factor. The performance gains are a byproduct of improved training, and the additional audio branch appears to have a regularizing effect; we show that similar gains are achieved when sending random audio during training. Overall, our work questions the role of audio in current deep AVSP models and, by demonstrating that simpler alternatives work equally well, motivates the community to reconsider these complex architectures.
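The random-audio control mentioned above can be illustrated with a small, hypothetical sketch (assumed layer sizes and a simple concatenation fusion, not the thesis code): if training the fused model with noise in place of real audio yields comparable gains, the audio branch is acting as a regularizer rather than contributing information.

# A minimal sketch of the random-audio control experiment (assumptions only).
import torch
import torch.nn as nn


class ToyAVSaliency(nn.Module):
    def __init__(self, vis_dim=256, aud_dim=128):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + aud_dim, vis_dim)    # simple concat fusion
        self.decode = nn.Conv2d(vis_dim, 1, kernel_size=1)   # per-pixel saliency map

    def forward(self, vis_feat, aud_feat):
        # vis_feat: (B, C, H, W) video features; aud_feat: (B, A) audio features
        B, C, H, W = vis_feat.shape
        aud = aud_feat[:, :, None, None].expand(B, -1, H, W)
        fused = torch.cat([vis_feat, aud], dim=1).permute(0, 2, 3, 1)
        fused = self.fuse(fused).permute(0, 3, 1, 2)
        return self.decode(fused)


model = ToyAVSaliency()
vis = torch.randn(2, 256, 28, 28)
real_audio = torch.randn(2, 128)              # stand-in for a learned audio embedding
random_audio = torch.randn_like(real_audio)   # the control condition from the abstract
out_real, out_rand = model(vis, real_audio), model(vis, random_audio)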

Year of completion: April 2023
Advisor: Vineet Gandhi

Related Publications


Downloads

thesis

Development of Annotation Guidelines, Datasets and Deep Networks for Palm Leaf Manuscript Layout Understanding


Sowmya Aitha

Abstract

Ancient paper documents and palm leaf manuscripts from the Indian subcontinent have made a significant contribution to world literature and culture. These documents often have complex, uneven, and irregular layouts. Digitizing and deciphering the content of these documents without human intervention poses difficulties across a broad range of factors, including language, script, layout, elements, position, and the number of manuscripts per image. Large-scale annotated Indic manuscript image datasets are needed for this kind of research. To meet this objective, we present Indiscapes, the first dataset containing multi-regional layout annotations for ancient Indian manuscripts. We also adapt a fully convolutional deep neural network architecture for fully automatic, instance-level spatial layout parsing of manuscript images, in order to deal with challenges such as the presence of dense, irregular layout elements, pictures, multiple documents per image, and a wide variety of scripts. We demonstrate the effectiveness of the proposed architecture on images from the Indiscapes dataset. Despite these advancements, semantic layout segmentation with typical deep network methods is not robust to the complex deformations observed across semantic regions. This problem is particularly evident in the resource-limited domain of Indian palm-leaf manuscripts. We therefore present Indiscapes2, a new, expansive dataset of various Indic manuscripts with semantic layout annotations. Indiscapes2 is 150% larger than Indiscapes and contains materials from four different historical collections. In addition, we propose a novel deep network called Palmira for reliable, deformation-aware region segmentation in handwritten manuscripts. As a performance metric, we additionally report a boundary-centric measure, the Hausdorff distance, and its variants. Our experiments show that Palmira produces reliable layouts and outperforms both strong baselines and ablative versions. We also report results on Arabic, South-East Asian, and Hebrew historical manuscripts to showcase the generalization capability of Palmira. Even with reliable deep-network approaches for manuscript layout understanding, these models implicitly assume one or two manuscripts per image, whereas in real-world scenarios multiple manuscripts are typically scanned together to maximise scanner surface area and reduce manual labour. Ensuring that each individual manuscript within a scanned image can be isolated (segmented) on a per-instance basis thus becomes the first essential step in understanding its content, so a precursor system is needed to extract individual manuscripts before downstream processing. The highly curved and deformed boundaries of manuscripts, which frequently cause them to overlap, add further complexity. To address these issues, we introduce a new document image dataset named IMMI (Indic Multi Manuscript Images), and we present a method that generates synthetic images to augment the sourced non-synthetic images, boosting the effectiveness of the dataset and facilitating deep network training. Adapted versions of current document instance segmentation frameworks are used in our experiments, and the results demonstrate their efficacy for the task.
Overall, our contributions enable robust extraction of individual historical manuscript pages. This, in turn, could enable better performance on downstream tasks such as region-level instance segmentation, optical character recognition, and word spotting in historical Indic manuscripts at scale.
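As a rough illustration of the boundary-centric evaluation mentioned above, the following sketch computes a symmetric Hausdorff distance between predicted and ground-truth region boundaries using SciPy; the toy masks and the boundary extraction are illustrative assumptions, not the exact protocol used in the thesis.

# A minimal sketch of a symmetric Hausdorff distance between region boundaries.
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import directed_hausdorff


def boundary_points(mask):
    # Return (N, 2) coordinates of boundary pixels of a binary region mask.
    boundary = mask & ~binary_erosion(mask)
    return np.argwhere(boundary)


def hausdorff(pred_mask, gt_mask):
    p, g = boundary_points(pred_mask), boundary_points(gt_mask)
    # Symmetric Hausdorff: worst-case distance from either boundary to the other.
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])


pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:50] = True   # toy prediction
gt = np.zeros((64, 64), dtype=bool); gt[12:42, 8:48] = True        # toy ground truth
print(hausdorff(pred, gt))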

Year of completion: May 2023
Advisor: Ravi Kiran Sarvadevabhatla

Related Publications


Downloads

thesis

Situation Recognition for Holistic Video Understanding


Zeeshan Kha

Abstract

Video is a complex modality consisting of multiple events, complex actions, humans, objects, and their interactions densely entangled over time. Understanding videos has been a core and one of the most challenging problems in computer vision and machine learning. What makes it even harder is the lack of a structured formulation of the task, especially for long videos consisting of multiple events and diverse scenes. Prior works in video understanding have addressed the problem only in sparse, uni-dimensional ways, for example action recognition, spatio-temporal grounding, question answering, and free-form captioning. However, holistic understanding is required to fully capture all the events, actions, and relations between entities, and to represent any natural scene in detail and in the most faithful way. It requires answering questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) through semantic role labeling was framed as a task for the structured prediction of multiple events, their relationships, their actions, and the verb-role pairs attached to descriptive entities. This is one of the densest video understanding tasks, posing challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs; it also faces evaluation challenges because the roles are represented by free-form captions. In this work, we propose adding spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, without requiring ground-truth bounding boxes. Since evaluating free-form captions can be difficult and imprecise, this not only improves the current formulation and the evaluation setup, but also improves the interpretability of the model's decisions, because grounding allows us to visualise where the model is looking while generating a caption. To this end, we present a novel three-stage Transformer model, VideoWhisperer, that makes joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips, enabling fine-grained spatio-temporal reasoning. In the second stage, verb-role queries attend to and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions describing each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on the fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time.
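A minimal, hypothetical sketch of the second-stage idea (assumed dimensions and a generic cross-attention layer, not the actual VideoWhisperer code): learned verb-role queries attend over the contextualised video and object embeddings, and the attention weights over object tokens are the kind of signal that can be read off as weakly supervised grounding.

# A minimal sketch of verb-role queries pooling evidence via cross-attention.
import torch
import torch.nn as nn


class VerbRoleQueryPooler(nn.Module):
    def __init__(self, dim=512, num_roles=6, num_heads=8):
        super().__init__()
        # One learned query per semantic role (e.g. Arg0, Arg1, location, ...).
        self.role_queries = nn.Parameter(torch.randn(num_roles, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, context):
        # context: (B, N, dim) contextualised video + object embeddings from stage one
        B = context.size(0)
        q = self.role_queries.unsqueeze(0).expand(B, -1, -1)   # (B, num_roles, dim)
        pooled, attn_weights = self.attn(q, context, context)
        # attn_weights over object tokens can serve as weak grounding evidence.
        return pooled, attn_weights


context = torch.randn(2, 40, 512)          # 40 video/object tokens, hypothetical
pooled, weights = VerbRoleQueryPooler()(context)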

Year of completion: May 2023
Advisors: C V Jawahar, Makarand Tapaswi

Related Publications


Downloads

thesis

Computer Vision based Large Scale Urban Mobility Audit and Parametric Road Scene Parsing


Durga Nagendra Raghava Kumar Modhugu

Abstract

The footprint of partially or fully autonomous vehicles is increasing gradually with time. The existence and availability of the necessary modern infrastructure are crucial for the widespread use of autonomous navigation. One of the most critical efforts in this direction is to build and maintain HD maps efficiently and accurately. The information in HD maps is organized in several levels: 1) a geometric layer, 2) a semantic layer, and 3) a map priors layer. Conventional approaches to capturing and extracting information at the different HD map levels rely heavily on large sensor networks and manual annotation, which does not scale to creating HD maps for massive road networks. In this work, we propose two novel solutions to the above problems. The first deals with generating the geometric layer with parametric information of the road scene, and the second updates information on road infrastructure and traffic violations in the semantic layer.

Firstly, creating the geometric layer of the HD map requires understanding the road layout in terms of structure, number of lanes, lane width, curvature, etc. Predicting these attributes as part of a generalizable parametric model, from which the road layout can be rendered, suits the creation of a geometric layer. Many previous works that tried to solve this problem rely only on ground imagery and are limited by the narrow field of view of the camera, occlusions, and perspective foreshortening. This work demonstrates the effectiveness of using aerial imagery as an additional modality to overcome these challenges. We propose a novel architecture, Unified, that combines aerial and ground imagery features to infer scene attributes. We evaluate quantitatively on the KITTI dataset and show that our Unified model outperforms prior works. Since this dataset is limited to road scenes close to the vehicle, we supplement the publicly available Argoverse dataset with scene attribute annotations and evaluate far-away scenes. We show quantitatively and qualitatively the importance of aerial imagery in understanding road scenes, especially in regions farther away from the ego-vehicle.

Finally, we also propose a simple mobile imaging setup to address and audit several common problems in urban mobility and road safety, which can enrich the information in the semantic layer of HD maps. Recent computer vision techniques are used to identify street irregularities (including missing lane markings and potholes), the absence of street lights, and defective traffic signs from videos obtained by a moving camera-mounted vehicle. Beyond the inspection of static road infrastructure, we also demonstrate the applicability of mobile imaging solutions to spotting traffic violations. We validate our proposal on long stretches of unconstrained road scenes covering over 2000 km and discuss practical challenges in applying computer vision techniques at such a scale. An exhaustive evaluation is carried out on 257 long stretches with unconstrained settings and 20 condition-based hierarchical frame-level labels for different timings, weather conditions, road types, traffic densities, and states of road damage. For the first time, we demonstrate that large-scale analytics of irregular road infrastructure is feasible with existing computer vision techniques.
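As a rough sketch of the two-branch idea behind the geometric-layer work (toy encoders and attribute heads chosen here as assumptions, not the actual Unified architecture): features from a ground-view image and an aligned aerial patch are fused to predict parametric road-layout attributes such as lane count and curvature.

# A minimal sketch of fusing ground and aerial imagery for road-layout attributes.
import torch
import torch.nn as nn


def small_encoder(out_dim=256):
    # Tiny stand-in for a real image backbone (e.g. a ResNet) used in each branch.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, out_dim),
    )


class GroundAerialAttributes(nn.Module):
    def __init__(self, feat_dim=256, max_lanes=6):
        super().__init__()
        self.ground_enc = small_encoder(feat_dim)   # perspective (ego-view) branch
        self.aerial_enc = small_encoder(feat_dim)   # aerial / satellite branch
        self.lanes_head = nn.Linear(2 * feat_dim, max_lanes)  # lane-count logits
        self.curve_head = nn.Linear(2 * feat_dim, 1)          # curvature regression

    def forward(self, ground_img, aerial_img):
        f = torch.cat([self.ground_enc(ground_img), self.aerial_enc(aerial_img)], dim=1)
        return self.lanes_head(f), self.curve_head(f)


g = torch.randn(2, 3, 224, 224)   # ground-view images (sizes are assumptions)
a = torch.randn(2, 3, 256, 256)   # aligned aerial patches
lanes_logits, curvature = GroundAerialAttributes()(g, a)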

Year of completion: December 2022
Advisor: C V Jawahar

Related Publications


Downloads

thesis
