Deep-Learning Features, Graphs and Scene Understanding

Abhijeet Kumar

Abstract

Scene understanding has been a major aspiration of computer vision since its early days. Its root lies in enabling a computer, robot or machine to understand, interpret and manipulate visual data, much as an average human does when viewing a natural or artificial scene. Enabling machines in this way has a widespread impact in areas such as Surveillance, Aerial Imaging, Autonomous Navigation and Smart Cities, and scene understanding has therefore remained an active area of research over the last decade. In that time, the scope of problems in the scene understanding community has broadened from Image Annotation, Image Captioning and Image Segmentation to Object Detection, Dense Image Captioning, Instance Segmentation, etc. Advanced problems such as Autonomous Navigation, Panoptic Segmentation, Video Summarization and Multi-Person Tracking in Crowded Scenes have also surfaced in this arena and are being vigorously attempted. Deep learning has played a major role in this advancement. The performance metrics on some of these tasks have more than tripled in the last decade alone, but the tasks remain far from solved. The success originating from deep learning can be attributed to the learned features. In simple words, features learned by a Convolutional Neural Network trained for annotation are in general far better suited to captioning than those of a non-deep-learning method trained for captioning. Taking a cue from this trend, we delve into the domain of scene understanding with a focus on utilizing pre-learned features from other similar domains. We focus on two tasks in particular: Automatic (multi-label) Image Annotation and (Road) Intersection Recognition. Automatic image annotation is one of the earliest problems in scene understanding and refers to the task of assigning (multiple) labels to an image based on its content.
Intersection recognition, by contrast, is a product of the newer era of problems in scene understanding and refers to the task of identifying an intersection from varied viewpoints under varied weather and lighting conditions. We chose these two significantly different tasks to broaden the scope and generalizing capability of our results. Both image annotation and intersection recognition pose common challenges such as occlusion, perspective variations and distortions. Within the image annotation task, we further narrowed our domain by focusing on graph-based methods. Here we again chose two different paradigms, a multiple kernel learning based non-deep-learning approach and a deep-learning based approach, with the aim of bringing out the contrast between them. Through quantitative and qualitative results we show that both paradigms yield a slight boost in performance. The intersection recognition task is relatively new in the field. Most existing work focuses on Place Recognition, which utilizes only single images. We focus instead on temporal information, i.e. the traversal of the intersection or place as seen from a camera mounted on a vehicle. Through experiments we show a performance boost in intersection recognition from the inclusion of temporal information.

Year of completion:	January 2020
Advisor:	Avinash Sharma