Tackling Low Resolution for Better Scene Understanding
Complete scene understanding has been an aspiration of computer vision since its very early days. It has applications in autonomous navigation, aerial imaging, surveillance, human-computer interaction among several other active areas of research. While many methods since the advent of deep learninghave taken performance in several scene understanding tasks to respectable levels, the tasks are far from being solved. One problem that plagues scene understanding is low-resolution. Convolutional Neural Networks that achieve impressive results on high resolution struggle when confronted with low resolution because of the inability to learn hierarchical features and weakening of signal with depth. In this thesis, we study the low resolution and suggest approaches that can overcome its consequences on three popular tasks - object detection, in-the-wild face recognition, and semantic segmentation. The popular object detectors were designed for, trained, and benchmarked on datasets that have a strong bias towards medium and large sized objects. When these methods are finetuned and tested on a dataset of small objects, they perform miserably. The most successful detection algorithms follow a two-stage pipeline: the first which quickly generates regions of interest that are likely to contain the object and the second, which classifies these proposal regions. We aim to adapt both these stages for the case of small objects; the first by modifying anchor box generation based on theoretical considerations, and the second using a simple-yet-effective super-resolution step. Motivated by the success of being able to detect small objects, we study the problem of detecting and recognising objects with huge variations in resolution, in the problem of face recognition in semi-structured scenes. Semi-structured scenes like social settings are more challenging than regular ones: there are several more faces of vastly different scales, there are large variations in illumination, pose and expression, and the existing datasets do not capture these variations. We address the unique challenges in this setting by (i) benchmarking popular methods for the problem of face detection, and (ii) proposing a method based on resolution-specific networks to handle different scales. Semantic segmentation is a more challenging localisation task where the goal is to assign a semantic class label to every pixel in the image. Solving such a problem is crucial for self-driving cars where we need sharper boundaries for roads, obstacles and paraphernalia. For want of a higher receptive field and a more global view of the image, CNN networks forgo resolution. This results in poor segmentation of complex boundaries, small and thin objects. We propose prefixing a super-resolution step before semantic segmentation. Through experiments, we show that a performance boost can be obtained on the popular streetview segmentation dataset, CityScapes.
|Year of completion:
|Prof. C V Jawahar