Extending Visual Object Tracking for Long Time Horizons

Abhinav Moudgil

Abstract

Visual object tracking is a fundamental task in computer vision and a key component in a wide range of applications such as surveillance, autonomous navigation, video analysis and editing, and augmented reality. Given a target object specified by a bounding box in the first frame, the goal is to track that target in the subsequent frames. Although significant progress has been made in addressing challenges such as occlusion and scale change, we observe that tracking on a large number of short sequences, as done in previous benchmarks, does not clearly bring out the competence or potential of a tracking algorithm. Moreover, a tracking algorithm that works well on challenging short sequences but fails on moderately difficult long sequences is of limited practical importance, since many tracking applications rely on precise long-term tracking. Thus, in this thesis we systematically extend the problem of visual object tracking to long time horizons. First, we introduce a long-term visual object tracking benchmark. We propose a novel large-scale dataset specifically tailored for long-term tracking. Our dataset consists of high-resolution, densely annotated sequences encompassing a duration of over 400 minutes (676K frames), making it more than 20 times larger in average duration per sequence and more than 8 times larger in total covered duration than existing generic datasets for visual tracking. The proposed dataset paves the way to suitably assess long-term tracking performance and to train better deep learning architectures (avoiding or reducing augmentation, which may not reflect real-world behaviour). We also propose a novel metric for long-term tracking that captures the ability of a tracker to track consistently for a long duration. We benchmark 17 state-of-the-art trackers on our dataset and rank them according to several evaluation metrics and run-time speeds.
Next, we analyze the long-term tracking performance of state-of-the-art trackers in depth. We focus on three key aspects of long-term tracking: Re-detection, Recovery and Reliability. Specifically, we (a) test the re-detection capability of the trackers in the wild by simulating virtual cuts, (b) investigate the role of chance in the recovery of a tracker after failure and (c) propose a novel metric allowing visual inference about the contiguous and consistent aspects of tracking. We present several insights derived from an extensive set of quantitative and qualitative experiments.
Lastly, we present a novel fully convolutional, anchor-free Siamese framework for visual object tracking. Previous works used anchor-based region proposal networks to improve the performance of Siamese correlation-based trackers while maintaining real-time speed. However, we show that enumerating multiple boxes at each keypoint location in the search region is inefficient and unsuitable for single object tracking, where only one target object needs to be located. Thus, we take an alternative approach and directly regress box offsets and sizes for keypoint locations in the search region. The proposed approach, dubbed SiamReg, is fully convolutional, anchor-free and lighter in weight, and it improves target localization. We train our framework end-to-end with the Generalized IoU loss for bounding box regression and the cross-entropy loss for target classification. We perform several experiments on standard tracking benchmarks to demonstrate the effectiveness of our approach.
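For readers unfamiliar with the Generalized IoU (GIoU) loss mentioned above, the following is a minimal sketch of its standard definition for a single pair of axis-aligned boxes. It is not the SiamReg implementation: the function names `giou` and `giou_loss`, the `(x1, y1, x2, y2)` corner format and the assumption that both boxes have positive area are all illustrative choices, and how SiamReg decodes regressed offsets and sizes into boxes is not specified in this abstract.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes in (x1, y1, x2, y2) corner format.

    Assumes both boxes have positive width and height (illustrative
    assumption; no epsilon guard is included).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    iou = inter / union
    # GIoU subtracts the fraction of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


def giou_loss(pred_box, target_box):
    """GIoU loss: 0 for a perfect match, approaching 2 for distant boxes."""
    return 1.0 - giou(pred_box, target_box)
```

Unlike plain IoU, this quantity still varies when the predicted and target boxes do not overlap, which is why it is commonly used as a regression loss for bounding boxes.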

Year of completion:	September 2019
Advisor:	Vineet Gandhi