Efficient Annotation of Objects for Video Analysis

Swetha Sirnam


The field of computer vision is expanding rapidly and has significantly more processing power and memory available today than in previous decades. Video has become one of the most popular visual media for communication and entertainment, and automatically analyzing and understanding the content of a video is one of the long-standing goals of computer vision. One of the fundamental problems is to model the appearance and behavior of the objects in videos. Such models depend mainly on the problem definition: in many scenarios, a change in the problem statement brings a change in the required annotation and its complexity. Creating large-scale datasets by manual annotation in this setting is monotonous, time-consuming and non-scalable. Although the vision community has made great progress in problem solving, data generation and annotation remain a hard problem. Data annotation is expensive and tedious and demands a lot of human effort. Even after the data is annotated, the quality of the annotations must be validated, which is again a tiresome process. To address this challenge and work towards practical large-scale annotated video datasets, we investigate methods that autonomously learn and adapt object models using temporal information in videos. This involves learning robust representations of the video. The aim of this thesis is two-fold: first, we propose solutions for efficient and accurate object annotation in video sequences; second, we aim to raise awareness in the community of the importance of annotation and the attention it deserves.
As our first contribution, we propose an efficient, scalable and accurate object bounding box annotation method for large-scale complex video datasets. We focus on minimizing annotation effort while increasing the accuracy of annotation propagation, so as to obtain a precise and tight bounding box around the object of interest. Using a self-training approach, we combine a semi-automatic initialization method with an energy minimization framework to propagate the annotations. Segmenting the object with an energy minimization system yields accurate and tight bounding boxes around it. We have validated the results quantitatively and qualitatively on publicly available datasets. In the second half of the thesis, we propose an annotation scheme for human pose in video sequences. The proposed model is based on a fully automatic initialization, which can come from any generic state-of-the-art method. This initialization is prone to error because of the challenges inherent in video data. We exploit the redundant information that video provides across frames: the model is built on the assumption of temporal smoothness in videos. We formulate the problem as a sequence-to-sequence learning problem, in which a Long Short-Term Memory encoder-decoder architecture encodes the temporal context and annotates the pose. We show results on state-of-the-art datasets.
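The link between segmentation and tight bounding boxes can be illustrated with a minimal sketch. It assumes a binary foreground mask is already available (in the thesis this would come from the energy minimization step, which is not reproduced here) and simply takes the extent of the foreground pixels. The function name `tight_bounding_box` and the list-of-lists mask format are hypothetical choices made for illustration, not part of the thesis.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) as inclusive pixel indices; an assumed convention.
Box = Tuple[int, int, int, int]


def tight_bounding_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the smallest axis-aligned box enclosing all non-zero mask pixels.

    `mask` is a row-major grid where non-zero means "object". Returns None
    when the mask contains no foreground pixel.
    """
    # Rows that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column indices of every foreground pixel.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


if __name__ == "__main__":
    example_mask = [
        [0, 0, 0, 0, 0],
        [0, 0, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    print(tight_bounding_box(example_mask))  # (1, 1, 3, 2)
```

Because the box is derived from the mask rather than drawn by hand, any improvement in the segmentation directly tightens the box, which is the motivation stated above for segmenting before boxing.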

Year of completion: June 2018
Advisor: C V Jawahar, Vineeth Balasubramanian

Related Publications

  • Sirnam Swetha, Vineeth N Balasubramanian and C. V. Jawahar - Sequence-to-Sequence Learning for Human Pose Correction in Videos - 4th Asian Conference on Pattern Recognition (ACPR 2017), Nanjing, China, 2017.

  • Sirnam Swetha, Anand Mishra, Guruprasad M. Hegde and C. V. Jawahar - Efficient Object Annotation for Surveillance and Automotive Applications - Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshop (WACVW 2016), March 7-9, 2016.

  • Rajat Aggarwal, Sirnam Swetha, Anoop M. Namboodiri, Jayanthi Sivaswamy and C. V. Jawahar - Online Handwriting Recognition using Depth Sensors - Proceedings of the 13th IAPR International Conference on Document Analysis and Recognition, 23-26 Aug 2015, Nancy, France.