Efficient SVM based object classification and detection

Sreekanth Vempati (homepage)

Over the last few years, the amount of image and video data available on the internet and in personal collections has been increasing rapidly. As a result, the need to organize and search these vast collections efficiently has also grown. This has led to active research in content based retrieval from multimedia databases. A description of the content is built by recognizing the scenes and objects in the visual data. Although there has been a lot of research in these areas over the last few years, critical breakthroughs are still awaited. To be deployable, the solutions should be not only accurate, but also efficient and scalable.

There are two fundamental tasks in visual recognition, namely feature extraction and learning. The feature extraction stage builds a representation for the image or video data, and the learning stage learns a function that can distinguish between different classes of data. In this thesis, we focus on building efficient methods for visual content recognition and detection in images and videos. We mainly propose new ideas for the learning stage. We start from state-of-the-art techniques and then show how our proposed ideas affect computation time and performance.

First, we show the utility of state-of-the-art image representations and classification methods for large scale semantic video retrieval. We demonstrate this on the TRECVID 2008 and TRECVID 2009 datasets, which contain videos for the retrieval of various scenes, objects and actions. We use Support Vector Machines (SVMs) as classifiers; they have been a popular choice for classification tasks in many fields, mainly because of their good generalization capability. To obtain non-linear decision boundaries, SVMs use a kernel function. The kernel function allows a linear classifier to be found in a high dimensional feature space without actually computing the high dimensional vectors. In many situations, computationally expensive non-linear functions must be used as kernels. The linear kernel, on the other hand, is computationally inexpensive, but it gives poor results in most cases.
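
As a small illustration of how a kernel evaluates an inner product in a higher dimensional space without building the vectors, the sketch below compares a degree-2 polynomial kernel with its explicit feature map on 2-D inputs. The polynomial kernel and the function names (`poly2_kernel`, `poly2_feature_map`) are chosen only for illustration; they are assumptions and not the kernels used in the thesis.

```python
import math


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def poly2_feature_map(x):
    """Explicit map for 2-D inputs with phi(x) . phi(y) = (x . y)^2."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x, y = [1.0, 2.0], [3.0, -1.0]

# Kernel evaluated directly in the 2-D input space.
implicit = poly2_kernel(x, y)

# Same quantity via an explicit 3-D feature vector and a linear inner product.
explicit = dot(poly2_feature_map(x), poly2_feature_map(y))

print(implicit, explicit)  # the two values agree up to rounding
```

Here the explicit map is only three dimensional, so either route is cheap. For kernels such as the RBF family the corresponding feature space is infinite dimensional, which is why the kernel is normally evaluated directly.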

Another contribution of this thesis is a method for improving the performance of computationally inexpensive classifiers such as the linear SVM. For this purpose, we explore the utility of sub-categories, which are the sub-groupings present in the feature space of each semantic class. We model these sub-categories using the Structural SVM framework. We also analyze how the choice of groupings affects the results, and present a method to learn the optimal groupings. We evaluate our methods on various synthetic two dimensional datasets and on real world datasets, namely VOC 2007 and TRECVID 2008.
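
The intuition behind sub-categories can be sketched on a toy two dimensional problem in which the positive class forms two clusters on either side of the negatives. The sketch below is not the Structural SVM formulation of the thesis: it swaps in a plain perceptron, assumes the groupings are given by hand, and scores a point by the maximum over one linear classifier per sub-category. The data and all names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_perceptron(points, labels, max_epochs=1000):
    """Train a linear classifier (w, b) with the perceptron update rule."""
    w, b = [0.0] * len(points[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (dot(w, x) + b) <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: converged
            break
    return w, b


def score(model, x):
    w, b = model
    return dot(w, x) + b


def accuracy(predict, points, labels):
    return sum(predict(x) == y for x, y in zip(points, labels)) / len(points)


# Toy data: two positive sub-categories with the negatives in between.
group_a = [[-3.0, 0.0], [-3.0, 1.0], [-4.0, -1.0]]
group_b = [[3.0, 0.0], [4.0, 1.0], [3.0, -1.0]]
negatives = [[0.0, 0.0], [0.0, 1.0], [0.0, -1.0], [1.0, 0.0], [-1.0, 0.0]]
points = group_a + group_b + negatives
labels = [1] * 6 + [-1] * 5

# A single linear classifier cannot separate this data.
single = train_perceptron(points, labels)
single_acc = accuracy(lambda x: 1 if score(single, x) > 0 else -1, points, labels)

# One linear classifier per (hand-assigned) sub-category versus the negatives.
sub_models = [
    train_perceptron(g + negatives, [1] * len(g) + [-1] * len(negatives))
    for g in (group_a, group_b)
]


def predict_with_subcategories(x):
    """Positive if any sub-category classifier fires."""
    return 1 if max(score(m, x) for m in sub_models) > 0 else -1


sub_acc = accuracy(predict_with_subcategories, points, labels)
print(single_acc, sub_acc)
```

Each sub-category classifier remains linear, so prediction cost grows only with the number of sub-categories. In this sketch the groupings are fixed in advance, whereas the thesis also studies how to learn them.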

Non-linear kernel methods yield state-of-the-art performance for image classification and object detection. However, large scale problems require machine learning techniques of at most linear complexity, and these are usually limited to linear kernels. This unfortunately rules out gold-standard kernels such as the generalized RBF kernels (e.g. exponential-\chi^2). A non-linear kernel computes the inner product in a high dimensional space, different from the input space, without explicitly computing the high dimensional vectors. The function that maps an input to this high dimensional feature vector is called the feature map. However, the feature map is hard to compute and is very high dimensional. In the literature, explicit finite dimensional feature maps have been proposed to approximate the additive kernels (intersection, \chi^2) by linear ones, thus enabling the use of fast machine learning techniques in a non-linear context. An analogous technique was also proposed for the translation invariant RBF kernels. As a part of this thesis, we complete the construction and combine the two techniques to obtain explicit feature maps for the generalized RBF kernels. Furthermore, we investigate a learning method using l1 regularization to encourage sparsity in the final vector representation, and thus reduce its dimension. We evaluate this technique on the VOC 2007 detection challenge, showing when it can improve on fast additive kernels, and the trade-offs in complexity and accuracy.
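
To make the idea of an explicit finite dimensional feature map concrete, the sketch below approximates the ordinary Gaussian RBF kernel with random Fourier features, one well-known construction for translation invariant kernels. It is a minimal illustration under stated assumptions: the function names, the bandwidth `sigma`, the number of features and the test vectors are all hypothetical, and it covers only the Euclidean RBF case, not the generalized RBF maps constructed in the thesis.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(x, y, sigma):
    """Exact Gaussian RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_random_fourier_map(dim, n_features, sigma, seed=0):
    """Return phi with phi(x) . phi(y) approximating gaussian_kernel(x, y, sigma)."""
    rng = random.Random(seed)
    # Frequencies sampled from the kernel's Fourier transform, N(0, I / sigma^2).
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(n_features)]
    # Random phases, uniform on [0, 2*pi).
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(dot(freq, x) + phase)
                for freq, phase in zip(freqs, phases)]

    return phi


sigma = 1.0
phi = make_random_fourier_map(dim=3, n_features=4000, sigma=sigma)

x, y = [0.2, 0.5, 0.3], [0.4, 0.1, 0.5]
exact = gaussian_kernel(x, y, sigma)
approx = dot(phi(x), phi(y))  # a linear inner product in the mapped space
print(f"exact={exact:.4f} approx={approx:.4f}")
```

Once inputs are mapped through `phi`, a linear classifier trained on the mapped vectors behaves approximately like a kernel classifier, at a cost that is linear in the number of features. The generalized RBF kernels replace the Euclidean distance with a distance such as \chi^2, which is why the abstract describes combining this kind of map with the explicit maps for additive kernels.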

Year of completion: December 2010
Advisors: Dr. Andrew Zisserman (University of Oxford) and Dr. C. V. Jawahar

Related Publications

  • Sreekanth Vempati, Andrea Vedaldi, Andrew Zisserman and C. V. Jawahar - Generalized RBF Feature Maps for Efficient Detection, Proceedings of the British Machine Vision Conference (BMVC'10), 31 Aug. - 3 Sep. 2010, Aberystwyth, UK. [PDF]

  • Mihir Jain, Sreekanth Vempati, Chandrika Pulla and C. V. Jawahar - Example Based Video Filters, Proceedings of the 8th International Conference on Image and Video Retrieval (CIVR 2009), July 8-10, 2009, Santorini, Greece. [PDF]

  • Sreekanth Vempati, Mihir Jain, Omkar M. Parkhi, C. V. Jawahar, Andrea Vedaldi, Marcin Marszalek, Andrew Zisserman - Oxford/IIIT - TRECVID 2009 - Notebook paper, Proceedings of the TRECVID 2009 Workshop, Gaithersburg, Md., USA. [PDF]

  • James Philbin, Manuel Marin-Jimenez, Siddharth Srinivasan and Andrew Zisserman, Mihir Jain, Sreekanth Vempati, Pramod Sankar and C. V. Jawahar - Oxford/IIIT - TRECVID 2008 - Notebook paper, TRECVID 2008 Workshop, Gaithersburg, Md., USA. [PDF]