Kernel methods yield state-of-the-art performance in certain applications such as image classification and object detection. However, large-scale problems require machine learning techniques of at most linear complexity, and these are usually limited to linear kernels. This unfortunately rules out gold-standard kernels such as the generalized RBF kernels (e.g. the exponential-χ²). Recently, Maji and Berg [13] and Vedaldi and Zisserman [20] proposed explicit feature maps to approximate the additive kernels (intersection, χ², etc.) by linear ones, thus enabling the use of fast machine learning techniques in a non-linear context. An analogous technique was proposed by Rahimi and Recht [14] for the translation-invariant RBF kernels. In this paper, we complete the construction and combine the two techniques to obtain explicit feature maps for the generalized RBF kernels. Furthermore, we investigate a learning method using l1 regularization to encourage sparsity in the final vector representation, and thus reduce its dimension. We evaluate this technique on the VOC 2007 detection challenge, showing when it can improve on fast additive kernels, and the trade-offs in complexity and accuracy.
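The Rahimi–Recht construction referred to above can be sketched as follows. This is a minimal illustration for the plain Gaussian RBF kernel only (not the generalized kernels of the paper); the dimension D, bandwidth sigma, and toy vectors are our own illustrative choices:

```python
import numpy as np

# A minimal sketch of a random Fourier feature map (Rahimi & Recht) for the
# Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
# D and sigma are illustrative choices, not values from the paper.
def random_fourier_features(X, D=1024, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the Fourier transform of the Gaussian kernel.
    W = rng.normal(scale=1.0 / sigma, size=(d, D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    # Returns z(X) such that z(x) . z(y) approximates k(x, y).
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

x = np.array([[0.3, -0.1, 0.5]])
y = np.array([[0.2, 0.0, 0.4]])
z = random_fourier_features(np.vstack([x, y]))
approx = float(z[0] @ z[1])                                # linear dot product
exact = float(np.exp(-np.sum((x - y) ** 2) / 2.0))         # true kernel value
```

The point of the map is that a linear classifier trained on z(x) behaves like a kernel machine with k, so training stays linear in the number of examples.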
The Oxford-IIITH team participated in TRECVID 2010, a competition sponsored by the National Institute of Standards and Technology in which the task is content-based retrieval of video. "The main goal of the TREC Video Retrieval Evaluation (TRECVID) is to promote progress in content-based analysis of and retrieval from digital video via open, metrics-based evaluation." This year the team worked on the TRECVID 2010 Semantic Indexing Lite task, using state-of-the-art methods devised for the task of high-level feature extraction in videos. This task included both scene-level and object-level categories. TRECVID 2010 used a new source of training and testing data, characterized by a high degree of diversity in creator, content, style, production quality, original collection device/encoding, and language, as is common in videos found on the internet, available under the Creative Commons License from the Internet Archive. The Oxford-IIITH team also handled the annotation and ground-truthing of the training data, which was a challenging task owing to the sheer volume of the data: approximately 200 hours of Internet Archive videos.
This course will focus on the following topics from software practice, which are of great importance in developing software for the research world. The course will have both theory and hands-on sessions.
Many problems in Computer Vision, such as image segmentation, stereo matching, image restoration, panoramic stitching, and object recognition, involve inferring the maximum a posteriori (MAP) solution of a probability distribution defined by a discrete MRF or CRF. The MAP solution can be found by minimizing an energy or cost function. In the last few years, driven by its applicability, energy minimization has become a very active area of research in several areas of Computer Science (such as Computer Vision, Theory, and Machine Learning). Although minimizing a general MRF energy function is an NP-hard problem, there exist a number of powerful algorithms which compute the globally (or strongly locally) optimal solution for particular families of energy functions in polynomial time. The tutorial is aimed at researchers in Computer Vision and related areas who wish to use MAP estimation algorithms. Some of the material is based on tutorials given by M. Pawan Kumar and Pushmeet Kohli at ECCV 2008 and ICCV 2009. This tutorial will include the following:
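One concrete instance of the polynomial-time cases mentioned above is exact MAP inference on a chain-structured MRF, which reduces to dynamic programming (the Viterbi algorithm). The sketch below is our own illustration, not material from the tutorial; the unary and pairwise costs are toy values:

```python
import numpy as np

# Exact MAP inference on a chain MRF by dynamic programming (Viterbi).
# unary: (n, L) per-node label costs; pairwise: (L, L) cost between
# neighbouring labels. Returns the labelling minimizing the total energy.
def chain_map(unary, pairwise):
    n, L = unary.shape
    cost = unary[0].copy()
    back = np.zeros((n, L), dtype=int)
    for i in range(1, n):
        # total[j, k] = best energy ending at node i with label k via label j
        total = cost[:, None] + pairwise + unary[i][None, :]
        back[i] = np.argmin(total, axis=0)
        cost = np.min(total, axis=0)
    labels = [int(np.argmin(cost))]
    for i in range(n - 1, 0, -1):
        labels.append(int(back[i, labels[-1]]))
    return labels[::-1]

# Three nodes, two labels, Potts-style smoothness cost (toy example).
unary = np.array([[0.0, 2.0], [0.5, 1.5], [2.0, 0.0]])
pairwise = np.array([[0.0, 1.0], [1.0, 0.0]])
print(chain_map(unary, pairwise))  # → [0, 0, 1]
```

Tree-structured models admit the same trick; grid MRFs with submodular pairwise terms are instead handled exactly by graph cuts.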
The Oxford/IIIT team is participating in the high-level feature extraction task of TRECVID 2009. The TRECVID challenge is currently in progress and will conclude by 10th August 2009. State-of-the-art techniques in feature extraction, classification methods, and classifiers will be integrated to provide a vision-only approach to video retrieval. TRECVID has become increasingly challenging due to the constant increase in training and testing data size. Training is done using 220 videos totalling 100 hours, while the test dataset has more than 620 videos spanning 280 hours. Cleaning the data, annotation, training the classifiers, and testing on such large datasets is a real challenge. This year's TRECVID high-level feature extraction task consists of finding and ranking shots from the videos that contain certain objects (such as boat and bus) as well as human actions (such as people eating or dancing).
The goal of this work is to align movie videos with their scripts. Videos aligned with scripts allow for i) automatic recognition of faces, expressions, locations, human actions, etc., ii) better organisation and browsing of videos, and iii) text-based search and retrieval. Previous approaches to this problem used timing information from subtitles for alignment (Everingham et al., BMVC 06). We explore the situation where subtitles are not available, e.g. in the case of silent films or film passages which are non-verbal. Achieving a subtitle-free alignment considerably increases the scope of these applications to more diverse video material. We combine multiple cues, both visual and audio, to align the script with the video. The problem is posed as that of assigning each sentence of the script to the correct shot of the video. Each sentence has certain properties, such as the location in which it is spoken, the speaker of the sentence, and the spoken text. Similar properties can be extracted from the shots using location, face, and speech recognition. Using these cues, the sentences can be matched with the shots. Each cue alone is insufficient for reliable alignment, but when combined, they provide sufficiently strong constraints for a full alignment. The approach was demonstrated on episodes of the TV show Seinfeld, on scenes from Charlie Chaplin's silent movies, and on Indian films.
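The sentence-to-shot assignment described above can be illustrated with a simple dynamic program: given a fused per-(sentence, shot) similarity score, find the best assignment under the natural constraint that shot order follows script order. This is our own illustrative sketch under those assumptions, not the paper's algorithm, and the score matrix is toy data standing in for the combined location/face/speech cues:

```python
import numpy as np

# Assign each script sentence to a shot, maximizing total fused cue score
# subject to a monotone-ordering constraint (shot indices non-decreasing).
def align(score):
    """score[i, j]: combined cue similarity of sentence i and shot j."""
    n, m = score.shape
    best = np.full((n, m), -np.inf)   # best[i, j]: best total up to sentence i in shot j
    back = np.zeros((n, m), dtype=int)
    best[0] = score[0]
    for i in range(1, n):
        # Best previous-sentence shot among indices 0..j (monotonicity).
        run = np.maximum.accumulate(best[i - 1])
        arg = np.zeros(m, dtype=int)
        for j in range(1, m):
            arg[j] = j if best[i - 1, j] > best[i - 1, arg[j - 1]] else arg[j - 1]
        best[i] = run + score[i]
        back[i] = arg
    shots = [int(np.argmax(best[-1]))]
    for i in range(n - 1, 0, -1):
        shots.append(int(back[i, shots[-1]]))
    return shots[::-1]

# Toy fused scores: three sentences, three shots.
score = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.8, 0.1],
                  [0.0, 0.3, 0.7]])
print(align(score))  # → [0, 1, 2]
```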
The Oxford/IIIT team participated in the high-level feature extraction and interactive search tasks. A vision-only approach was used for both tasks, with no use of the text or audio information. For the high-level feature extraction task, we used two different approaches, both based on a combination of visual features. One used an SVM classifier with a linear combination of kernels; the other used a random forest classifier. For both methods, we trained all high-level features using publicly available annotations. The advantage of the random forest classifier is its speed of training and testing. In addition, for the people feature, we took a more targeted approach: we used a real-time face detector and an upper body detector, in both cases running on every frame. One run, C OXVGG 4 4, was submitted using only the random forest approach for all concepts except "two people". It performed best on "dog" and "hand", and came above the median for all classes except "kitchen" and "airplane". In the interactive search task, our team came third overall with an mAP of 0.158. The system used was identical to last year's, the only change being a source of accurate upper body detections.