
Image Annotation


 Motivation

In many real-life scenarios, an object can be categorized into multiple categories. E.g., a newspaper column can be tagged as "political", "election", "democracy"; an image may contain "tiger", "grass", "river"; and so on. These are instances of multi-label classification, which deals with the task of associating multiple labels with a single data instance. It is a difficult problem because one needs to consider the intricate correlations that exist among different labels.

Automatic image annotation is a multi-label classification problem that aims at associating a set of textual labels with an image that describe its semantics. It has potential applications in image retrieval, image description, etc. The recent outburst of multimedia content on the Internet and in personal collections has raised the demand for automatic annotation methods, which has made this an active area of research.


A Modified KNN for Image Annotation [1]

[Example images and their annotation label sets:]

  • {bear, reflection, water, black, river}
  • {field, horses, mare, foals, tree}
  • {green, phone, woman, hair, suit}
  • {fight, grass, game, anime, man}
  • {building, base, horse, statue, man}
  • {fence, mountain, range, airplane, sky}

 

  • For a given image, the labels are usually predicted from an annotation vocabulary of a few hundred labels. Because of the large vocabulary, there is high variance in label frequency ("class-imbalance"). Moreover, due to limitations of manual annotation, a significant number of available images are not annotated with all the relevant labels ("weak-labelling"). These two issues affect the performance of many existing image annotation models.
  • In this work, we proposed 2PKNN, a two-step variant of the classical K-nearest neighbour algorithm, that tries to address these two issues. We also proposed a metric learning framework over 2PKNN for learning better distances.
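The sketch below illustrates the two-phase idea at a high level. It is not the authors' code: the feature representation, the exponential weighting and the parameter names are illustrative assumptions, and the metric-learning step of the paper is omitted.

```python
import numpy as np

def two_phase_knn(x_test, X_train, Y_train, K1=5, n_predict=5):
    """Minimal sketch of a two-phase KNN label transfer (not the paper's exact model).

    X_train : (N, D) training feature matrix
    Y_train : (N, L) binary label matrix (Y_train[i, l] = 1 if image i has label l)
    """
    dists = np.linalg.norm(X_train - x_test, axis=1)

    # Phase 1: for every label, keep the K1 closest training images that carry
    # that label; this balances rare and frequent labels (class-imbalance).
    neighbourhood = set()
    for l in range(Y_train.shape[1]):
        idx_l = np.flatnonzero(Y_train[:, l])
        if idx_l.size == 0:
            continue
        nearest = idx_l[np.argsort(dists[idx_l])[:K1]]
        neighbourhood.update(nearest.tolist())
    neighbourhood = np.array(sorted(neighbourhood))

    # Phase 2: weighted label transfer from the semantic neighbourhood.
    weights = np.exp(-dists[neighbourhood])        # closer images vote more
    scores = weights @ Y_train[neighbourhood]      # (L,) score per label
    return np.argsort(-scores)[:n_predict]         # indices of the predicted labels
```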

Generating Image Description [2]

Problem

[Figure: overview of the image description generation problem]

Results

Example images and descriptions generated by our method:

  • A black ferrari is parked in front of a green tree.
  • An adult hound is laying on an orange couch.
  • A blond woman is posing with an elvis impersonator.
  • A small sailboat is passing near a yellow buoy.
  • A sporty car is parked on a concrete driveway.
  • A sweet cat is curling on a pink blanket.
  • An orange fixture is hanging in a messy kitchen.
  • An ocean boat is travelling in a narrow water.

 

  • In this work, we proposed a method to describe an image in a sentence.
  • It is based on annotating an image with linguistically motivated phrases.
  • These phrases are then combined to generate the image description.

Related Publications

  • Yashaswi Verma and C V Jawahar - Image Annotation using Metric Learning in Semantic Neighbourhoods, Proceedings of the 12th European Conference on Computer Vision, 7-13 Oct. 2012, Print ISBN 978-3-642-33711-6, Vol. ECCV 2012, Part-III, LNCS 7574, pp. 114-128, Firenze, Italy. [PDF]

  • Ankush Gupta, Yashaswi Verma and C. V. Jawahar - Choosing Linguistics Over Vision to Describe Images, In AAAI, 2012. [paper] [presentation] [poster]

People

Yashaswi Verma
C. V. Jawahar

Decomposing Bag of Words Histograms


Abstract


We aim to decompose a global histogram representation of an image into histograms of its associated objects and regions. This task is formulated as an optimization problem, given a set of linear classifiers, which can effectively discriminate the object categories present in the image. Our decomposition bypasses harder problems associated with accurately localizing and segmenting objects. We evaluate our method on a wide variety of composite histograms, and also compare it with MRF-based solutions. In addition to merely measuring the accuracy of decomposition, we also show the utility of the estimated object and background histograms for the task of image classification on the PASCAL VOC 2007 dataset.
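As a rough illustration of the kind of optimization involved, the toy sketch below splits each bin of a global histogram among a set of given linear classifiers so that the total classifier response is maximized. The paper's actual formulation is richer; this function is only meant to convey the flavour, and all names are assumptions.

```python
import numpy as np

def decompose_histogram(h, W):
    """Toy sketch of decomposing a global BoW histogram (not the paper's exact model).

    h : (D,) global histogram of the image
    W : (K, D) weight vectors of K linear classifiers (one per expected component)

    We maximize sum_k w_k . h_k subject to sum_k h_k = h and h_k >= 0. Because the
    bins decouple under this simplified objective, the optimum assigns each bin
    entirely to the classifier with the largest weight on that bin.
    """
    K, D = W.shape
    assignment = np.argmax(W, axis=0)       # best component for each visual word
    H = np.zeros((K, D))
    H[assignment, np.arange(D)] = h         # give each bin's mass to that component
    return H                                # H[k] is the estimated histogram of component k
```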


People

Ankit Gandhi

Karteek Alahari

C. V. Jawahar


Motivation

An SVM classifier is often trained to recognize only a single class category. When multiple objects (or uncorrelated noise) are present in an image, the performance deteriorates. To better understand this issue, let us consider a split of the PASCAL VOC 2007 test data into images containing a single class category (PASCAL-S) and multiple class categories (PASCAL-M). In this setting, the average precision (AP) of the BoW-trained SVM classifier for the category “cat” is 0.589 on PASCAL-S, but only 0.189 on PASCAL-M. It has also been observed that BoW histograms of single isolated objects are relatively easy to classify. For example, accuracy as high as 77.78% is reported on the Caltech 101 dataset, while more complex images, which contain multiple objects and natural clutter, are harder to work with (e.g. 62.8% is still the best score on PASCAL VOC 2007). An important reason for this deterioration in performance is that a classifier trained on single objects often fails to recognize the object when the global image representation (BoW) is “corrupted” by additional objects and clutter present in the image. A question of interest to us now is the following: is it possible to filter out the clutter and classify only the signal?


Paper

  • Ankit Gandhi, Karteek Alahari and C V Jawahar - Decomposing Bag of Words Histograms, Proceedings of the International Conference on Computer Vision, 1-8 Dec. 2013, Sydney, Australia. [PDF]

[poster] [bibtex]


Datasets

  • PASCAL VOC 2007 [Click here]
  • Flickr Multiple Object Dataset [Readme] [Download] [example images]
  • Composed Caltech [Readme]

Code

TBA


Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.

Action Recognition using Canonical Correlation Kernels


Introduction

Action recognition has gained significant attention from the computer vision community in recent years. This is a challenging problem, mainly due to the presence of significant camera motion, viewpoint transitions, varying illumination conditions and cluttered backgrounds in the videos. A wide spectrum of features and representations has been used for action recognition in the past. Recent advances in action recognition are propelled by (i) the use of local as well as global features, which have significantly helped in object and scene recognition, by computing them over 2D frames or over the 3D video volume, and (ii) the use of factorization techniques over video volume tensors and defining similarity measures over the resulting lower dimensional factors. In this project, we try to take advantage of both of these approaches by defining a canonical correlation kernel that is computed from a tensor representation of the videos. This also enables seamless feature fusion by combining multiple feature kernels.

 

[Example video frames, e.g. the 'volleyball spiking' action from the YouTube dataset]


Canonical Correlation Kernel (CCK)

We represent a video as a 3D tensor, which is then flattened into three 2D matrices, one along each mode. The CCK between two videos, which is based on canonical correlation analysis (CCA) and its kernelized version (KCCA), is defined as the sum of the correlations between the corresponding flattened matrices, computed using both CCA and KCCA. An overview of the CCK computation is given below.
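A minimal numpy sketch of the linear (CCA) part of this construction is given below; the kernelized (KCCA) correlations, which the full CCK also includes, are omitted. It assumes the two videos have been resampled to a common size so that corresponding flattenings are comparable, and the number of retained basis directions p is an illustrative parameter, not a value from the paper.

```python
import numpy as np

def flatten_modes(video):
    """Flatten a (T, H, W) video tensor along each of its three modes."""
    T, H, W = video.shape
    return [video.reshape(T, H * W),                     # mode-1: frames as rows
            video.transpose(1, 0, 2).reshape(H, T * W),  # mode-2
            video.transpose(2, 0, 1).reshape(W, T * H)]  # mode-3

def canonical_correlations(A, B, p=5):
    """Sum of the top-p canonical correlations between the row spaces of A and B."""
    Ua = np.linalg.svd(A.T, full_matrices=False)[0][:, :p]   # dominant p-dim subspace of A's rows
    Ub = np.linalg.svd(B.T, full_matrices=False)[0][:, :p]
    s = np.linalg.svd(Ua.T @ Ub, compute_uv=False)           # cosines of principal angles
    return s.sum()

def cck_linear(video1, video2, p=5):
    """Linear (CCA) part of the CCK: summed correlations over the three flattenings."""
    return sum(canonical_correlations(A, B, p)
               for A, B in zip(flatten_modes(video1), flatten_modes(video2)))
```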

[Flow chart: overview of the CCK computation]

 


Results

We tested CCK on four popular action recognition datasets: Cambridge, UCF, KTH and YouTube. CCK kernels are computed over intensity values, HOG, HOF, SIFT and MBH features. For combining different CCK feature kernels, we use a simple weighting scheme. All these kernels are used with an SVM classifier in a one-vs-rest setting.
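The sketch below shows one way such a weighted combination of precomputed kernels could be plugged into a one-vs-rest SVM using scikit-learn, where each Gram matrix would hold pairwise CCK values for one feature channel. The weights and variable names are placeholders, not the values used in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

def combine_kernels(kernels, weights):
    """Weighted sum of precomputed Gram matrices, one per feature channel."""
    return sum(w * K for w, K in zip(weights, kernels))

def train_and_predict(kernels_train, kernels_test, y_train, weights):
    """kernels_train: list of (N, N) train Gram matrices;
    kernels_test: list of (M, N) test-vs-train Gram matrices.

    The weights are illustrative placeholders; in practice they would be tuned.
    """
    K_train = combine_kernels(kernels_train, weights)
    K_test = combine_kernels(kernels_test, weights)
    clf = OneVsRestClassifier(SVC(kernel='precomputed', C=1.0))
    clf.fit(K_train, y_train)                 # one binary SVM per action class
    return clf.predict(K_test)
```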

 

Feature     Cambridge        UCF              KTH              Youtube
            CCK     DT       CCK     DT       CCK     DT       CCK     DT
Pixels      93.1    -        93.5    -        97.5    -        82.5    -
HOG         89.0    -        83.8    83.8     98.3    86.5     83.2    74.5
SIFT        95.1    -        85.7    -        98.6    -        79.1    -
HOF         95.2    -        81.5    77.6     94.3    93.2     80.4    72.8
MBH         75.1    -        80.4    84.8     98.9    95.0     80.1    83.9
Combined    96.4    -        93.5    88.2     98.9    94.2     86.3    84.2

Comparison of CCK with DT (Dense trajectories) over different features.

 

Method                                             Cambridge   UCF     KTH     Youtube
TCCA [CVPR 2007]                                   82          -       95.3    -
Product Manifold [CVPR 2010]                       88          -       97      -
Tangent Bundle [FG 2011]                           91          88      97      -
Dense trajectories [CVPR 2011]                     -           88.2    94.2    84.2
Le et al. [CVPR 2011]                              -           86.5    93.9    75.8
Ikizler-Cinbis et al. [ECCV 2010]                  -           -       -       75.2
Jiang Wang et al. [CVPR 2011]                      -           -       93.8    -
Proposed (using pixel values)                      93.1        93.5    97.5    82.5
Proposed (using multiple features)                 96.4        93.5    98.9    86.3
Proposed (CCK feature kernels + DT feature kernels) 97.2       93.5    98.9    86.6

Comparison of CCK with other methods.


Related Publications

  • G Nagendar, Sai Ganesh, Mahesh Goud, C V Jawahar - Action Recognition using Canonical Correlation Kernels, Proceedings of the 11th Asian Conference on Computer Vision, 5-9 Nov. 2012, Daejeon, Korea. [PDF]

 

Motion Trajectory Based Video Retrieval


Introduction

 

[Figure: a sample video from the UCF Sports Action dataset (left) and an online user sketch that can be used to retrieve it (right). The content of the video is difficult to express in words and will vary from person to person.]

Content based video retrieval is an active area in Computer Vision. The most common types of retrieval strategies are query by text (Youtube, Vimeo, etc.) and query by example video or image (Video Google). When the query is in the form of text, most current systems search the tags and metadata associated with the video. A problem with such an approach is that the tags or metadata need not reflect the real content of the video and can be misleading. Moreover, the queries are often abstract and lengthy, for example, "a particular diving style in swimming where the swimmer does three somersaults before diving". The other method, i.e. the query-by-example paradigm, is limited by the absence of an example at hand at the right time.

Instead, certain videos can be identified from unique motion trajectories. Consider, for example, the query mentioned above, or queries like "the first strike in carrom where three or more carrom men or disks go to pockets" or "all red cars which came straight from North and then took a left turn". Clearly, queries like these describe the actual content of the video, which is unlikely to be found in the tags and metadata. Under these circumstances, queries can be framed using sketches. A basic sketch containing the object and the motion patterns of the object should suffice to describe the actual content of the video. Sketches can be offline (images) or online (temporal data collected using a tablet).

In this project we are trying to build a sketch-based video retrieval system using online sketches as queries.


Challenges

Although sketch-based search appears to be a very intuitive way to depict the content of a video, it suffers from perceptual variability. In simple words, multiple users perceive the same motion in different ways. The variability is in terms of spatio-temporal properties like shape, direction, scale and speed. The following figures illustrate the problem precisely.

[Figure: a sample sinusoidal motion (original video) and three different user interpretations of it]

If we try to match these different representations in 2D Euclidean space, they won't match, because quantitatively they are different even though they are qualitatively similar. So we have to project these sketches into a space where they are mapped similarly.

Another set of challenges is involved in extracting robust trajectories from unconstrained videos. Real-world videos taken using handheld or mobile cameras contain camera motion and blur. Dynamic backgrounds, illuminated surfaces, and non-separable foreground and background are also very common, and fast moving objects pose serious challenges to tracking. Tracking in unconstrained videos is a very active problem in Computer Vision.


Contribution

We have defined and extracted features from the user sketches and videos which give us a qualitative understanding of the trajectories. We extract four different types of features, capturing the approximate shape, order and direction of the trajectories, and then combine them using a multilevel retrieval strategy. Our multilevel retrieval strategy gives us a three-fold advantage.

  1. Firstly, it lets us combine the effect of multiple feature vectors of the same trajectory, each of which captures a different aspect of the motion.
  2. Secondly, temporal data suffers from the problem of unequal-length feature vectors. Our method handles this issue gracefully.
  3. Thirdly, our filters are arranged in increasing order of complexity and hence, like a cascade, they reduce the search space at each stage.

Our retrieval algorithm works as follows: there are four different representations for the query sketch and the video trajectories; they are compared and matched by four different filters, and the updated score is used for the final outcome.
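Below is a minimal sketch of how such a cascade of increasingly complex filters could be organised. It is illustrative only: it assumes each sketch or trajectory has already been converted into its per-level qualitative features, the similarity functions are passed in as black boxes, and the pruning fraction is an arbitrary placeholder rather than a value from the paper.

```python
import numpy as np

def cascade_retrieve(query_feats, db_feats, similarity_fns, keep_fraction=0.5):
    """Illustrative cascaded, multi-feature retrieval (not the paper's code).

    query_feats    : list of feature representations of the query sketch, cheapest first
    db_feats       : list over videos; db_feats[i] is the list of the same features for video i
    similarity_fns : one similarity function per feature level
    """
    candidates = list(range(len(db_feats)))
    scores = np.zeros(len(db_feats))

    for level, (q, sim) in enumerate(zip(query_feats, similarity_fns)):
        level_scores = np.array([sim(q, db_feats[i][level]) for i in candidates])
        scores[candidates] += level_scores                   # accumulate evidence per video
        keep = max(1, int(len(candidates) * keep_fraction))  # prune the search space
        order = np.argsort(-level_scores)[:keep]
        candidates = [candidates[j] for j in order]

    return sorted(candidates, key=lambda i: -scores[i])      # final ranking by total score
```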

Currently, we are focusing on extracting robust trajectories from unconstrained videos by improving some of the state-of-the-art algorithms.

For a detailed description of the features used, please refer to our publication:
Koustav Ghosal and Anoop M. Namboodiri - A Sketch-Based Approach to Video Retrieval using Qualitative Features, Proceedings of the Ninth Indian Conference on Computer Vision, Graphics and Image Processing, 14-17 Dec 2014, Bangalore, India.


Dataset

To validate our algorithm, we developed a dataset containing:

  1. A set of 100 pool videos, extracted from multiple International Pool Championship matches. All of them are top-view HD videos.
  2. A set of 100 Synthetic Videos. Each synthetic video represents a particular type of motion seen in real world scenarios.

Our datasets can be downloaded from the following links:

  • IIIT Motion Dataset(Pool Videos)
  • IIIT Motion Dataset(Synthetic Videos)

Please contact us by email for the code.


Results

[Figures: an example query sketch and the corresponding retrieval results on the Pool dataset]


Related Publications

  • Koustav Ghosal, Anoop M. Namboodiri - A Sketch-Based Approach to Video Retrieval using Qualitative Features, Proceedings of the Ninth Indian Conference on Computer Vision, Graphics and Image Processing, 14-17 Dec 2014, Bangalore, India. [PDF]


Associated People

  • Koustav Ghosal
  • Anoop M. Namboodiri
Relative Parts: Distinctive Parts for Learning Relative Attributes
 

Abstract

The notion of relative attributes as introduced by Parikh and Grauman (ICCV, 2011) provides an appealing way of comparing two images based on their visual properties (or attributes) such as “smiling” for face images, “naturalness” for outdoor images, etc. For learning such attributes, a Ranking SVM based formulation was proposed that uses globally represented pairs of annotated images. In this paper, we extend this idea towards learning relative attributes using local parts that are shared across categories. First, instead of using a global representation, we introduce a part-based representation combining a pair of images that specifically compares corresponding parts. Then, with each part we associate a locally adaptive “significance coefficient” that represents its discriminative ability with respect to a particular attribute. For each attribute, the significance coefficients are learned simultaneously with a max-margin ranking model in an iterative manner. Compared to the baseline method, the new method is shown to achieve significant improvements in relative attribute prediction accuracy. Additionally, it is also shown to improve relative feedback based interactive image search.
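As a rough illustration of the two ingredients described above, the sketch below shows a significance-weighted, part-based scoring function and a max-margin update of the ranking direction with the coefficients held fixed (approximated here by a binary SVM trained on pairwise difference vectors). All shapes and names are assumptions, a single ranking direction is shared across parts for simplicity, and the paper's actual alternating optimisation, in particular the update of the significance coefficients themselves, is not reproduced.

```python
import numpy as np
from sklearn.svm import LinearSVC

def part_score(parts, w, alpha):
    """Relative-attribute score of one image: significance-weighted sum of per-part responses.

    parts : (P, D) features of the P localized parts
    w     : (D,)  ranking direction (shared across parts in this simplified sketch)
    alpha : (P,)  per-part significance coefficients
    """
    return float(alpha @ (parts @ w))

def fit_ranker(pairs, alpha):
    """Learn w with alpha held fixed (one step of an alternating scheme).

    pairs : list of (parts_strong, parts_weak), where the first image shows the
            attribute more strongly than the second; each is a (P, D) array.
    RankSVM is approximated by a binary SVM on weighted part-difference vectors.
    """
    diffs = [alpha @ (ps - pw) for ps, pw in pairs]   # one (D,) difference per pair
    X = np.vstack(diffs + [-d for d in diffs])        # symmetrize the training set
    y = np.hstack([np.ones(len(diffs)), -np.ones(len(diffs))])
    svm = LinearSVC(C=1.0, fit_intercept=False).fit(X, y)
    return svm.coef_.ravel()                          # updated ranking direction w
```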


Contributions

  • Extend the idea of relative attributes (Parikh and Grauman) to localized parts.
  • With each part, we associate a locally adaptive “significance coefficient”; these are learned simultaneously with a max-margin ranking model in an iterative manner.
  • Our method gives significant improvement (more than 10% on absolute scale) compared to the baseline method.
  • Introduce a new LFW-10 data set that has 10,000 pairs with instance-level annotations for 10 attributes.
  • Demonstrate application to interactive image search.

Method

[Figure: overview of the proposed method]

Dataset

We randomly select 2000 images from the LFW dataset. Out of these, 1000 images are used for creating training pairs and the remaining (unseen) 1000 for testing pairs. The annotations are collected for 10 attributes, with 500 training and 500 testing pairs per attribute. In order to minimize inconsistency in the dataset, each image pair is annotated by 5 trained annotators, and the final annotation is decided by majority voting.

LFW10

Code

Relative Parts: Distinctive Parts for Learning Relative Attributes (CVPR, 2014) 


Publication

Relative Parts: Distinctive Parts for Learning Relative Attributes, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014


Results

[Figure: learned parts]

Top 10 parts learned using our method with maximum weights for each of the ten attributes in the LFW-10 dataset. The greater the intensity of red, the more important the part, and vice versa.

[Figure: attribute prediction accuracies]

Performance for each of the ten attributes in the LFW-10 dataset using different methods and representations.


People

  • Sandeep
  • Yashaswi Verma
  • C. V. Jawahar

Acknowledgement

Yashaswi Verma is partly supported by MSR India PhD Fellowship 2013.



 
 
 