CVPR 2016 Paper - First Person Action Recognition Using Deep Learned Descriptors


Abstract

We focus on the problem of wearer's action recognition in first person, a.k.a. egocentric, videos. This problem is more challenging than third person activity recognition due to the unavailability of the wearer's pose and the sharp movements in the videos caused by the natural head motion of the wearer. Carefully crafted features based on hand and object cues have been shown to be successful on limited, targeted datasets. We propose convolutional neural networks (CNNs) for end-to-end learning and classification of the wearer's actions. The proposed network makes use of egocentric cues by capturing hand pose, head motion and a saliency map. It is compact and can be trained from the relatively small labeled egocentric video datasets that are available. We show that the proposed network generalizes and gives state-of-the-art performance on several disparate egocentric action datasets.

Method

[Figure: 2D and 3D CNN architectures]

[Figure: Fusion architecture]
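
The sketch below is a minimal, hedged illustration of the fusion idea described in the abstract, not the authors' released Caffe model: one small stream for appearance cues (RGB plus a saliency/hand map) and one for motion (stacked optical flow capturing head motion), with their features concatenated before classification. Stream depths, channel counts and the fusion layer size are assumptions for illustration.

# Minimal two-stream fusion sketch in PyTorch (assumed architecture details).
import torch
import torch.nn as nn

def small_stream(in_channels):
    # A small convolutional trunk; the actual networks are deeper.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=5, stride=2, padding=2),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),   # -> (N, 64, 1, 1)
        nn.Flatten(),              # -> (N, 64)
    )

class FusionActionNet(nn.Module):
    def __init__(self, num_actions, flow_stack=10):
        super().__init__()
        self.appearance = small_stream(in_channels=4)            # RGB + saliency/hand map
        self.motion = small_stream(in_channels=2 * flow_stack)   # stacked flow (x, y)
        self.classifier = nn.Linear(64 + 64, num_actions)

    def forward(self, rgb_saliency, flow_stack):
        fused = torch.cat([self.appearance(rgb_saliency),
                           self.motion(flow_stack)], dim=1)
        return self.classifier(fused)

net = FusionActionNet(num_actions=10)
logits = net(torch.randn(2, 4, 112, 112), torch.randn(2, 20, 112, 112))
print(logits.shape)  # torch.Size([2, 10])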

Paper

PDF

Downloads

Code

  • Ego ConvNet
  • TDD
  • Caffe

Datasets and annotations

  • GTEA
    • Video Frames (10GB)
    • Labels [mirror]
  • Kitchen [Labels] *
    • S07_Brownie_Video [Annotation]
    • S09_Brownie_Video [Annotation]
    • S12_Brownie_Video [Annotation]
    • S13_Brownie_Video [Annotation]
    • S14_Brownie_Video [Annotation]
    • S16_Brownie_Video [Annotation]
    • S17_Brownie_Video [Annotation]
  • ADL *
    • P_04 [Annotation]
    • P_05 [Annotation]
    • P_06 [Annotation]
    • P_09 [Annotation]
    • P_11 [Annotation]
  • UTE *
    • P01 [Annotation]
    • P03 [Annotation]
    • P032 [Annotation]
* Note: All videos are processed at 15 fps.
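
Since the note above states that all videos are processed at 15 fps, the following hedged OpenCV sketch shows one way to resample an arbitrary-frame-rate video to roughly 15 frames per second by time-based sampling; the file name is a placeholder.

# Hedged frame-sampling sketch (not part of the released code).
import cv2

def sample_frames(path, target_fps=15.0):
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = src_fps / target_fps          # source frames per kept frame
    frames, next_pick, idx = [], 0.0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx >= next_pick:
            frames.append(frame)
            next_pick += step
        idx += 1
    cap.release()
    return frames

frames = sample_frames("S07_Brownie_Video.avi")  # placeholder file name
print(len(frames), "frames sampled")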

 


 
 

PR 2016 Paper - Trajectory Aligned Features For First Person Action Recognition

[Figure: Trajectory aligned features]

Abstract

Egocentric videos are characterised by their first person view. With the popularity of Google Glass and GoPro, the use of egocentric videos is on the rise. Recognizing the action of the wearer from egocentric videos is an important problem. Unstructured movement of the camera, due to the natural head motion of the wearer, causes sharp changes in the visual field of the egocentric camera, and many standard third person action recognition techniques therefore perform poorly on such videos. Objects present in the scene and hand gestures of the wearer are the most important cues for first person action recognition, but they are difficult to segment and recognize in an egocentric video. We propose a novel representation of first person actions derived from feature trajectories. The features are simple to compute using standard point tracking and do not assume segmentation of hands/objects or recognition of object or hand pose, unlike many previous approaches. We train a bag-of-words classifier with the proposed features and report a performance improvement of more than 11% on publicly available datasets. Although not designed for the particular case, we show that our technique can also recognize the wearer's actions when hands or objects are not visible.
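
As a hedged sketch of the bag-of-words classification step mentioned above (not the released code), the snippet below builds a k-means vocabulary over local trajectory descriptors, represents each video as a word histogram, and trains a linear classifier. The descriptor dimensionality, vocabulary size and the placeholder data are assumptions.

# Bag-of-words over trajectory descriptors: illustrative sketch only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Placeholder data: 40 videos, each with ~200 trajectory descriptors of dim 30.
videos = [rng.normal(size=(200, 30)) for _ in range(40)]
labels = rng.integers(0, 4, size=40)          # 4 placeholder action classes

vocab = KMeans(n_clusters=64, n_init=10, random_state=0).fit(np.vstack(videos))

def bow_histogram(descs, vocab):
    words = vocab.predict(descs)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()

X = np.array([bow_histogram(v, vocab) for v in videos])
clf = LinearSVC().fit(X, labels)
print("train accuracy:", clf.score(X, labels))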

Paper

PDF

Downloads

Code

  • Trajectory Aligned Features

Dataset and Annotations

  • Extreme Sports dataset

 


 
 

NCVPRIPG 2015 Paper - Generic Action Recognition from Egocentric Videos


Abstract

Egocentric cameras are wearable cameras mounted on a person's head or shoulder. With their first person view, such cameras are spawning a new set of exciting applications in computer vision. Recognising the activity of the wearer from an egocentric video is an important but challenging problem. The task is made especially difficult by the unavailability of the wearer's pose as well as the extreme camera shake caused by the motion of the wearer's head. Solutions suggested so far have focussed either on short term actions such as pouring and stirring, or on long term activities such as walking and driving. The features used in the two styles are very different, and a technique developed for one style often fails miserably on the other. In this paper we propose a technique to identify whether a long term or a short term action is present in an egocentric video segment. This allows us to build a generic first-person action recognition system that can recognise both short term and long term actions of the wearer. We report an accuracy of 90.15% for our classifier on a publicly available egocentric video dataset comprising 18 hours of video, amounting to 1.9 million tested samples.
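
The snippet below is a schematic, hedged sketch of the idea described above (not the authors' actual features or classifier): a video segment is summarised by simple motion statistics and a binary classifier decides whether it contains a short term action or a long term activity. The feature choice and the synthetic data are assumptions for illustration.

# Short-term vs long-term segment classification: illustrative sketch only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def segment_feature(flow_mags):
    # flow_mags: per-frame mean optical-flow magnitude for one segment
    return [np.mean(flow_mags), np.std(flow_mags), np.max(flow_mags)]

rng = np.random.default_rng(1)
# Placeholder segments: short-term actions here have lower, steadier motion.
short_term = [segment_feature(rng.gamma(2.0, 0.5, size=45)) for _ in range(50)]
long_term = [segment_feature(rng.gamma(2.0, 2.0, size=45)) for _ in range(50)]

X = np.array(short_term + long_term)
y = np.array([0] * 50 + [1] * 50)             # 0 = short term, 1 = long term
clf = LogisticRegression().fit(X, y)
print("train accuracy:", clf.score(X, y))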

Paper

PDF

 

Associated People

  • Suriya Singh
  • Dr. Chetan Arora
  • Prof. C.V. Jawahar

Face Fiducial Detection by Consensus of Exemplars



Abstract

Facial fiducial detection is a challenging problem due to varying pose, appearance, expression, partial occlusion and other factors. In the past, several approaches such as mixtures of trees, regression based methods and exemplar based methods have been proposed to tackle this challenge. In this paper, we propose an exemplar based approach to select the best solution from among the outputs of regression and mixture of trees based algorithms (which we call candidate algorithms). We show that by using a very simple SIFT and HOG based descriptor, it is possible to identify the most accurate fiducial outputs from the set of results produced by the candidate algorithms on any given test image. Our approach manifests as two algorithms, one based on optimizing an objective function with quadratic terms and the other based on simple kNN. Both algorithms take as input the fiducial locations produced by running state-of-the-art candidate algorithms on an input image, and output accurate fiducials using a set of automatically selected exemplar images with annotations. Our surprising result is that a simple algorithm like kNN is able to exploit the seemingly large complementarity of these candidate algorithms better than the optimization based algorithm. We perform extensive experiments on several datasets and show that our approach consistently outperforms the state-of-the-art; in some cases, we report as much as a 10% improvement in accuracy. We also analyze each component of our approach extensively to illustrate its efficacy.


CONTRIBUTIONS

  • Our approach casts fiducial detection as a classification problem of differentiating between the best and the rest among the fiducial detection outputs of state-of-the-art algorithms. To our knowledge, this is the first time such an approach has been attempted.
  • Since we only focus on selecting from a variety of solution candidates, our pre-processing routine can generate outputs corresponding to a variety of face detector initializations, rendering our algorithm insensitive to initialization, unlike other approaches.
  • Combining approaches geared towards sub-pixel accuracy with algorithms designed for robustness leads to our approach outperforming the state-of-the-art in both accuracy and robustness.

Method

[Figure: Method overview]
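
The following is a simplified, hedged sketch of the kNN variant described in the abstract (not the released code): each candidate algorithm's fiducial output is summarised by a descriptor (in the paper, SIFT/HOG features around the predicted fiducials; here a placeholder vector), and the candidate whose descriptor lies closest to annotated exemplar descriptors is selected.

# kNN-based selection among candidate fiducial outputs: illustrative sketch only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
exemplar_descs = rng.normal(size=(500, 128))    # placeholder exemplar descriptors
nn = NearestNeighbors(n_neighbors=5).fit(exemplar_descs)

def select_best(candidate_descs):
    # candidate_descs: one descriptor per candidate algorithm's output
    dists, _ = nn.kneighbors(candidate_descs)   # (n_candidates, 5)
    scores = dists.mean(axis=1)                 # lower = closer to exemplars
    return int(np.argmin(scores))

candidates = rng.normal(size=(3, 128))          # e.g. mixture-of-trees, two regressors
print("selected candidate:", select_best(candidates))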

 


Code and Dataset

Code.

We evaluate our algorithms on three standard benchmark datasets: LFPW, COFW and AFLW.

In case of queries/doubts, please contact the authors.


Related Publications

  • Mallikarjun B R, Visesh Chari, C. V. Jawahar, Akshay Asthana - Face Fiducial Detection by Consensus of Exemplars, Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), 2016. [PDF]


Results

[Figures: qualitative results; failure rate graphs on LFPW, COFW and AFLW]

Associated People

  • Mallikarjun B R
  • Visesh Chari
  • C. V. Jawahar
  • Akshay Asthana

 

Medical Image Perception


Insights into the behavioural and cognitive aspects that underlie how radiologists read medical images and arrive at a diagnosis are useful in many areas, such as the design of displays and the improvement of training to reduce radiologists' performance errors. We are interested in using such insights to develop visual search models and to design novel image analysis algorithms.

Our current work focusses on the problem of reading and diagnosing from chest X-ray images. Specifically, studies are underway to understand the relationship between gaze patterns and abnormality detection performance.


People Involved

  • Varun
  • Samrudhdhi Rangrej

Fine-Grained Descriptions for Domain Specific Videos


Abstract


 

In this work, we attempt to describe videos from a specific domain: broadcast videos of lawn tennis matches. Given a video shot from a tennis match, we intend to generate a textual commentary similar to what a human expert would write on a sports website. Unlike many recent works that focus on generating short captions, we are interested in generating semantically richer descriptions. This demands a detailed low-level analysis of the video content, especially the actions and interactions among subjects. We address this by limiting our domain to the game of lawn tennis. Rich descriptions are generated by leveraging a large corpus of human created descriptions harvested from the Internet. We evaluate our method on a newly created tennis video dataset. Extensive analysis demonstrates that our approach addresses both the semantic correctness and the readability aspects of the task. We demonstrate the utility of simultaneously using vision, language and machine learning techniques in a domain specific environment to produce semantically rich and human-like descriptions. The proposed method can be readily adapted to situations where activities occur in a limited context and the linguistic diversity is confined.
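
As a hedged illustration of the retrieval idea above (not the paper's full pipeline), the sketch below matches labels predicted for a video shot against a corpus of human-written commentary using TF-IDF similarity and returns the most compatible sentence. The corpus sentences and the label string are made-up placeholders.

# Retrieving a commentary sentence for predicted shot labels: illustrative sketch only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Federer serves wide and wins the point with a forehand winner",
    "Nadal hits a backhand down the line from the baseline",
    "A long rally ends with a volley at the net",
]
predicted_labels = "serve forehand winner"       # placeholder recognition output

vec = TfidfVectorizer().fit(corpus)
sims = cosine_similarity(vec.transform([predicted_labels]), vec.transform(corpus))
print("selected description:", corpus[sims.argmax()])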


[Figure: Approach overview]

 


Results

[Figures: positive and negative example descriptions; additional results]


Supplementary Video

 

 


Related Publications

  • Mohak Sukhwani, C. V. Jawahar - Tennis Vid2Text: Fine-grained Descriptions for Domain Specific Videos, Proceedings of the 26th British Machine Vision Conference (BMVC), 07-10 Sep 2015, Swansea, UK. [Paper] [Supplementary] [Abstract] [Poster]

Dataset

  • Lawn Tennis Dataset: Dataset

Associated People

  • Mohak Kumar Sukhwani
  • C.V. Jawahar

Fine-Tuning Human Pose Estimation in Videos


Digvijay Singh     Vineeth Balasubramanian     C. V. Jawahar

Overview

We propose a semi-supervised self-training method for fine-tuning human pose estimates in videos that provides accurate estimates even for complex sequences. We surpass the state-of-the-art on most of the datasets used and also show a gain over the baseline on our new dataset of unrestricted sports videos. The self-training model has two components: a static Pictorial Structure based model and a dynamic ensemble of exemplars. We present a pose quality criterion that is primarily used for batch selection and automatic parameter selection; the same criterion serves as a low-level pose evaluator in post-processing. We set a new challenge by introducing CVIT-SPORTS, a complex dataset with full human body-part annotations containing videos from the sports domain. The strength of our method is demonstrated by adapting to videos of complex activities such as cricket bowling, cricket batting and football, as well as to available standard datasets.

Here we release our MATLAB implementation of [1]. For more details about the method, please refer to the paper.
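
The snippet below is a schematic, hedged sketch of the self-training loop described above, with placeholder estimator and quality functions standing in for the actual Pictorial Structure model, exemplar ensemble and pose-quality criterion of the released MATLAB code: high-scoring frames are promoted to exemplars in batches, and the growing exemplar set re-estimates the remaining frames.

# Self-training loop for pose refinement: illustrative sketch only.
import numpy as np

rng = np.random.default_rng(3)
frames = [rng.normal(size=(14, 2)) for _ in range(100)]   # placeholder "frames"

def static_estimate(frame):
    # Stand-in for a Pictorial Structure based estimator: returns a 14-joint pose.
    return frame + rng.normal(scale=0.1, size=frame.shape)

def exemplar_estimate(frame, exemplars):
    # Stand-in for the exemplar ensemble: average of the nearest exemplar poses.
    dists = [np.linalg.norm(frame - e) for e in exemplars]
    nearest = np.argsort(dists)[:5]
    return np.mean([exemplars[i] for i in nearest], axis=0)

def pose_quality(pose):
    # Stand-in for the pose-quality criterion used for batch selection.
    return -np.var(pose)

poses = [static_estimate(f) for f in frames]
exemplars, remaining = [], list(range(len(frames)))
for _ in range(3):                                   # a few self-training rounds
    scored = sorted(remaining, key=lambda i: pose_quality(poses[i]), reverse=True)
    batch, remaining = scored[:20], scored[20:]      # promote the best-scoring batch
    exemplars.extend(poses[i] for i in batch)
    poses = [exemplar_estimate(frames[i], exemplars) if i in remaining else poses[i]
             for i in range(len(frames))]
print(len(exemplars), "exemplars,", len(remaining), "frames still being refined")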

 


Downloads 

  • fine_tuning_pose.tar.gz: Matlab code for fine-tuning human pose estimation in videos (94 MB)
  • README: Description of running the code and other info (4.0 KB)
  • cvit_sports_videos.tar.gz: CVIT-SPORTS-Videos dataset of 11 video sequences from the cricket domain (66 MB)

References 

[1] D. Singh, V. Balasubramanian, C. V. Jawahar. Fine-Tuning Human Pose Estimations in Videos. WACV 2016.

[2] Y. Yang, D. Ramanan. Articulated Pose Estimation using Flexible Mixtures of Parts. CVPR 2011.

[3] A. Cherian, J. Mairal, K. Alahari, C. Schmid. Mixing Body-Part Sequences for Human Pose Estimation. CVPR 2014.

[4] B. Sapp, D. Weiss, B. Taskar. Parsing Human Motion with Stretchable Models. CVPR 2011.

[5] T. Malisiewicz, A. Gupta, A. Efros. Ensemble of Exemplar-SVMs for Object Detection and Beyond. ICCV 2011.

 
