CVPR 2016 Paper - First Person Action Recognition Using Deep Learned Descriptors


Abstract

We focus on the problem of recognizing the wearer's actions in first person, a.k.a. egocentric, videos. This problem is more challenging than third person activity recognition due to the unavailability of the wearer's pose and the sharp movements in the videos caused by the natural head motion of the wearer. Carefully crafted features based on hand and object cues have been shown to be successful on limited targeted datasets. We propose convolutional neural networks (CNNs) for end-to-end learning and classification of the wearer's actions. The proposed network makes use of egocentric cues by capturing hand pose, head motion, and a saliency map. It is compact and can be trained from the relatively small labeled egocentric video datasets that are available. We show that the proposed network generalizes and gives state-of-the-art performance on various disparate egocentric action datasets.
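The abstract describes a network that combines several egocentric cue streams (hand pose, head motion, saliency). The exact fusion scheme is in the paper, not here; below is only a minimal numpy sketch of *late score fusion* over per-stream class scores, with made-up logits and equal weights as illustrative assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(stream_logits, weights=None):
    """Fuse per-stream class logits (e.g. hand, head-motion, saliency cues)
    into one class distribution by weighted averaging of softmax scores.
    stream_logits: list of (num_classes,) arrays, one per cue stream."""
    probs = np.stack([softmax(l) for l in stream_logits])   # (streams, classes)
    if weights is None:
        weights = np.ones(len(stream_logits)) / len(stream_logits)
    fused = np.average(probs, axis=0, weights=weights)      # (classes,)
    return int(fused.argmax()), fused

# Toy example: three cue streams, four action classes (values are illustrative).
hand = np.array([2.0, 0.1, 0.0, -1.0])
head = np.array([1.5, 0.5, 0.2, 0.0])
sal  = np.array([1.8, 0.0, 0.3, -0.5])
label, scores = late_fusion([hand, head, sal])   # all streams favour class 0
```

Whether scores are averaged, concatenated, or fused by a learned layer is a design choice; this sketch shows the simplest option.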

Method

Figure: 2D and 3D architectures

Figure: fusion architecture

Paper

PDF

Downloads

Code

Datasets and annotations

* Note: All videos are processed at 15 fps.


PR 2016 Paper - Trajectory Aligned Features For First Person Action Recognition


Abstract

Egocentric videos are characterised by the first person view they provide. With the popularity of Google Glass and GoPro, the use of egocentric videos is on the rise. Recognizing the actions of the wearer from egocentric videos is an important problem. Unstructured movement of the camera due to the natural head motion of the wearer causes sharp changes in the visual field of the egocentric camera, causing many standard third person action recognition techniques to perform poorly on such videos. Objects present in the scene and hand gestures of the wearer are the most important cues for first person action recognition, but are difficult to segment and recognize in an egocentric video. We propose a novel representation of first person actions derived from feature trajectories. The features are simple to compute using standard point tracking and, unlike many previous approaches, do not assume segmentation of hands/objects or recognition of object or hand pose. We train a bag-of-words classifier with the proposed features and report a performance improvement of more than 11% on publicly available datasets. Although not designed for this particular case, we show that our technique can also recognize the wearer's actions when hands or objects are not visible.
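The abstract's pipeline rests on two standard ingredients: a trajectory-shape descriptor from tracked points, and a bag-of-words histogram over a codebook. The sketch below is a generic, simplified numpy version of those two steps (the toy codebook and random tracks are placeholders; a real codebook would come from k-means over training descriptors):

```python
import numpy as np

def trajectory_descriptor(track):
    """Describe a point track by its normalised displacement vectors,
    in the style of dense-trajectory shape descriptors.
    track: (L, 2) array of (x, y) point positions over L frames."""
    disp = np.diff(track, axis=0)                    # (L-1, 2) frame-to-frame motion
    norm = np.sum(np.linalg.norm(disp, axis=1))
    return (disp / norm).ravel() if norm > 0 else disp.ravel()

def bow_histogram(descriptors, codebook):
    """Quantise descriptors against a codebook and build a normalised
    bag-of-words histogram, which is what the classifier consumes."""
    hist = np.zeros(len(codebook))
    for d in descriptors:
        hist[np.argmin(np.linalg.norm(codebook - d, axis=1))] += 1
    return hist / max(hist.sum(), 1)

rng = np.random.default_rng(0)
tracks = [np.cumsum(rng.normal(size=(16, 2)), axis=0) for _ in range(10)]
descs = [trajectory_descriptor(t) for t in tracks]
codebook = np.stack(descs[:4])   # toy codebook; in practice, k-means centres
h = bow_histogram(descs, codebook)
```

The normalisation by total trajectory length makes the descriptor scale-invariant, which is why point tracking alone suffices without hand or object segmentation.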

Paper

PDF

Downloads

Code

Dataset and Annotations


NCVPRIPG 2015 Paper - Generic Action Recognition from Egocentric Videos


Abstract

Egocentric cameras are wearable cameras mounted on a person’s head or shoulder. With their first person view, such cameras are spawning a new set of exciting applications in computer vision. Recognising the activity of the wearer from an egocentric video is an important but challenging problem. The task is made especially difficult by the unavailability of the wearer’s pose as well as extreme camera shake due to the motion of the wearer’s head. Solutions suggested so far have focussed either on short term actions, such as pour and stir, or on long term activities, such as walking and driving. The features used in the two styles are very different, and techniques developed for one style often fail miserably on the other. In this paper we propose a technique to identify whether a long term or a short term action is present in an egocentric video segment. This allows us to build a generic first-person action recognition system that can recognise both short term and long term actions of the wearer. We report an accuracy of 90.15% for our classifier on a publicly available egocentric video dataset comprising 18 hours of video, amounting to 1.9 million tested samples.
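The 1.9 million tested samples come from scanning long videos segment by segment. Purely as an illustration of that sliding-window setup, here is a numpy sketch that labels each window as short term or long term; the variance-threshold rule is a stand-in assumption, not the paper's learned classifier:

```python
import numpy as np

def classify_windows(motion_mag, win=15, thresh=2.0):
    """Slide a fixed window over per-frame global motion magnitudes and
    label each window 'short_term' (bursty motion, e.g. pour/stir) or
    'long_term' (steady motion, e.g. walking). The variance threshold
    here is an illustrative stand-in for the learned classifier."""
    labels = []
    for s in range(0, len(motion_mag) - win + 1, win):
        w = motion_mag[s:s + win]
        labels.append('short_term' if np.var(w) > thresh else 'long_term')
    return labels

rng = np.random.default_rng(1)
steady = rng.normal(5.0, 0.2, 45)   # low-variance motion: long term activity
bursty = rng.normal(5.0, 3.0, 45)   # high-variance motion: short term action
labels = classify_windows(np.concatenate([steady, bursty]))
```

At 15 fps (as noted above for the released videos), a 15-frame window corresponds to one second of video.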

Paper

PDF


Associated People

Face Fiducial Detection by Consensus of Exemplars



Abstract

Facial fiducial detection is a challenging problem for several reasons, such as varying pose, appearance, expression, and partial occlusion. In the past, several approaches, such as mixtures of trees, regression based methods, and exemplar based methods, have been proposed to tackle this challenge. In this paper, we propose an exemplar based approach to select the best solution from among the outputs of regression and mixture-of-trees based algorithms (which we call candidate algorithms). We show that by using a very simple SIFT and HOG based descriptor, it is possible to identify the most accurate fiducial outputs from the set of results produced by the candidate algorithms on any given test image. Our approach manifests as two algorithms: one based on optimizing an objective function with quadratic terms, and the other based on simple kNN. Both algorithms take as input the fiducial locations produced by running state-of-the-art candidate algorithms on an input image, and output accurate fiducials using a set of automatically selected exemplar images with annotations. Our surprising result is that in this case a simple algorithm like kNN is able to exploit the seemingly huge complementarity of the candidate algorithms better than the optimization based algorithm. We do extensive experiments on several datasets and show that our approach consistently outperforms the state of the art, in some cases by as much as 10% in accuracy. We also extensively analyze each component of our approach to illustrate its efficacy.


Contributions

  • Our approach casts fiducial detection as a classification problem: differentiating the best from the rest among the fiducial detection outputs of state-of-the-art algorithms. To our knowledge, this is the first time such an approach has been attempted.
  • Since we only select from a variety of candidate solutions, our pre-processing routine can generate outputs corresponding to a variety of face detector initializations, rendering our algorithm insensitive to initialization, unlike other approaches.
  • Combining approaches geared for sub-pixel accuracy with algorithms designed for robustness leads our approach to outperform the state of the art in both accuracy and robustness.
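To make the "best vs the rest" idea concrete, here is a heavily simplified numpy sketch of kNN-based selection: candidate outputs are summarised by descriptors, the nearest annotated exemplars are found, and each exemplar votes for the candidate algorithm that worked best on it. The descriptor construction, voting scheme, and all numbers are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def knn_select(candidate_feats, exemplar_feats, exemplar_best, k=3):
    """Pick which candidate algorithm's fiducial output to trust for a test face.
    candidate_feats: (A, D) descriptor per candidate output (e.g. SIFT/HOG
    pooled around predicted fiducials); exemplar_feats: (N, D) descriptors of
    exemplar images; exemplar_best: (N,) index of the candidate algorithm that
    was most accurate on each exemplar. The k nearest exemplars vote."""
    query = candidate_feats.mean(axis=0)          # crude summary of the test face
    nn = np.argsort(np.linalg.norm(exemplar_feats - query, axis=1))[:k]
    votes = np.bincount(exemplar_best[nn], minlength=len(candidate_feats))
    return int(votes.argmax())

rng = np.random.default_rng(2)
cand = rng.normal(size=(3, 8))                    # 3 candidate algorithms (toy)
exem = rng.normal(size=(10, 8))                   # 10 exemplars (toy)
best = np.array([1, 1, 0, 2, 1, 1, 0, 1, 2, 1])   # toy per-exemplar winners
choice = knn_select(cand, exem, best)
```

The point of the abstract's "surprising result" is that even this kind of plain nearest-neighbour voting can arbitrate well between strong, complementary candidate algorithms.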

Method




Code and Dataset

Code.

We evaluate our algorithms on three state-of-the-art datasets: LFPW, COFW, and AFLW.

In case of queries or doubts, please contact the authors.


Related Publications


Results

Figures: results; failure-rate graphs on LFPW, COFW, and AFLW.

Associated People


Fine-Grained Descriptions for Domain Specific Videos


Abstract


In this work, we attempt to describe videos from a specific domain: broadcast videos of lawn tennis matches. Given a video shot from a tennis match, we intend to generate a textual commentary similar to what a human expert would write on a sports website. Unlike many recent works that focus on generating short captions, we are interested in generating semantically richer descriptions. This demands a detailed low-level analysis of the video content, especially the actions and interactions among subjects. We address this by limiting our domain to the game of lawn tennis. Rich descriptions are generated by leveraging a large corpus of human created descriptions harvested from the Internet. We evaluate our method on a newly created tennis video dataset. Extensive analysis demonstrates that our approach addresses both the semantic correctness and the readability aspects of the task. We demonstrate the utility of the simultaneous use of vision, language, and machine learning techniques in a domain specific environment to produce semantically rich and human-like descriptions. The proposed method can be readily adopted in situations where activities occur in a limited context and the linguistic diversity is confined.
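The corpus-leveraging step can be pictured as retrieval: tags predicted from the video select the best-matching human-written commentary. The sketch below is a deliberately naive word-overlap version of that idea; the tags, corpus sentences, and scoring function are all illustrative placeholders, not the paper's actual phrase-level pipeline:

```python
def retrieve_commentary(predicted_tags, corpus):
    """Given tags predicted from a tennis shot (action, stroke, outcome),
    retrieve the human-written commentary whose words overlap most with
    them -- a simplified stand-in for the paper's description pipeline."""
    def score(sentence):
        words = set(sentence.lower().split())
        return len(words & set(predicted_tags))
    return max(corpus, key=score)

# Toy corpus of human-written commentaries (invented examples).
corpus = [
    "Federer serves wide and wins the point with a forehand winner",
    "Nadal hits a backhand slice down the line",
    "A long rally ends with a double fault",
]
tags = ["serves", "forehand", "winner"]
print(retrieve_commentary(tags, corpus))  # matches the first sentence
```

Restricting the domain to tennis is what makes such retrieval viable: the space of plausible descriptions is small and the vocabulary confined.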




Results



Supplementary Video



Related Publications


Dataset


Associated People

Medical Image Perception


Insights into the behavioural and cognitive aspects that underlie how radiologists read medical images and arrive at a diagnosis are useful in many areas, such as the design of displays and improving training to reduce radiologists' performance errors. We are interested in using such insights to develop visual search models and design novel image analysis algorithms.

Our current work focusses on the problem of reading and diagnosing from chest X-ray images. Specifically, studies are underway to understand the relationship between gaze patterns and abnormality detection performance.


People Involved

Fine-Tuning Human Pose Estimation in Videos


Digvijay Singh     Vineeth Balasubramanian     C. V. Jawahar

Overview

We propose a semi-supervised self-training method for fine-tuning human pose estimation in videos that provides accurate estimates even for complex sequences. We surpass the state of the art on most of the datasets used and also show a gain over the baseline on our new dataset of unrestricted sports videos. The self-training model has two components: a static pictorial-structure-based model and a dynamic ensemble of exemplars. We present a pose quality criterion that is used primarily for batch selection and automatic parameter selection; the same criterion serves as a low-level pose evaluator in post-processing. We also introduce a new challenge, CVIT-SPORTS, a complex dataset of sports videos with full human body-part annotations. The strength of our method is demonstrated by adapting to videos of complex activities such as cricket bowling, cricket batting, and football, as well as to available standard datasets.
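The released code implements the full method; purely as intuition for the role of the pose quality criterion in batch selection, here is a toy numpy sketch. The specific score (mean detector confidence blended with temporal smoothness) and its weights are illustrative assumptions, not the criterion from the paper:

```python
import numpy as np

def pose_quality(confidences, joints_t, joints_prev, alpha=0.5):
    """Toy pose-quality score: blend mean detector confidence with temporal
    smoothness, penalising large joint displacement between consecutive
    frames. The functional form and weight alpha are illustrative."""
    smooth = np.exp(-np.linalg.norm(joints_t - joints_prev, axis=1).mean())
    return alpha * np.mean(confidences) + (1 - alpha) * smooth

def select_batch(frames, top_frac=0.25):
    """Keep the top fraction of frames by quality as the next self-training
    batch; low-quality estimates are left out rather than reinforced."""
    ranked = sorted(frames, key=lambda f: f['quality'], reverse=True)
    return ranked[:max(1, int(len(ranked) * top_frac))]

conf = np.array([0.9, 0.8, 0.95])
j_t = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
q_still = pose_quality(conf, j_t, j_t)          # identical frames: smoothness = 1
q_jump  = pose_quality(conf, j_t, j_t + 50.0)   # large jump: smoothness near 0
batch = select_batch([{'quality': q_still}, {'quality': q_jump}], top_frac=0.5)
```

Selecting only high-quality estimates for retraining is what keeps self-training from drifting on hard frames.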

Here we release our MATLAB implementation of [1]. For details of the method, see the paper [1] referenced below.

 


Downloads 

Filename | Description | Size
fine_tuning_pose.tar.gz | Matlab code for fine-tuning human pose estimation in videos. | 94 MB
README | Description on running the code and other info. | 4.0 KB
cvit_sports_videos.tar.gz | CVIT-SPORTS-Videos dataset of 11 video sequences from the cricket domain. | 66 MB

References 

[1] D. Singh, V. Balasubramanian, C. V. Jawahar. Fine-Tuning Human Pose Estimations in Videos. WACV 2016.

[2] Y. Yang, D. Ramanan. Articulated Pose Estimation using Flexible Mixtures of Parts. CVPR 2011.

[3] A. Cherian, J. Mairal, K. Alahari, C. Schmid. Mixing Body-Part Sequences for Human Pose Estimation. CVPR 2014.

[4] B. Sapp, D. Weiss, B. Taskar. Parsing Human Motion with Stretchable Models. CVPR 2011.

[5] T. Malisiewicz, A. Gupta, A. Efros. Ensemble of Exemplar-SVMs for Object Detection and Beyond. ICCV 2011.