Human Pose Retrieval for Image and Video collections

Nataraj Jammalamadaka (homepage)

Abstract

With the overwhelming amount of visual data on the internet, a search capability for this data is clearly needed. While several commercial systems have been built to retrieve images and videos using metadata, many images and videos do not have detailed descriptions. This problem has been addressed by content-based image retrieval (CBIR) systems, which retrieve images by their visual content. In this thesis, we demonstrate that images and videos can be retrieved using the pose of the humans present in them. Here, pose is the 2D/3D spatial arrangement of anatomical body parts such as arms and legs. Pose is an important cue that conveys the action, gesture and mood of a person. Retrieving humans using pose has commercial implications in domains such as dance (the query being a dance pose) and sports (the query being a shot). We propose three pose representations that can be used for retrieval. Using one of these representations, we build a real-time pose retrieval system over millions of movie frames.

Our first pose representation is based on the output of human pose estimation (HPE) algorithms [5, 26, 80, 103], which estimate the pose of a person given an image. Unfortunately, these algorithms are not entirely reliable and often make mistakes. We address this problem by proposing an evaluator that predicts whether an HPE algorithm has succeeded. We describe the stages required to learn and test an evaluator, including the use of an annotated ground-truth dataset for training and testing, and the development of auxiliary features that are not used by the HPE algorithm but can be used by the evaluator to predict whether the output is correct. To demonstrate our ideas, we build an evaluator for each of four recently developed HPE algorithms using their publicly available implementations: Andriluka et al. [5], Eichner and Ferrari [26], Sapp et al. [80], and Yang and Ramanan [103].
We demonstrate that in each case our evaluator is able to predict whether the algorithm has correctly estimated the pose. In this context, we also provide a new dataset of annotated stickmen. Further, we propose innovative ways in which a pose evaluator can be used. Specifically, we show how a pose evaluator can be used to filter incorrect pose estimates, to fuse outputs from different HPE algorithms, and to improve a pose search application.

Our second pose representation is inspired by poselets [13], which are body part detectors. First, we introduce deep poselets for pose-sensitive detection of various body parts, built on convolutional neural network (CNN) features. These deep poselets significantly outperform previous instantiations of Berkeley poselets [13]. Second, using these detector responses, we construct a pose representation that is suitable for pose search, and show that its pose retrieval performance is on par with state-of-the-art pose representations. The compared methods include bag of visual words [85], Berkeley poselets [13] and human pose estimation algorithms [103, 18, 68]. All the methods are quantitatively evaluated on a large dataset of images built from a number of standard benchmarks together with frames from Hollywood movies.

Our third pose representation is based on embedding an image into a lower-dimensional pose-sensitive manifold. Here we make the following contributions: (a) we design an optimized neural network which maps the input image to a very low-dimensional space where similar poses are close together and dissimilar poses are far apart, and (b) we show that a pose retrieval system using this low-dimensional representation is on par with the deep poselet representation.

Finally, we describe a method for real-time video retrieval where the task is to match the 2D human pose of a query.
A user can form a query by (i) interactively controlling a stickman on a web-based GUI, (ii) uploading an image of the desired pose, or (iii) using the Kinect to act out the query. The method is scalable and is applied to a dataset of 22 movies totaling more than three million frames. Real-time performance is achieved by searching for approximate nearest neighbors to the query using a random forest of K-D trees. Apart from the query modalities, we introduce two other areas of novelty. First, we show that pose retrieval can proceed using a low-dimensional representation. Second, we show that the precision of the results can be improved substantially by combining the outputs of independent human pose estimation algorithms. The performance of the system is assessed quantitatively over a range of pose queries.
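
To make the nearest-neighbor step more concrete, the following is a minimal sketch of pose lookup with a k-d tree. It is a simplification and not the system described in the thesis: it uses a single tree with exact search, whereas the abstract describes approximate search over a random forest of K-D trees. The names `build_kdtree` and `nearest_poses`, the two-dimensional pose vectors, and the frame identifiers are all hypothetical and chosen only for illustration.

```python
import heapq
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
Item = Tuple[Vector, str]  # (low-dimensional pose vector, frame identifier)


class KDNode:
    """One k-d tree node: a stored pose vector, its frame id, and two subtrees."""

    def __init__(self, point: Vector, frame_id: str, axis: int,
                 left: "Optional[KDNode]", right: "Optional[KDNode]") -> None:
        self.point = point
        self.frame_id = frame_id
        self.axis = axis
        self.left = left
        self.right = right


def build_kdtree(items: List[Item], depth: int = 0) -> Optional[KDNode]:
    """Build a k-d tree by splitting on the median, cycling through the axes."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    ordered = sorted(items, key=lambda item: item[0][axis])
    mid = len(ordered) // 2
    point, frame_id = ordered[mid]
    return KDNode(point, frame_id, axis,
                  build_kdtree(ordered[:mid], depth + 1),
                  build_kdtree(ordered[mid + 1:], depth + 1))


def nearest_poses(root: Optional[KDNode], query: Vector,
                  k: int = 1) -> List[Tuple[float, str]]:
    """Return up to k (distance, frame_id) pairs closest to query, nearest first."""
    heap: List[Tuple[float, str]] = []  # max-heap via negated distances

    def visit(node: Optional[KDNode]) -> None:
        if node is None:
            return
        dist = math.dist(query, node.point)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, node.frame_id))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, node.frame_id))
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near)
        # Cross the splitting plane only if a closer point could lie beyond it.
        if len(heap) < k or abs(diff) < -heap[0][0]:
            visit(far)

    visit(root)
    return sorted((-neg_dist, frame_id) for neg_dist, frame_id in heap)


# Toy index: hypothetical 2-D pose descriptors for four movie frames.
toy_frames = [((0.0, 0.0), "f0"), ((1.0, 1.0), "f1"),
              ((5.0, 5.0), "f2"), ((0.9, 1.2), "f3")]
toy_tree = build_kdtree(toy_frames)
for distance, frame_id in nearest_poses(toy_tree, (1.0, 1.1), k=2):
    print(frame_id, round(distance, 3))
```

An exact single-tree search like this degrades as the dimension of the pose vector grows, which is one reason a low-dimensional representation and approximate search over several randomized trees are attractive at the scale of millions of frames.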

Year of completion:	July 2017
Advisors:	C V Jawahar and Andrew Zisserman

Related Publications

Nataraj Jammalamadaka, Andrew Zisserman and C.V. Jawahar - Human pose search using deep networks Image and Vision Computing 59 (2017): 31-43. [PDF]
Nataraj Jammalamadaka, Andrew Zisserman, C.V. Jawahar - Human Pose Search using Deep Poselets Proceedings of the 11th IEEE International Conference on Automatic Face and Gesture Recognition, 04-08 May 2015, Ljubljana, Slovenia. [PDF]
Nataraj Jammalamadaka, Andrew Zisserman and C.V. Jawahar - Human Pose Search using Deep Poselets Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on. Vol. 1. IEEE, 2015. [PDF]
Digvijay Singh, Ayush Minocha, Nataraj Jammalamadaka and C. V. Jawahar - Real-time Face Detection, Pose Estimation and Landmark Localization Proceedings of the IEEE National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 18-21 Dec. 2013, Jodhpur, India. [PDF]
Nataraj Jammalamadaka, Ayush Minocha, Digvijay Singh and C V Jawahar - Parsing Clothes in Unrestricted Images Proceedings of the 24th British Machine Vision Conference (BMVC), 09-13 Sep. 2013, Bristol, UK. [PDF]
Digvijay Singh, Ayush Minocha, Nataraj Jammalamadaka and C.V. Jawahar - Near Real-Time Face Parsing Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2013 Fourth National Conference on. IEEE, 2013. [PDF]
Nataraj Jammalamadaka, Andrew Zisserman, Marcin Eichner, Vittorio Ferrari and C V Jawahar - Video Retrieval by Mimicking Poses Proceedings of ACM International Conference on Multimedia Retrieval (ICMR), 5-8 June 2012, Article No.34, ISBN 978-1-4503-1329-2, Hong Kong. [PDF]
Nataraj Jammalamadaka, Andrew Zisserman, Marcin Eichner, Vittorio Ferrari and C V Jawahar - Has My Algorithm Succeeded? An Evaluator for Human Pose Estimators Proceedings of 12th European Conference on Computer Vision, 7-13 Oct. 2012, Print ISBN 978-3-642-33711-6, Vol. ECCV 2012, Part-III, LNCS 7574, pp. 114-128, Firenze, Italy. [PDF]
Nataraj Jammalamadaka, Vikram Pudi and C. V. Jawahar - Efficient Search with Changing Similarity Measures on Large Multimedia Datasets Proc. of The International Multimedia Modelling Conference (MMM2007), LNCS 4352, Part-II, pp. 206-215, 2007. [PDF]
C. V. Jawahar, Balakrishna Chennupati, Balamanohar Paluri and Nataraj Jammalamadaka - Video Retrieval Based on Textual Queries, Proceedings of the Thirteenth International Conference on Advanced Computing and Communications (ICACCS), Coimbatore, December 2005. [PDF]