Recognizing People in Images and Videos

Vijay Kumar


Cameras and mobile phones have become an integral part of our everyday lives as they have grown more portable, powerful, and affordable. We capture and share hundreds of pictures and videos with our friends, family, and social connections. Similarly, large volumes of visual content are generated in surveillance, entertainment, and biometrics applications. Without any doubt, people are the most important subjects in this visual content; for instance, photos taken at a family event, or movie videos, revolve around humans. It is therefore of utmost importance to automatically detect, identify, and analyze the people appearing in images in order to better understand this content and make decisions around it.

In this thesis, we consider the problem of person detection and recognition in images. This is a well-explored topic in the vision community, with a vast literature devoted to these problems. Current state-of-the-art recognition systems can identify people with a high degree of accuracy when images have high resolution, contain visible and near-frontal faces, and the system has access to a sufficiently large training gallery. However, these systems need significant improvement in challenging real-world applications such as surveillance or entertainment videos, where one must handle several practical issues such as non-visibility of faces, limited training samples, and domain mismatch, in addition to instance variations in pose, illumination, and resolution. While there are many challenges pertaining to person recognition, we focus on some of the open challenges that are most relevant from a deployment perspective in diverse recognition scenarios.

We first consider people detection, which is a prerequisite for any recognition system. We detect people in images by detecting their faces with an exemplar-based detector.
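As a rough illustration of how such an exemplar-based detector operates, the sketch below (function names and data layouts are hypothetical, not the thesis implementation) indexes exemplar features in a bag-of-words inverted index and lets matched visual words cast Hough votes for likely face centres:

```python
# Hypothetical sketch of exemplar-based face detection via Hough voting.
# Each quantized local feature (visual word) in a test image retrieves
# matching exemplar features and casts a vote for the face centre they imply.
import numpy as np
from collections import defaultdict

def build_index(exemplars):
    """Build a bag-of-words inverted index over exemplar features.

    exemplars: list (one entry per exemplar image) of
               (word_id, offset_to_face_centre) pairs.
    """
    index = defaultdict(list)
    for ex_id, features in enumerate(exemplars):
        for word, offset in features:
            index[word].append((ex_id, np.asarray(offset, dtype=float)))
    return index

def hough_vote(index, test_features, image_shape):
    """Accumulate votes for face centres from matched visual words."""
    votes = np.zeros(image_shape)
    for word, location in test_features:
        for _, offset in index.get(word, []):
            centre = np.asarray(location, dtype=float) + offset
            y, x = int(round(centre[0])), int(round(centre[1]))
            if 0 <= y < image_shape[0] and 0 <= x < image_shape[1]:
                # Contextual weighting would scale this vote by how
                # consistent the word is with its global context.
                votes[y, x] += 1.0
    return votes
```

Peaks in the vote map then become candidate detections; the visual phrases and contextual weighting introduced below refine which features vote and how strongly.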
The exemplar approach detects faces through Hough voting over an exemplar training database indexed with the bag-of-words method. We propose two key ideas, referred to as “visual phrases” and “contextual weighting”, that significantly improve the exemplar approach. We show that visual phrases, which encode dependencies between visual features, are discriminative, and we propose a strategy to incorporate them into exemplar voting. We also introduce the notion of spatial consistency for a visual feature, which weights each occurrence of a feature based on its global context. Our evaluation on popular in-the-wild face detection benchmarks demonstrates the significant improvement obtained with these ideas.

We then focus on person recognition and consider several issues encountered in practical recognition systems. We first address the common and important issue of the unavailability of sufficient training samples. We propose a solution based on semi-supervised learning that can efficiently learn from a small amount of labeled data and a large amount of unlabeled data, and we demonstrate how the similarities between labeled and unlabeled samples can be exploited to improve performance.

We next consider the problem of domain mismatch between the training gallery (source) and probe instances (target), in a setup where the objective is to identify people in a collection of probe images using a training gallery collected from a different domain. We propose a novel two-stage solution that first labels a few confident seed images from the target domain and then propagates their labels to the remaining images using a graph-based framework. We evaluate our approach in several practical recognition scenarios, such as movie videos and photo albums.

We then consider a different recognition scenario in which faces are not fully visible due to occlusion or people facing away from the camera.
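The graph-based label propagation used in the second stage above can be sketched roughly as follows (a minimal, hypothetical implementation: the similarity matrix, iteration count, and clamping scheme are illustrative assumptions, not the thesis method):

```python
# Hypothetical sketch of graph-based label propagation: a few confidently
# labelled "seed" faces spread their labels to unlabelled faces over a
# similarity graph, with seed labels clamped after every iteration.
import numpy as np

def propagate_labels(W, seed_labels, n_classes, n_iters=50):
    """W: (n, n) symmetric similarity matrix; seed_labels: dict idx -> class."""
    n = W.shape[0]
    # Row-normalise so each node averages its neighbours' label scores.
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    F = np.zeros((n, n_classes))
    for i, c in seed_labels.items():
        F[i, c] = 1.0
    for _ in range(n_iters):
        F = P @ F                          # diffuse labels along edges
        for i, c in seed_labels.items():   # clamp the confident seeds
            F[i] = 0.0
            F[i, c] = 1.0
    return F.argmax(axis=1)
```

With a similarity graph built over target-domain faces, two seeds are enough for their labels to flow to the faces most similar to them.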
To deal with such occluded and partially or completely non-visible faces, we exploit information from other body regions, such as the head, upper body, and full body, to improve recognition. When using multiple body regions, however, pose variation across these regions poses a serious challenge. To handle unreliable facial regions and pose variation, we propose a technique that learns multiple pose-specific representations from different body regions. Our approach trains a separate deep convolutional network for each pose and combines their predictions using adaptive weights determined by the pose of the person.

Person recognition approaches based on multiple body regions, however, require training multiple deep convolutional networks, resulting in a large number of parameters and slow training and testing procedures. To overcome these limitations, we develop an end-to-end person recognition approach based on pooling and aggregating discriminative features from multiple body regions. Our end-to-end convolutional network pools features from several pre-determined regions of interest and adaptively aggregates them using an attention mechanism to produce a compact representation. We evaluate our single end-to-end trained model on multiple person recognition benchmarks and show its effectiveness over multiple models trained on different body regions.

Finally, we note that all of our work has been developed with a keen focus on its applicability to real-world applications. We have created and publicly released datasets and source code in the process.


Year of completion:  July 2019
 Advisors: Anoop M. Namboodiri and C.V. Jawahar

Related Publications

  • Vijay Kumar, Anoop Namboodiri and C.V. Jawahar - Semi-supervised Annotation of Faces in Image Collection, Signal, Image and Video Processing, 2017. [PDF]

  • Vijay Kumar, Anoop Namboodiri, Manohar Paluri and C.V. Jawahar - Pose-Aware Person Recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [PDF]

  • Vijay Kumar, R. Raghavendra, Anoop Namboodiri and Christoph Busch - Robust Transgender Face Recognition: Approach Based on Appearance and Therapy Factors, Proceedings of the IEEE International Conference on Identity, Security and Behavior Analysis (ISBA), 2016. [PDF]

  • Vijay Kumar R., Anoop M. Namboodiri and C.V. Jawahar - Visual Phrases for Exemplar Face Detection, Proceedings of the International Conference on Computer Vision (ICCV 2015), 13-16 Dec 2015, Santiago, Chile. [PDF]

  • Hiba Ahsan, Vijay Kumar and C.V. Jawahar - Multi-Label Annotation of Music, Proceedings of the Eighth International Conference on Advances in Pattern Recognition, 04-07 Jan 2015, Kolkata, India. [PDF]

  • Vijay Kumar, Anoop M. Namboodiri and C.V. Jawahar - Face Recognition in Videos by Label Propagation, Proceedings of the 22nd International Conference on Pattern Recognition, 24-28 Aug 2014, Stockholm, Sweden. [PDF]

  • Vijay Kumar, Harit Pandya and C.V. Jawahar - Identifying Ragas in Indian Music, Proceedings of the 22nd International Conference on Pattern Recognition, 24-28 Aug 2014, Stockholm, Sweden. [PDF]

  • Shankar Setty, Moula Husain, Parisa Beham, Jyothi Gudavalli, Menaka Kandasamy, Radhesyam Vaddi, Vidyagouri Hemadri, JC Karure, Raja Raju, B. Rajan, Vijay Kumar and C.V. Jawahar - Indian Movie Face Database: A Benchmark for Face Recognition Under Wide Variations, Proceedings of the IEEE National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, 18-21 Dec 2013, Jodhpur, India. [PDF]

  • Vijay Kumar, Anoop M. Namboodiri and C.V. Jawahar - Sparse Representation based Face Recognition with Limited Labeled Samples, Proceedings of the 2nd Asian Conference on Pattern Recognition, 05-08 Nov 2013, Okinawa, Japan. [PDF]

  • Vijay Kumar, Amit Bansal, Goutam Hari Tulsiyan, Anand Mishra, Anoop M. Namboodiri and C.V. Jawahar - Sparse Document Image Coding for Restoration, Proceedings of the 12th International Conference on Document Analysis and Recognition, 25-28 Aug 2013, Washington DC, USA. [PDF]