Face Fiducial Detection by Consensus of Exemplars



Facial fiducial detection is a challenging problem because of variations in pose, appearance and expression, partial occlusion, and other factors. In the past, several approaches, such as mixture of trees, regression based methods and exemplar based methods, have been proposed to tackle this challenge. In this paper, we propose an exemplar based approach to select the best solution from among the outputs of regression and mixture of trees based algorithms (which we call candidate algorithms). We show that by using a very simple SIFT and HOG based descriptor, it is possible to identify the most accurate fiducial outputs from a set of results produced by candidate algorithms on any given test image. Our approach manifests as two algorithms, one based on optimizing an objective function with quadratic terms and the other based on simple kNN. Both algorithms take as input the fiducial locations produced by running state-of-the-art candidate algorithms on an input image, and output accurate fiducials using a set of automatically selected exemplar images with annotations. Our surprising result is that, in this case, a simple algorithm like kNN takes better advantage of the seemingly huge complementarity of these candidate algorithms than optimization based algorithms do. We perform extensive experiments on several datasets and show that our approach consistently outperforms the state of the art. In some cases, we report as much as a 10% improvement in accuracy. We also extensively analyze each component of our approach to illustrate its efficacy.


  • Our approach treats fiducial detection as a classification problem: differentiating the best from the rest among the fiducial detection outputs of state-of-the-art algorithms. To our knowledge, this is the first time such an approach has been attempted.
  • Since we focus only on selecting from a variety of solution candidates, our pre-processing routine can generate outputs corresponding to a variety of face detector initializations. This makes our algorithm insensitive to initialization, unlike other approaches.
  • Combining approaches geared for sub-pixel accuracy with algorithms designed for robustness leads to our approach outperforming the state of the art in both accuracy and robustness.
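
The selection idea described above can be illustrated with a small sketch. It is not the released implementation: it assumes each candidate output has already been turned into a descriptor vector (the toy 2-D tuples below stand in for the SIFT and HOG based descriptor), and it assumes that a candidate is scored by its mean distance to its k nearest exemplar descriptors. The function names `knn_score` and `select_best_candidate` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_score(descriptor, exemplar_descriptors, k):
    """Mean distance from a descriptor to its k nearest exemplars (lower is better)."""
    distances = sorted(euclidean(descriptor, e) for e in exemplar_descriptors)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def select_best_candidate(candidate_descriptors, exemplar_descriptors, k=3):
    """Return the index of the candidate output closest to the exemplar set."""
    scores = [knn_score(d, exemplar_descriptors, k) for d in candidate_descriptors]
    return min(range(len(scores)), key=scores.__getitem__)


# Toy data (assumed): one descriptor per candidate algorithm output.
exemplars = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
candidates = [(3.0, 3.0), (0.05, 0.05), (1.0, 1.0)]
best = select_best_candidate(candidates, exemplars, k=3)
print("selected candidate index:", best)
```

Because the sketch only ranks existing outputs, adding another candidate algorithm or another face detector initialization simply means adding one more entry to `candidates`.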




Code and Dataset


We evaluate our algorithms on three state-of-the-art datasets: LFPW, COFW and AFLW.


Related Publications



[Figures: failure rate graphs for LFPW, COFW and AFLW]

Associated People


Medical Image Perception

Insights into the behavioural and cognitive processes that underlie how radiologists read medical images and arrive at a diagnosis are useful in many areas, such as designing displays and improving training to reduce radiologists' performance errors. We are interested in using such insights to develop visual search models and design novel image analysis algorithms.

Our current work focusses on the problem of reading and diagnosing from chest X-ray images. Specifically, studies are underway to understand the relationship between gaze patterns and abnormality detection performance.

People Involved

Fine-Tuning Human Pose Estimation in Videos

Digvijay Singh     Vineeth Balasubramanian     C. V. Jawahar


We propose a semi-supervised self-training method for fine-tuning human pose estimations in videos that provides accurate estimations even for complex sequences. We surpass the state of the art on most of the datasets used and also show a gain over the baseline on our new dataset of unrestricted sports videos. The self-training model presented has two components: a static Pictorial Structure based model and a dynamic ensemble of exemplars. We present a pose quality criterion that is primarily used for batch selection and automatic parameter selection. The same criterion works as a low-level pose evaluator used in post-processing. We set a new challenge by introducing CVIT-SPORTS, a dataset with full human body-part annotations that contains complex videos from the sports domain. The strength of our method is demonstrated by adapting to videos of complex activities such as cricket-bowling, cricket-batting and football, as well as to available standard datasets.
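
The role of the pose quality score in batch selection can be sketched as a simple loop. This is only an illustration under assumptions: the scores are made-up numbers, `batch_size` and `min_quality` are hypothetical parameters, and the optional `rescore` callback stands in for re-estimating poses once new exemplars have been added. It does not reproduce the released MATLAB code.

```python
def self_train(scores, batch_size, min_quality, rescore=None):
    """Greedy batch selection for self-training.

    scores: dict mapping frame id -> pose quality score (higher is better).
    rescore: optional callable (remaining_scores, exemplar_frames) -> new scores,
             a stand-in for re-running pose estimation with the enlarged
             exemplar set (assumed, not part of the original description).
    Returns the batches of frame ids in the order they were selected.
    """
    remaining = dict(scores)
    exemplars = []
    batches = []
    while remaining:
        # Rank frames by decreasing quality; break ties by frame id.
        ranked = sorted(remaining, key=lambda f: (-remaining[f], f))
        batch = [f for f in ranked[:batch_size] if remaining[f] >= min_quality]
        if not batch:
            break  # no remaining frame is trustworthy enough to learn from
        batches.append(batch)
        exemplars.extend(batch)
        for f in batch:
            del remaining[f]
        if rescore is not None:
            remaining = rescore(remaining, exemplars)
    return batches


frame_scores = {0: 0.9, 1: 0.2, 2: 0.8, 3: 0.6, 4: 0.1}
batches = self_train(frame_scores, batch_size=2, min_quality=0.5)
print(batches)
```

Frames that never reach `min_quality` are left unselected, which mirrors the idea of using the same score to reject poor poses in post-processing.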

Here we release our MATLAB implementation of [1]. To read more about the method, see the PDF on the left.



Filename                     Description                                                             Size
fine_tuning_pose.tar.gz      Matlab code for fine-tuning human pose estimation in videos.            94 MB
README                       Description on running the code and other info.                         4.0 KB
cvit_sports_videos.tar.gz    CVIT-SPORTS-Videos dataset of 11 video sequences from cricket domain.   66 MB


[1] D. Singh, V. Balasubramanian, C. V. Jawahar. Fine-Tuning Human Pose Estimations in Videos. WACV 2016.

[2] Y. Yang, D. Ramanan. Articulated Pose Estimation using Flexible Mixtures of Parts. CVPR 2011.

[3] A. Cherian, J. Mairal, K. Alahari, C. Schmid. Mixing Body-Part Sequences for Human Pose Estimation. CVPR 2014.

[4] B. Sapp, D. Weiss, B. Taskar. Parsing Human Motion with Stretchable Models. CVPR 2011.

[5] T. Malisiewicz, A. Gupta, A. Efros. Ensemble of Exemplar-SVMs for Object Detection and Beyond. ICCV 2011.


Fine-Grained Descriptions for Domain Specific Videos




In this work, we attempt to describe videos from a specific domain: broadcast videos of lawn tennis matches. Given a video shot from a tennis match, we intend to generate a textual commentary similar to what a human expert would write on a sports website. Unlike many recent works that focus on generating short captions, we are interested in generating semantically richer descriptions. This demands a detailed low-level analysis of the video content, especially the actions and interactions among subjects. We address this by limiting our domain to the game of lawn tennis. Rich descriptions are generated by leveraging a large corpus of human-created descriptions harvested from the Internet. We evaluate our method on a newly created tennis video dataset. Extensive analyses demonstrate that our approach addresses both the semantic correctness and the readability aspects of the task. We demonstrate the utility of the simultaneous use of vision, language and machine learning techniques in a domain-specific environment to produce semantically rich and human-like descriptions. The proposed method can be readily adapted to situations where activities occur in a limited context and the linguistic diversity is confined.
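
One simple way to picture how a corpus of human-written commentary can be reused is retrieval by word overlap. The sketch below is an assumption made for illustration only, not the method of the paper: it takes action phrases that a video analysis stage is assumed to have detected and ranks corpus sentences by Jaccard similarity. All sentences and names here are invented.

```python
def tokenize(text):
    """Lower-case a sentence and split it into a set of words."""
    for mark in ",.;:!?":
        text = text.replace(mark, " ")
    return set(text.lower().split())


def rank_descriptions(detected_phrases, corpus):
    """Sort corpus sentences by Jaccard overlap with the detected phrases."""
    query = set()
    for phrase in detected_phrases:
        query |= tokenize(phrase)

    def score(sentence):
        words = tokenize(sentence)
        union = query | words
        return len(query & words) / len(union) if union else 0.0

    return sorted(corpus, key=score, reverse=True)


# Hypothetical detected actions and a tiny hypothetical commentary corpus.
detected = ["serve", "forehand return"]
corpus = [
    "The player hits a forehand down the line.",
    "The crowd applauds between games.",
    "A serve is answered with a forehand return.",
]
ranked = rank_descriptions(detected, corpus)
print(ranked[0])
```

A confined vocabulary, as in tennis commentary, is what makes such overlap-based matching plausible at all; in an open domain the same phrases would match far too many unrelated sentences.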







Supplementary Video



Related Publications


Associated People

Fine Pose Estimation of Known Objects from a Single Scene Image


Understanding the precise 3D structure of an environment is one of the fundamental goals of computer vision and is challenging due to a variety of factors such as appearance variation, illumination, pose, noise, occlusion and scene clutter. A generic solution to the problem is ill-posed due to the loss of depth information during imaging. In this paper, we consider a specific but common situation, where the scene contains known objects. Given 3D models of a set of known objects and a cluttered scene image, we try to detect these objects in the image and align the 3D models to their images to find their exact pose. We develop an approach that poses this as a 3D-to-2D alignment problem. We also deal with pose estimation of 3D articulated objects in images. We evaluate our proposed method on the BigBird dataset and our own tabletop dataset, and present experimental comparisons with state-of-the-art methods.

[Figures: cropped cluttered scene images (top row) and the same scenes with CAD models fitted (bottom row)]

The problem is this: given a single cluttered scene image, is it possible to find the pose of the objects in it? As observed, our method is robust to objects of various shapes, sizes and textures, and also to object-object occlusion.
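
To make the 3D-to-2D alignment idea concrete, here is a minimal sketch under strong simplifying assumptions: a pinhole camera with a known focal length, known point correspondences, and a pose reduced to a single rotation about the vertical axis that is found by brute-force search. None of these choices come from the paper; the function names and numbers are hypothetical.

```python
import math


def rotate_y(point, angle_deg):
    """Rotate a 3D point about the vertical (y) axis by angle_deg degrees."""
    x, y, z = point
    a = math.radians(angle_deg)
    return (x * math.cos(a) + z * math.sin(a), y, -x * math.sin(a) + z * math.cos(a))


def project(point, focal_length, depth_offset):
    """Pinhole projection of a point placed depth_offset in front of the camera."""
    x, y, z = point
    zc = z + depth_offset
    return (focal_length * x / zc, focal_length * y / zc)


def reprojection_error(model_points, image_points, angle_deg, focal_length, depth_offset):
    """Mean 2D distance between projected model points and observed image points."""
    total = 0.0
    for p, (u, v) in zip(model_points, image_points):
        pu, pv = project(rotate_y(p, angle_deg), focal_length, depth_offset)
        total += math.hypot(pu - u, pv - v)
    return total / len(model_points)


def best_azimuth(model_points, image_points, focal_length, depth_offset, step_deg=5):
    """Brute-force search for the azimuth with the lowest reprojection error."""
    return min(
        range(0, 360, step_deg),
        key=lambda a: reprojection_error(model_points, image_points, a, focal_length, depth_offset),
    )


# Toy model and a synthetic observation rendered at 40 degrees (assumed values).
model = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 0.5)]
observed = [project(rotate_y(p, 40), 500.0, 5.0) for p in model]
estimated = best_azimuth(model, observed, 500.0, 5.0, step_deg=5)
print("estimated azimuth:", estimated)
```

In a real cluttered scene the correspondences are not given and the pose has more degrees of freedom, which is why shape features and a staged alignment are needed.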


We build on previous approaches such as Lim et al. [2] and Aubry et al. [3]. Lim et al. use a user-driven set of 2D-to-3D correspondences to extract geometric cues, which refine the correspondence set used to obtain the pose with ICP. Aubry et al., on the other hand, use whitened HoG as discriminative features to select the closest matching image from a set of rendered viewpoints as the final pose. Hence, the final pose is dependent on the sampling of the camera viewpoints. Our goal differs from the above primarily in its assumptions about the objects and their models. We try to find the pose of common tabletop objects that could potentially be low-lying, and the models are assumed to be without any texture, making the alignment problem hard. Our major contributions include: 1) an ensemble of shape features that work well for aligning textureless 3D models, 2) a two-stage alignment scheme that is efficient and accurate, and 3) an extension of the proposed approach to handle articulated objects. We demonstrate our results on a variety of tabletop objects, including transparent ones, and on scene images with occlusion and background clutter. Note that textureless 3D models are used to generalize our proposed method to objects that are very similar in shape and size but vary in texture. Experimental results show that the proposed method outperforms the state-of-the-art method for pose estimation.
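
The dependence on viewpoint sampling noted above can be quantified with a small worked example. The sketch assumes views are rendered at a uniform azimuth step that divides 360 and that the best a view-selection method can do is return the nearest rendered view; the function names are hypothetical.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def nearest_sampled_view(true_azimuth, step_deg):
    """Azimuth of the rendered view closest to the true azimuth."""
    return min(range(0, 360, step_deg), key=lambda v: angular_difference(v, true_azimuth))


def worst_case_error(step_deg):
    """Largest error of nearest-view selection, checked on a 0.5 degree grid."""
    return max(
        angular_difference(nearest_sampled_view(t / 2, step_deg), t / 2)
        for t in range(720)
    )


print(nearest_sampled_view(47, 10))   # closest rendered view to 47 degrees
print(worst_case_error(10))           # half of the sampling step
```

Even with perfect matching, the error can be as large as half the sampling step, so a refinement stage after the initial hypothesis is what allows a finer pose than the rendered views provide.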

Method Overview




We evaluate our proposed method on the BigBird [4] dataset and on our own dataset of common household objects, TableTop. Both BigBird and TableTop contain RGB images of a single object at various orientations, captured in a controlled environment. Additionally, the TableTop dataset contains cluttered scene images with multiple objects and occlusion. The dataset statistics are given below:

BigBird:
For each object, images are captured at an azimuth interval of 3° and an elevation interval of 18°, making a total of 120 images per elevation angle.
Number of objects: 15
Number of images per object: 600
Total: 9000

TableTop:
For each object, images are captured at an azimuth interval of 10° and at elevations of 0°, 15°, 30°, 45°, and 60°, and one image from 90° elevation.
Number of objects: 50
Number of images per object: 181
Total: 9050
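
The image counts above can be checked with a few lines of arithmetic. Two values are assumptions inferred from the stated totals rather than given directly: five elevation angles for the first set (600 images per object divided by 120 per elevation), and an azimuth step of 10° for the second set, which is the step that reproduces 181 images per object.

```python
def images_per_object(azimuth_step_deg, num_elevations, extra_views=0):
    """Number of captured images for one object on a turntable-style rig."""
    views_per_elevation = 360 // azimuth_step_deg
    return views_per_elevation * num_elevations + extra_views


# First set: 3 degree azimuth step, 5 elevation angles (assumed from 600 / 120).
first_per_object = images_per_object(3, 5)
first_total = 15 * first_per_object

# Second set: 10 degree azimuth step (assumed), elevations 0, 15, 30, 45, 60,
# plus a single image from 90 degrees elevation.
second_per_object = images_per_object(10, 5, extra_views=1)
second_total = 50 * second_per_object

print(first_per_object, first_total)
print(second_per_object, second_total)
```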

Results and Comparisons

Qualitative Results:

[Figures: four scene images (top row) and the same scenes with CAD models fitted (bottom row)]

Top row: Input scene images. Bottom row: Scene images with 3D model poses. Models corresponding to the objects are fetched from the repository and superimposed with their correct pose.


Quantitative Results:


[Figures: rough pose error (left) and fine pose error (right)]
Left: Error in the initial hypothesis. Right: Error in the refined pose. As observed, our method refines the pose to within 6° of the ground truth for 80% of the examples.
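
A statistic of this kind, the share of examples whose pose lies within a given angle of the ground truth, can be computed as sketched below. The sketch assumes the pose error is a single angle in degrees with wrap-around at 360°; the angles are invented toy values, not results from the paper.

```python
def angular_error(predicted_deg, ground_truth_deg):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(predicted_deg - ground_truth_deg) % 360
    return min(d, 360 - d)


def fraction_within(predicted, ground_truth, threshold_deg):
    """Fraction of examples whose angular error is at most threshold_deg."""
    errors = [angular_error(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(e <= threshold_deg for e in errors) / len(errors)


# Toy predictions and ground-truth angles (assumed values).
predicted = [10, 95, 359, 200, 44]
ground_truth = [12, 90, 2, 180, 45]
share = fraction_within(predicted, ground_truth, threshold_deg=6)
print(share)
```

The wrap-around matters: a prediction of 359° against a ground truth of 2° is an error of 3°, not 357°.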


Qualitative Comparison:

[Figures: airplane and hourglass test images, each shown with the pose superimposed by S3DC and by the proposed method]

Qualitative comparison of the proposed method with S3DC. As observed from the figure, the proposed method provides a more refined pose than S3DC.

Quantitative Comparison:

Object               S3DC Accuracy (%)   S3DC MPE   Ours Accuracy (%)   Ours MPE
Transparent Bottle   31.11               0.147      83.52               0.0016
Sand Clock           83.33               0.086      86.81               0.033
Pen Stand            91.11               0.093      83.52               0.002
Scissors             6.67                0.4        73.63               0.086
Spectacles Case      44.44               0.016      54.65               0.121

We compare the classification accuracy and mean pose error (MPE) with S3DC, for some of the objects from our dataset with varying complexity in terms of shape, size and material.


Method   BigBird Accuracy (%)   BigBird MPE   TableTop Accuracy (%)   TableTop MPE
S3DC     34.5                   0.013         45.7                    0.044
Ours     49.7                   0.008         67.3                    0.021

We compare the classification accuracy and mean pose error (MPE) with S3DC over the BigBird and TableTop datasets.
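
The dataset-level numbers can be summarized as an accuracy gain in percentage points and a relative reduction in MPE. The snippet below only does arithmetic on the values reported in the table; the summary keys are names chosen here for illustration.

```python
# (accuracy in %, mean pose error) per method, copied from the table above.
results = {
    "BigBird": {"S3DC": (34.5, 0.013), "Ours": (49.7, 0.008)},
    "TableTop": {"S3DC": (45.7, 0.044), "Ours": (67.3, 0.021)},
}


def summarize(table):
    """Accuracy gain (points) and relative MPE reduction (%) of Ours over S3DC."""
    summary = {}
    for dataset, methods in table.items():
        base_acc, base_mpe = methods["S3DC"]
        our_acc, our_mpe = methods["Ours"]
        summary[dataset] = {
            "accuracy_gain_points": round(our_acc - base_acc, 1),
            "mpe_reduction_percent": round(100 * (base_mpe - our_mpe) / base_mpe, 1),
        }
    return summary


summary = summarize(results)
for dataset, values in summary.items():
    print(dataset, values)
```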

Related Publications

  • Sudipto Banerjee, Sanchit Aggarwal, Anoop M. Namboodiri - Fine Pose Estimation of Known Objects in Cluttered Scene Images. Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition, 03-06 Nov 2015, Kuala Lumpur, Malaysia. [PDF]

  • M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic - Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. Computer Vision and Pattern Recognition, 2014. [PDF]
  • J. J. Lim, H. Pirsiavash, and A. Torralba - Parsing IKEA objects: Fine pose estimation. International Conference on Computer Vision, 2013. [PDF]
  • A. Singh, J. Sha, K. S. Narayan, T. Achim, and P. Abbeel - BigBIRD: A large-scale 3D database of object instances. International Conference on Robotics and Automation, 2014. [PDF]

Code and Dataset

  • Code: To be updated soon
  • BigBird dataset can be downloaded from here
  • TableTop dataset can be downloaded here

Associated People