Human head pose and emotion analysis

Aryaman Gupta


Scene analysis has been a topic of great interest in computer vision. Humans are the most important and most complex subjects involved in scene analysis; they exhibit many forms of expression and behaviour when interacting with their environment. These interactions have been studied for a long time, and various challenges and tasks have been identified to interpret them. We focus on two tasks in particular: head pose estimation and emotion recognition. Head pose is an important means of non-verbal human communication and thus a crucial element in understanding how humans interact with their environment. Head pose estimation allows a robot to estimate an individual's region of focus of attention. It requires learning a model that computes the intrinsic Euler angles of the pose (yaw, pitch, roll) from an input image of the human face. Annotating ground-truth head pose angles for images in the wild is difficult and requires ad-hoc fitting procedures, which provide only coarse and approximate annotations. This highlights the need for approaches that can train on data captured in a controlled environment and generalize to images in the wild (with varying facial appearance and illumination). Most present-day deep learning approaches, which learn a regression function directly on the input images, fail to do so. To this end, we propose to use a higher-level representation to regress the head pose with deep learning architectures. More specifically, we use uncertainty maps in the form of 2D soft localization heatmap images over five facial keypoints, namely the left ear, right ear, left eye, right eye and nose, and pass them through a convolutional neural network to regress the head pose. We show head pose estimation results on two challenging benchmarks, BIWI and AFLW, and our approach surpasses the state of the art on both datasets. We also propose a synthetically generated dataset for head pose estimation.
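The soft localization heatmap representation described above can be sketched as a per-keypoint 2D Gaussian map stacked into a multi-channel input for the regression network. This is a minimal illustration only; the image size, keypoint coordinates and Gaussian width below are assumptions, not the configuration actually used in the thesis:

```python
import numpy as np

def keypoint_heatmap(h, w, cx, cy, sigma=5.0):
    """2D Gaussian 'soft localization' heatmap centred on a keypoint (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# Hypothetical detections for the five keypoints (left ear, right ear,
# left eye, right eye, nose) in a 64x64 face crop -- illustrative values.
keypoints = {"left_ear": (10, 34), "right_ear": (54, 34),
             "left_eye": (22, 26), "right_eye": (42, 26), "nose": (32, 38)}

# One heatmap per keypoint, stacked into a 5-channel image; a CNN would
# take this stack as input and regress the three angles (yaw, pitch, roll).
maps = np.stack([keypoint_heatmap(64, 64, x, y) for x, y in keypoints.values()])
print(maps.shape)  # → (5, 64, 64)
```

Because the network sees only keypoint uncertainty maps rather than raw pixels, the input is invariant to facial appearance and illumination, which is the intuition behind training on controlled data and generalizing to the wild.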
Emotions are fundamental to human lives and decision-making. Detecting human emotion can help in understanding a person's mood, intent or choice of action. Accurately recognizing emotions from images or video is not easy even for humans, and it is even more challenging for machines, since humans express their emotions in different forms and there are no clear temporal boundaries between emotions. Facial expression recognition has remained a challenging and interesting problem in computer vision. Despite the various methods developed for facial expression recognition, existing approaches lack generalizability when applied to unseen images or to images captured in the wild (i.e. their results are not significant). We propose using soft localization heatmap images of facial action units for facial expression recognition. To account for the lack of a large, well-labelled dataset, we propose a method for automated spectrogram annotation in which two modalities used by humans to express emotion (visual and textual) label a third modality (speech) for emotion recognition.
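The cross-modal annotation idea can be illustrated with a minimal agreement-based sketch: when independent predictions from the visual and textual modalities agree on a clip's emotion, that label is transferred to the clip's speech spectrogram. The clip names, labels and the simple agreement rule here are illustrative assumptions; the thesis's actual annotation procedure may differ:

```python
# Hypothetical clip-level predictions from a visual model and a text
# (transcript) model; names and labels are made up for illustration.
visual_preds = {"clip_01": "happy", "clip_02": "angry", "clip_03": "sad"}
text_preds   = {"clip_01": "happy", "clip_02": "sad",   "clip_03": "sad"}

def agreement_labels(visual, textual):
    """Assign a label to a speech spectrogram only when both modalities agree."""
    return {clip: emo for clip, emo in visual.items()
            if textual.get(clip) == emo}

labels = agreement_labels(visual_preds, text_preds)
print(labels)  # → {'clip_01': 'happy', 'clip_03': 'sad'}
```

Clips where the modalities disagree (clip_02 above) are simply left unlabelled, trading coverage for label reliability.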

Year of completion: March 2021
Advisor: Vineet Gandhi

Related Publications