Learning Deep and Compact Models for Gesture Recognition

Koustav Mullick


The goal of gesture recognition is to interpret human gestures, which can originate from any bodily motion but are mainly confined to the face or hands, so that a person can interact with a computer without physically touching it. It can be seen as a way for computers to begin to understand human body language, thus building a richer bridge between machines and humans. Many approaches have used cameras and computer vision algorithms to interpret sign language and to identify and recognize posture, gait, proxemics, and human behaviors. However, effective gesture detection and classification can be quite challenging. Firstly, there can be wide variation in the way gestures are performed. A model needs to be generic and robust enough to handle variations in surrounding conditions, appearance, noise, and the individuals performing the gestures. Secondly, developing a model that can give real-time predictions and run on low-power devices with limited memory and processing capacity is another challenge. Since deep learning models tend to have a large number of parameters, their size can prevent them from fitting on a mobile device and also makes it difficult to use them for real-time inference. In this thesis we try to address both of these difficulties. We propose an end-to-end trainable model capable of learning both the spatial and temporal features present in a gesture video directly from the raw video frames. This is achieved by combining the strengths of 3D-Convolutional Neural Networks and the Long Short Term Memory variant of Recurrent Neural Networks. Further, we explore ways to reduce the parameter space of such models without compromising much on performance. In particular, we look at two ways of obtaining compact models with fewer parameters.
These are learning smaller models using the idea of knowledge distillation, and reducing the size of large models by weight pruning. Our first contribution is a joint, end-to-end trainable model combining a 3D-Convolutional Neural Network and a Long Short Term Memory network. 3D-Convolutional Neural Networks preserve both spatial and temporal information across layers and can identify patterns over short durations. However, their inputs need to be of fixed size, which may not always hold for videos. Long Short Term Memory networks, in contrast, have no difficulty preserving information over longer durations and can also work with variable-length input sequences. However, they do not preserve patterns, and hence work better when fed features that already encode some spatio-temporal information rather than raw pixels. The joint model leverages the advantages of both. We verify this experimentally: our joint model outperforms the individual baseline models. Additionally, the components can be pre-trained and later fine-tuned in a complete end-to-end fashion to further boost the network’s capacity to capture information. Using the proposed model, we obtain near state-of-the-art results on the ChaLearn-2014 dataset for sign language recognition from videos, with a much simpler model and training mechanism than the best model. In our second contribution, we look into ways to learn compact models that enable real-time inference on hand-held devices where power and memory are constrained. To this end, we distill, or transfer, knowledge from a larger teacher network to a smaller student network. Without teacher supervision, the student network did not have enough capacity to perform well using class labels alone. We demonstrate this on the same ChaLearn-2014 dataset.
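The teacher-to-student transfer described above can be illustrated with a minimal sketch. The abstract does not state which distillation objective the thesis uses, so the code below assumes one common formulation: a temperature-softened cross-entropy between teacher and student outputs, blended with the usual hard-label cross-entropy. The function names and the `temperature` and `alpha` values are hypothetical choices for illustration only.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def softmax(logits, temperature=1.0):
    """Softmax computed from the stable log-softmax above."""
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Blend of a soft-target term and a hard-label term (assumed formulation).

    temperature and alpha are illustrative values, not taken from the thesis.
    """
    # Soft-target term: cross-entropy between the softened teacher
    # distribution and the softened student distribution.
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = log_softmax(student_logits, temperature)
    soft = -sum(t * s for t, s in zip(teacher_probs, student_log_probs))

    # Hard-label term: ordinary cross-entropy against the class label.
    hard = -log_softmax(student_logits)[label]

    # The temperature**2 factor keeps the soft term's gradient scale
    # comparable to the hard term when the temperature changes.
    return alpha * (temperature ** 2) * soft + (1.0 - alpha) * hard


if __name__ == "__main__":
    teacher = [5.0, 1.0, 0.0]
    print(distillation_loss([5.0, 1.0, 0.0], teacher, label=0))
    print(distillation_loss([0.0, 1.0, 5.0], teacher, label=0))
```

In this sketch a student whose logits agree with the teacher receives a lower loss than one that disagrees, which is the signal a small student network would get in addition to the class labels.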
To the best of our knowledge, this is the first work to explore knowledge distillation from a teacher to a student network for a video classification task. We also show that training networks with the Adam optimizer, combined with weight decay, helps to obtain sparser models through weight pruning. Training with Adam drives many weights to very low values by penalizing high weight values and adjusting the learning rate accordingly. Removing these low-valued weights yields sparser models than those trained with SGD (also with weight decay). Experimental results on both the gesture recognition task and an image classification task on the CIFAR dataset validate these findings.
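As a rough illustration of removing low-valued weights, the sketch below zeroes every weight whose magnitude falls under a threshold and reports the resulting sparsity. The abstract does not give the actual pruning criterion or threshold, so `prune_by_magnitude`, the threshold value, and the example weights are assumptions made purely for illustration.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out weights with absolute value below threshold.

    Returns the pruned weights and the fraction of weights that are zero.
    The threshold is a hypothetical knob, not a value from the thesis.
    """
    pruned = [0.0 if abs(w) < threshold else w for w in weights]
    sparsity = sum(1 for w in pruned if w == 0.0) / len(pruned)
    return pruned, sparsity


if __name__ == "__main__":
    # Made-up weights standing in for one layer of a trained network.
    layer = [0.5, -0.001, 0.02, -0.7, 0.0004]
    pruned, sparsity = prune_by_magnitude(layer, threshold=0.01)
    print(pruned, sparsity)
```

The more weights an optimizer drives close to zero during training, the larger the fraction such a threshold removes, which is the effect the abstract attributes to Adam with weight decay.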

Year of completion: July 2018
Advisor: Anoop M Namboodiri

Related Publications