
Audio-Visual Speech Recognition and Synthesis

 


Abhishek Jha

Abstract

Understanding speech in the absence of audio, from the visual perception of lip motion, can aid a variety of computer vision applications. Systems that comprehend 'silent speech' hold promise for low-bandwidth video calling, speech transmission in acoustically noisy environments, and aids for the hearing impaired. While it presents numerous opportunities, modelling lips in silent speech videos by observing the lip motion of a speaker is highly difficult. Although developments in automatic speech recognition (ASR) have yielded better audio speech recognition systems over the last two decades, their performance deteriorates drastically in the presence of noise. This calls for a computer vision solution to the speech understanding problem. In this thesis, we present two solutions for modelling lips in silent speech videos.

In the first part of the thesis, we propose a word-spotting solution for searching spoken keywords in silent lip videos. Our contributions to visual speech recognition are twofold: 1) we develop a pipeline for recognition-free retrieval and show its performance against recognition-based retrieval on a large-scale dataset and on a set of out-of-vocabulary words; 2) we introduce a query-expansion technique using pseudo-relevance feedback and propose a novel re-ranking method based on maximizing the correlation between the spatio-temporal landmarks of the query and those of the top retrieval candidates. The proposed pipeline improves the baseline performance by over 35% on the word-spotting task on one of the largest lip-reading corpora. We demonstrate the robustness of our method through a series of experiments investigating domain invariance and out-of-vocabulary prediction, with a careful analysis of the results. We also present qualitative results showing success and failure cases, and finally demonstrate an application of our method by spotting words in an archaic speech video.

In the second part of our work, we propose a lip-synchronization solution for 'visually redubbing' speech videos in a target language. Current methods of adapting a native speech video to a foreign language either place subtitles in the video, which distracts the viewer, or redub the audio in the target language, which leaves the speaker's lip motion unsynchronized with the redubbed audio and makes the video appear unnatural. We propose two lip-synchronization methods: 1) cross-accent lip synchronization, for dubbing into a different accent of the same language, and 2) cross-language lip synchronization, for speech videos dubbed into a different language. Since the visemes remain the same in cross-accent dubbing, we propose a dynamic programming algorithm to align the visual speech of the original video with the accented speech of the target audio (see the sketch below). In cross-language dubbing the overall linguistics change, hence we propose a lip-synthesis model conditioned on the redubbed audio. Finally, a user study validates our claim of a better viewing experience compared to baseline methods. We demonstrate both methods by visually redubbing Andrew Ng's machine learning tutorial video clips in Indian-accented English and in Hindi, respectively.
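The cross-accent alignment can be viewed as a classic dynamic-programming time warp over two feature sequences. Below is a minimal sketch, assuming frame-level feature vectors for the original visual speech and for the accented target audio, with a Euclidean frame cost; the feature choice and cost are illustrative assumptions, not the thesis's exact algorithm.

```python
import numpy as np

def align_speech(src, tgt):
    """Monotonically align two feature sequences by dynamic programming
    (DTW-style). src: (N, D) visual-speech features from the original
    video; tgt: (M, D) features of the accented target audio.
    Returns a list of (src_frame, tgt_frame) index pairs."""
    n, m = len(src), len(tgt)
    # Pairwise frame cost: Euclidean distance between feature vectors.
    cost = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],       # hold the target frame
                acc[i, j - 1],       # hold the source frame
                acc[i - 1, j - 1],   # advance both sequences
            )
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The recovered path indicates which original lip frames to hold or skip so that the mouth motion tracks the accented audio.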
In the final part of this thesis, we propose an improved method for 2D lip-landmark localization. We investigated current landmark-localization techniques in the facial domain and in human-pose estimation to uncover their shortcomings when adapted to the task of lip-landmark localization. Present state-of-the-art methods in the domain treat lip landmarks as a subset of facial landmarks and hence do not explicitly optimize for them. We propose a new lip-centric loss formulation on the existing stacked-hourglass architecture, which improves the baseline performance; a sketch of this loss appears after the abstract. We use the 300W and 300VW face datasets to evaluate our method and compare it with the baselines.

Overall, in this thesis we examined current methods of lip modelling, investigated their shortcomings and proposed solutions to overcome those challenges. We performed detailed experiments and ablation studies on our proposed methods and reported both success and failure cases. We compared our solutions with the current baselines on challenging datasets, reporting quantitative results and demonstrating qualitative performance. Our proposed solutions improve the baseline performance in their respective domains.
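As an illustration of the lip-centric loss idea, here is a minimal sketch assuming per-landmark heatmap outputs from a stacked-hourglass network; the uniform `lip_weight` and the 68-point 300W convention (lip landmarks at indices 48-67) are assumptions, not the thesis's exact formulation.

```python
import torch

def lip_centric_loss(pred, gt, lip_idx=range(48, 68), lip_weight=4.0):
    """Heatmap MSE loss that up-weights the lip landmarks.

    pred, gt: (B, K, H, W) tensors with one heatmap channel per
    landmark. With lip_weight > 1, the optimizer is explicitly
    pushed to refine the mouth region instead of treating lips
    as just another subset of the face."""
    per_landmark = ((pred - gt) ** 2).mean(dim=(2, 3))       # (B, K)
    weights = torch.ones(pred.shape[1], device=pred.device)  # (K,)
    weights[list(lip_idx)] = lip_weight
    return (per_landmark * weights).mean()
```

In a stacked hourglass, such a loss would typically be applied to every stack's intermediate prediction (intermediate supervision), not only the final output.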

 

Year of completion: April 2019
Advisors: C V Jawahar, Vinay P. Namboodiri

Related Publications

  • Abhishek Jha, Vinay P. Namboodiri and C.V. Jawahar - Spotting Words in Silent Speech Videos: A Retrieval-based Approach, Machine Vision and Applications, 2019. [PDF]

  • Abhishek Jha, Vinay P. Namboodiri and C.V. Jawahar - Word Spotting in Silent Lip Videos, IEEE Winter Conference on Applications of Computer Vision (WACV 2018), Lake Tahoe, CA, USA, 2018. [PDF]


Downloads

thesis

On Compact Deep Neural Networks for Visual Place Recognition, Object Recognition and Visual Localization

 


Soham Saha

Abstract

There has been an immense increase in the use of deep neural networks in recent times, owing to the availability of more data and greater computing power. With their recent success, it has become a trend to use them extensively in real-time applications. However, the size of deep models can make them unusable on memory-constrained devices. In this thesis, we explore several neural-network compression techniques for three tasks: i) visual place recognition, ii) object recognition and iii) visual localization. We explore explicit compression methods for visual place recognition and object recognition, achieved by modifying the learned weight matrices. Furthermore, we look at compression attained through architectural modifications to the network itself, proposing novel training procedures and new loss functions for object recognition and visual localization.

The task of visual place recognition requires us to correctly identify a place given its image, by finding images of the same place in the dataset. Performing this on low-memory devices such as mobile phones and robotic systems is a challenging problem. The state-of-the-art models for this task use deep learning architectures with close to 100 million parameters, taking over 400 MB of memory. This makes them infeasible to deploy on low-memory devices and gives rise to the need to compress them. Hence, we study the effectiveness of explicit model-compression techniques such as trained quantization and pruning on one of the most effective visual place recognition models. We show that a compressed network can be created by starting with a pre-trained model and then fine-tuning it via trained pruning and quantization. Through this training method, the compressed model produces the same mAP as the original uncompressed network. We achieve almost 50% parameter reduction through pruning with no loss in mAP, and 70% reduction with close to 2% mAP reduction, while also performing trained 8-bit quantization. Together with 5-bit quantization, we achieve about 50% parameter reduction by pruning with only about a 3% reduction in mAP. The resulting compressed networks have sizes of around 30 MB and 65 MB, which makes them easily usable on memory-constrained devices.

We next move on to compression through low-rank approximation for the task of image classification. Traditional compression algorithms for deep networks perform low-rank approximation on the learned weight matrices after training has completed. We instead propose to perform low-rank approximation during training itself, making the parameters of the approximated matrix learnable too by using a suitable loss function (a sketch of this factorization follows below). We show that, using our method, we can compress a base model providing 89% accuracy by 10x, with some loss in performance: the compressed model achieves an accuracy of about 84%.
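A minimal sketch of the learn-the-factors idea: the dense weight matrix is replaced by two low-rank factors that are trained end-to-end, rather than factorized after training. The class name, initialization and example rank are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Linear layer whose weight W is parameterized as U @ V (rank r).

    Parameter count drops from out_f * in_f to r * (out_f + in_f);
    both factors remain learnable, so training can compensate for
    the approximation instead of factorizing a frozen matrix."""
    def __init__(self, in_f, out_f, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_f, rank) * 0.01)
        self.V = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_f))

    def forward(self, x):
        # (x @ V^T) @ U^T == x @ (U @ V)^T, but cheaper when rank is small.
        return x @ self.V.t() @ self.U.t() + self.bias

# e.g. a 4096 -> 4096 layer at rank 64: ~16.8M params -> ~0.52M,
# roughly a 32x reduction in that layer.
```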
Next, we focus on developing compressed models for the object recognition task and propose a novel architecture for it. Deep neural networks for image classification typically consist of a convolutional feature extractor followed by a fully connected classifier network, with the predicted and ground-truth labels represented as one-hot vectors. Such a representation assumes that all classes are equally dissimilar; however, classes have visual similarities and often form a hierarchy. We propose an alternate architecture for the classifier network, called the Latent Hierarchy Classifier, which can discover a latent hierarchy of the classes while simultaneously reducing the number of parameters used in the original classifier. We show that, for some of the best-performing architectures on the CIFAR and ImageNet datasets, our proposed classifier and training procedure recover the accuracy while significantly reducing the parameter complexity of the classifier: the number of parameters in the classification layer drops by 98% for CIFAR-100 and by 41% for ImageNet-1K. We also verify that many visually similar classes are grouped together under the learnt hierarchy.

Finally, we address the problem of visual localization, where the task is to predict the camera orientation and pose of a given input scene. We propose an anchor-point classification based solution using single camera images only. Our proposed three-way branching of the feature extractor into an Anchor Point Classifier, a Relative Offset Regressor and an Absolute Regressor (sketched below) achieves <2 m translation localization and <5° pose localization on the Cambridge Landmarks dataset, while obtaining state-of-the-art median-distance orientation localization on all six scenes. Our method not only uses fewer parameters than previous deep learning based methods but also improves on the memory footprint as well as test time over nearest-neighbour based approaches.
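A minimal sketch of the anchor-point formulation, under assumed names and layer sizes: translation is predicted as a classified anchor position plus a regressed offset from it, with a separate absolute regressor for orientation.

```python
import torch
import torch.nn as nn

class AnchorPoseHead(nn.Module):
    """Three-way head on a shared feature extractor: an anchor-point
    classifier, a relative-offset regressor and an absolute regressor
    (here producing an orientation quaternion). A sketch of the idea,
    not the thesis's exact architecture."""
    def __init__(self, feat_dim, anchors):
        super().__init__()
        self.register_buffer("anchors", anchors)  # (A, 2) fixed x-y grid
        num_anchors = anchors.shape[0]
        self.cls = nn.Linear(feat_dim, num_anchors)         # nearest anchor
        self.offset = nn.Linear(feat_dim, num_anchors * 2)  # offset per anchor
        self.absolute = nn.Linear(feat_dim, 4)              # orientation

    def forward(self, feat):
        logits = self.cls(feat)                                  # (B, A)
        offsets = self.offset(feat).view(-1, len(self.anchors), 2)
        best = logits.argmax(dim=1)                              # (B,)
        rows = torch.arange(feat.shape[0], device=feat.device)
        # Translation = chosen anchor position + its predicted offset.
        xy = self.anchors[best] + offsets[rows, best]
        quat = nn.functional.normalize(self.absolute(feat), dim=1)
        return xy, quat, logits
```

The classifier handles the coarse location, so the offset regressor only needs to model small residuals around each anchor.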

 

Year of completion: April 2019
Advisors: C V Jawahar, Girish Varma

Related Publications


Downloads

thesis

Road Topology Extraction from Satellite Images by Knowledge Sharing


Anil Kumar

Abstract

Motivated by the human visual system, machine (computer) vision technology has been revolutionized over the last few decades, making a strong impact on a wide range of applications such as object recognition and face recognition and identification. Despite much encouraging advancement, however, many fields have yet to utilize the full potential of computer vision techniques. One such field is the analysis of satellite images for geo-spatial applications. In the past, building and launching satellites into space was expensive, a big hurdle to acquiring low-cost satellite imagery. With technological innovations, inexpensive satellites can now send terabytes of images of our planet on a daily basis, providing insights into global-scale economic, social and industrial processes. Significant applications of satellite imagery include urban planning, crop-yield forecasting, mapping and change detection. The most obvious application is extracting the topological road network from satellite images, as it plays an important role in planning mobility between geographical locations of interest. In the vision community, the extraction of road topology from satellite images is formulated as a binary segmentation problem. Despite the abundance of satellite imagery, the fundamental hurdle in applying deep-learning-based computer vision algorithms is the unavailability of labeled data, which leads to poor results. Another challenge is the visual ambiguity in identifying roads and their occlusion by various objects, which causes many standard computer vision algorithms to perform poorly and is a major concern.

In this thesis we develop deep-learning-based models and techniques to address the above challenges. In the first part of our work, we attempt road segmentation with less labeled data using existing unsupervised feature-learning techniques. In particular, we use a self-supervised technique to learn visual representations under artificial supervision, followed by fine-tuning the model on a labeled dataset. We use semantic image inpainting as the artificial/auxiliary supervision task: the improvement in road segmentation is directly related to the features the model captures while inpainting the erased regions of the image (see the sketch below). To further enhance feature learning, we propose to inpaint the difficult regions of the image, and we develop a novel adversarial training scheme to learn the mask used for erasing the image. The proposed scheme gradually learns to erase regions that are difficult to inpaint; this increasing difficulty of the inpainting task leads to better road segmentation. Additionally, we study the proposed approach on scene parsing and land classification in satellite images.
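A minimal sketch of the inpainting pretext step, under assumed encoder/decoder modules: regions given by a mask are erased, the network reconstructs them, and the loss is computed only on the erased pixels. In the thesis the mask itself is produced by an adversarially trained network that seeks hard-to-inpaint regions; here it is simply an input.

```python
import torch

def inpainting_pretext_step(encoder, decoder, images, mask, optimizer):
    """One self-supervised training step. images: (B, C, H, W);
    mask: (B, 1, H, W) with 1 marking erased pixels. The encoder is
    later fine-tuned on labeled data for road segmentation."""
    corrupted = images * (1 - mask)              # erase the masked regions
    recon = decoder(encoder(corrupted))
    # Supervise only where pixels were erased: that is where the model
    # must hallucinate structure such as road continuity.
    loss = (((recon - images) ** 2) * mask).sum() / mask.sum().clamp(min=1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```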

     

Year of completion: June 2019
Advisor: C V Jawahar

Related Publications


Downloads

thesis

Towards Scalable Applications for Handwritten Documents


Vijay Rowtula

Abstract

Even in today’s world, a large number of documents are generated as handwritten documents. This is especially true when knowledge or expertise is captured conveniently with the availability of electronic gadgets. Information extraction from handwritten document images has numerous applications, especially in the digitization of archived handwritten documents, the assessment of patient medical records and the automated evaluation of students' handwritten assessments, to mention a few. Document categorization and targeted information extraction from such sources can help in designing better search and retrieval systems for handwritten document images. Extracting information from handwritten medical records written in an ambulance for a doctor's interpretation at the hospital, or reading postal addresses to automate letter sorting, are examples where a document-image workflow helped scale a system with minimal human intervention. In such workflow systems, images flow across subjects who can be in different locations. Our work is motivated by the success of these document-image workflow systems, which were put into practice when handwriting recognition accuracy was unacceptably low. Our goal is to bring scalability to handwritten document processing, enhancing the throughput of analysis by employing the multitude of developments in the document-image space.

In this thesis, we initially present a document-image workflow system that helps scale handwritten student assessments in a typical university setting. We observed that this improves efficiency, since book-keeping time as well as physical paper movement is minimized. An electronic workflow can make anonymization easy, alleviating the fear of bias in many cases. Parallel and distributed assessment by multiple instructors is also straightforward in an electronic workflow system. At the heart of our solution, we have (i) a distributed image-capture module using a mobile phone, (ii) image-processing algorithms that improve quality and readability, and (iii) an image-annotation module that processes the evaluations/feedback as a separate layer.

Further, we extend our work by proposing an approach to detect POS and named-entity tags directly from offline handwritten document images, without explicit character/word recognition. We observed that POS tagging on handwritten text sequences increases the predictability of named entities and also brings a linguistic aspect to handwritten document analysis. As a pre-processing step, the document image is binarized and segmented into word images. The proposed approach, comprising a CNN-LSTM model trained on word-image sequences, produces encouraging results on the challenging IAM dataset (see the sketch below).
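A minimal sketch of a CNN-LSTM tagger of this kind, with illustrative sizes: a small CNN embeds each segmented word image, a bidirectional LSTM reads the word sequence in order, and a linear layer emits a POS (or named-entity) tag per word, with no character-level recognition in between.

```python
import torch
import torch.nn as nn

class WordImageTagger(nn.Module):
    """Tag a sequence of word images in reading order. Layer sizes
    are illustrative assumptions, not the thesis's configuration."""
    def __init__(self, n_tags, emb=128, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, emb))
        self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.tagger = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_imgs):            # (B, T, 1, H, W)
        B, T = word_imgs.shape[:2]
        feats = self.cnn(word_imgs.flatten(0, 1)).view(B, T, -1)
        out, _ = self.lstm(feats)            # (B, T, 2*hidden)
        return self.tagger(out)              # per-word tag logits
```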
Finally, we describe an effective method for automatically evaluating short descriptive handwritten answers from digitized images. Automated evaluation of handwritten answers has been a challenging problem in scaling education systems for many years, and speeding up evaluation remains the major bottleneck for enhancing throughput. Our goal is to assign an evaluation score comparable to human-assigned scores. Our solution is based on the observation that a human evaluator judges the relevance of an answer using a set of keywords and their semantics. Since reliable handwriting recognizers are not yet available, we attempt this problem in the image space, modelling it as a self-supervised, feature-based classification problem that can fine-tune itself for each question without any explicit supervision (a sketch of the keyword-matching idea follows the abstract). We conduct experiments on three different datasets obtained from students; the experiments show that our method performs comparably to human evaluators.

With these works, we attempted to bring state-of-the-art advances in handwritten document analysis and deep learning into scalable applications that can be helpful in the field of education.
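A hedged sketch of keyword-style scoring in the image space: embed the word images of an answer, compare them against embeddings of the expected keywords, and score by the fraction of keywords spotted. The embedding space, the source of the keyword set and the threshold are all assumptions, not the thesis's exact method.

```python
import numpy as np

def score_answer(word_embs, keyword_embs, sim_thresh=0.8):
    """word_embs: (W, D) embeddings of the word images in a student's
    answer; keyword_embs: (K, D) embeddings of the expected keywords
    (e.g. mined per question, since the method is self-supervised).
    Returns the fraction of keywords spotted, in [0, 1], which would
    then be scaled to the question's maximum marks."""
    word_embs = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    keyword_embs = keyword_embs / np.linalg.norm(keyword_embs, axis=1, keepdims=True)
    sims = keyword_embs @ word_embs.T          # cosine similarity matrix
    spotted = sims.max(axis=1) > sim_thresh    # keyword appears somewhere
    return spotted.mean()
```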

       

Year of completion: June 2019
Advisor: C V Jawahar

Related Publications

  • Vijay Rowtula, Subba Reddy Oota and C. V. Jawahar - Towards Automated Evaluation of Handwritten Assessments, The 15th International Conference on Document Analysis and Recognition (ICDAR), 20-25 September 2019, Australia. [PDF]

  • Vijay Rowtula, Praveen Krishnan and C. V. Jawahar - POS Tagging and Named Entity Recognition on Handwritten Documents, ICON, 2018. [PDF]


Downloads

thesis

Analyzing Racket Sports from Broadcast Videos


Anurag Ghosh

Abstract

Sports video data is recorded for nearly every major tournament but remains archived and inaccessible to large-scale data mining and analytics. Sports videos have an inherent temporal structure due to the nature of the sports themselves; tennis, for instance, comprises points, games and sets played between two players or teams. Recent attempts at sports analytics are not fully automatic for finer details, or keep a human in the loop for high-level understanding of the game, and therefore have limited practical application to large-scale data analysis. Many of these applications depend on specialized camera setups and wearable devices, which are costly and unwieldy for players and coaches, especially in resource-constrained environments like India. Utilizing very simple and non-intrusive sensors (like a single camera) along with computer vision models is necessary to build indexing and analytics systems. Such systems can be used to sort through huge swathes of data, help coaches look at interesting video segments quickly, mine player data and even generate automatic reports and insights for a coach to monitor.

Firstly, we demonstrate a score-based indexing approach for broadcast video data. Given a broadcast sports video, we index all the video segments with their scores to create a navigable and searchable match. Although our method is extensible to any sport with scores, we evaluate it on broadcast tennis videos. Our approach temporally segments the rallies in the video and then recognizes the score of each segment, before refining the recognized scores using knowledge of the tennis scoring system (a sketch of this refinement follows the abstract). We finally build an interface to effortlessly retrieve and view the relevant video segments, automatically tagging the segmented rallies with human-accessible tags such as 'fault' and 'deuce'. The efficiency of our approach is demonstrated on broadcast tennis videos from two major tennis tournaments.

Secondly, we propose an end-to-end framework for automatic attribute tagging and analysis of broadcast sports videos. We use commonly available broadcast videos of badminton matches and, unlike previous approaches, do not rely on special camera setups or additional sensors. We propose a method to analyze a large corpus of broadcast videos by segmenting the points played, tracking and recognizing the players in each point, and annotating their respective strokes. Evaluated on 10 Olympic badminton matches with 20 players, we achieve 95.44% point-segmentation accuracy, a 97.38% player-detection score (mAP@0.5), 97.98% player-identification accuracy, and stroke-segmentation edit scores of 80.48%. We further show that the automatically annotated videos alone enable gameplay analysis and inference by computing understandable metrics such as a player's reaction time, speed and footwork around the court.

Lastly, we adapt our proposed framework for tennis to mine spatio-temporal and event data from a large set of broadcast videos, covering all Grand Slam matches played between Roger Federer, Rafael Nadal and Novak Djokovic. Using this data, we demonstrate that we can infer the playing styles and strategies of tennis players. Specifically, we study the evolution of the famous rivalries between Federer, Nadal and Djokovic across time, and we compare and validate our inferences against expert opinions of their playing styles.
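A minimal sketch of refining noisy score reads with knowledge of the tennis scoring system: within a game, the point score can only move along fixed transitions, so a recognized score that is not a legal successor of the previously accepted state can be rejected. The transition table and the rejection policy are illustrative assumptions, not the thesis's exact refinement.

```python
# Legal point-score transitions within a game (server, receiver).
VALID_NEXT = {
    ("0", "0"): [("15", "0"), ("0", "15")],
    ("15", "0"): [("30", "0"), ("15", "15")],
    ("40", "40"): [("Ad", "40"), ("40", "Ad")],
    # ... the remaining transitions of a game fill out the table
}

def refine(scores):
    """Keep each recognized score only if the scoring system allows it
    after the previous accepted score; otherwise treat it as a misread
    and carry the previous state forward."""
    out = [scores[0]]
    for s in scores[1:]:
        prev = out[-1]
        out.append(s if s in VALID_NEXT.get(prev, [s]) else prev)
    return out
```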

       

Year of completion: June 2019
Advisor: C V Jawahar

Related Publications

  • Anurag Ghosh, Suriya Singh and C.V. Jawahar - Towards Structured Analysis of Broadcast Badminton Videos, IEEE Winter Conference on Applications of Computer Vision (WACV 2018), Lake Tahoe, CA, USA, 2018. [PDF]

  • Anurag Ghosh and C. V. Jawahar - SmartTennisTV: Automatic Indexing of Tennis Videos, National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), 2017. [PDF]

  • Anurag Ghosh, Yash Patel, Mohak Sukhwani and C.V. Jawahar - Dynamic Narratives for Heritage Tour, 3rd Workshop on Computer Vision for Art Analysis (VisART), European Conference on Computer Vision (ECCV), 2016. [PDF]


Downloads

thesis
