Geometric + Kinematic Priors and Part-based Graph Convolutional Network for Skeleton-based Human Action Recognition
Videos pervade the media archive, providing a rich and vast source of information from which machines can learn to understand such data and predict attributes useful to technologies that improve human lives. One of the most significant parts of a computer vision system is the comprehension of human actions from visual sequences, a hallmark of machine intelligence. Human action recognition is important because it underpins several applications built around recognizing what people are doing. Understanding actions from monocular sequences has been studied extensively, while comprehending human actions from skeleton videos has developed only recently. We propose action recognition frameworks that use the skeletal data (viz. 3D locations of some joints) of the human body to learn the spatio-temporal representations necessary for recognizing actions. Information about human actions varies along two integral dimensions, space and time. The spatial representation of a human action aims to encapsulate the configuration of the human body essential to the action, while the temporal representation aims to capture the evolution of such configurations over time. To this end, we propose to use geometric relations between human skeleton joints to discern the body pose relevant to the action, and physically inspired kinematic quantities to understand the temporal evolution of body pose. Spatio-temporal understanding of human actions is thus conceived as the comprehension of geometric and kinematic information with the help of machine learning frameworks. Using a representation that combines geometric and kinematic features, we recognize human actions from skeleton videos (S-videos) using such frameworks.
We first present a non-parametric approach for temporal sub-segmentation of trimmed action videos using the angular momentum trajectory of the skeletal pose sequence. Systematic sampling of the resulting segments yields a meaningful summarization of the pose sequence. To represent the summarized pose sequences, we compute descriptors that capture geometric and kinematic statistics, encoded as histograms spread across a periodic range of orientations, and feed them to a kernelized classifier for recognizing actions. This framework demonstrates the effect of using geometric and kinematic properties of human pose sequences, which are important to spatio-temporal modeling of the actions. However, a downside of this framework is its inability to scale with the availability of large amounts of visual data.
To mitigate this drawback, we next present geometric deep learning frameworks, specifically graph convolutional networks, for the same task. Representing the human skeleton as a sparse spatial graph is intuitive and gives a structured form that lies on graph manifolds. A human action video hence results in a spatio-temporal graph, and graph convolutions facilitate the learning of a spatio-temporal descriptor for the action. Inspired by the success of Deformable Part-based Models (DPMs) for the task of object understanding from images, we propose a part-based graph convolutional network (PB-GCN) that operates on a human skeletal graph divided into parts. Building on the insights gained from the success of geometric and kinematic features, we propose to use relative coordinates and temporal displacements of the 3D joint coordinates as features at the vertices of the skeletal graph. With these signals, the performance of graph convolutional networks is further boosted to attain state-of-the-art performance among several action recognition systems using skeleton videos.
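The vertex signals described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the choice of joint 0 as the reference joint, and the zero displacement assigned to the first frame are all assumptions made here for illustration. The sketch takes a sequence of frames, each a list of 3D joint positions, and builds a 6-dimensional feature per joint per frame by concatenating relative coordinates (geometric) with temporal displacements (kinematic).

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]
Frame = List[Joint]


def relative_coordinates(frame: Frame, root: int = 0) -> Frame:
    """Express every joint relative to a reference joint.

    The reference joint index `root` is an assumption of this sketch.
    """
    rx, ry, rz = frame[root]
    return [(x - rx, y - ry, z - rz) for x, y, z in frame]


def temporal_displacements(frames: List[Frame]) -> List[Frame]:
    """Per-joint displacement between consecutive frames.

    The first frame has no predecessor, so its displacement is set to zero
    (an assumption, chosen so the output has one entry per input frame).
    """
    if not frames:
        return []
    out: List[Frame] = [[(0.0, 0.0, 0.0) for _ in frames[0]]]
    for prev, curr in zip(frames, frames[1:]):
        out.append([
            (cx - px, cy - py, cz - pz)
            for (px, py, pz), (cx, cy, cz) in zip(prev, curr)
        ])
    return out


def vertex_features(frames: List[Frame], root: int = 0) -> List[List[Tuple[float, ...]]]:
    """Concatenate geometric and kinematic signals into a 6-D vector per joint per frame."""
    disp = temporal_displacements(frames)
    return [
        [rel + d for rel, d in zip(relative_coordinates(frame, root), disp_frame)]
        for frame, disp_frame in zip(frames, disp)
    ]


if __name__ == "__main__":
    # Toy sequence: 2 frames, 3 joints each (values are made up).
    toy = [
        [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
        [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 1.0, 1.0)],
    ]
    for t, frame_feats in enumerate(vertex_features(toy)):
        print(t, frame_feats)
```

In a graph convolutional setting, each 6-D tuple would be attached to the corresponding vertex of the spatio-temporal skeletal graph; how the graph is divided into parts is left out of this sketch.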
In this thesis, we closely examine the development of the idea of using geometry and kinematics, the transition to geometric deep learning frameworks, and the design of a PB-GCN with geometric + kinematic signals at the vertices, for the task of human action recognition from skeleton videos.
Year of completion: March 2019
Advisor: P J Narayanan