Pose-Based Action Recognition: Novel Frontiers for Fully Supervised Learning and Language-Aided Generalised Zero-Shot Learning
Action recognition is indispensable not only to the umbrella field of computer vision but also to a multitude of allied fields such as video surveillance, human-computer interaction, robotics and human-robot interaction. Action recognition is typically performed over RGB videos; however, in recent years skeleton-based action recognition has also gained considerable traction. Much of this is owed to the development of frugal motion-capture systems, which enabled the curation of large-scale skeleton action datasets. In this thesis, we focus on skeleton-based human action recognition. We begin our explorations with skeleton action recognition in the wild by introducing Skeletics-152, a curated, 3-D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset. We extend our study to out-of-context actions by introducing Skeleton-Mimetics, a dataset derived from the recently introduced Mimetics dataset. We also introduce Metaphorics, a dataset with caption-style annotated YouTube videos of the popular social game Dumb Charades and of interpretative dance performances. We benchmark state-of-the-art models on the NTU-120 dataset and provide a multi-layered assessment of the results. Benchmarking the top performers of NTU-120 on the newly introduced datasets reveals the challenges and domain gap induced by actions in the wild. Overall, our work characterizes the strengths and limitations of existing approaches and datasets, and the introduced datasets enable new frontiers for human action recognition. When moving from traditional supervised recognition to the more challenging zero-shot setting, the language component of the action name becomes important. Focusing on this, we introduce SynSE, a novel syntactically guided generative approach for Zero-Shot Learning (ZSL).
Our end-to-end approach learns progressively refined generative embedding spaces constrained within and across the involved modalities (visual and language). The inter-modal constraints are defined between the action sequence embedding and the embeddings of Parts of Speech (PoS) tagged words in the corresponding action description. We deploy SynSE for the task of skeleton-based action sequence recognition. Our design choices enable SynSE to generalize compositionally, i.e., to recognize sequences whose action descriptions contain words not encountered during training. We also extend our approach to the more challenging Generalized Zero-Shot Learning (GZSL) problem via a confidence-based gating mechanism. We are the first to present zero-shot skeleton action recognition results on the large-scale NTU-60 and NTU-120 skeleton action datasets with multiple splits. Our results demonstrate SynSE's state-of-the-art performance in both the ZSL and GZSL settings compared to strong baselines on the NTU-60 and NTU-120 datasets.

3-D virtual avatars are extensively used in gaming, educational animation and physical-exercise coaching applications. The development of these visually interactive systems relies heavily on how humans perceive the actions performed by the 3-D avatars. To this end, we perform a short user study to gain insights into the recognizability of human actions performed virtually. Our results reveal that actions performed by 3-D avatars are significantly easier to recognize than those performed by 3-D skeletons. Concrete actions, i.e., actions that can only be performed using a fixed set of movements, are recognized more quickly and accurately than abstract actions. Overall, in this thesis we study various new frontiers in skeleton action recognition by means of novel datasets and tasks.
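The idea behind confidence-based gating for GZSL can be sketched as follows: a sample is routed to the seen-class classifier when that classifier is sufficiently confident, and to the zero-shot classifier otherwise. This is a minimal illustration only, assuming a max-softmax confidence score as the gate signal; the function names and threshold are hypothetical and do not reproduce SynSE's actual gating module.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_predict(seen_logits, unseen_logits, threshold=0.9):
    """Route each sample to the seen or unseen label space by confidence.

    seen_logits:   (N, num_seen_classes) scores from the seen-class classifier
    unseen_logits: (N, num_unseen_classes) scores from the zero-shot classifier
    threshold:     illustrative confidence cut-off for the gate
    """
    seen_probs = softmax(seen_logits)
    unseen_probs = softmax(unseen_logits)
    preds = []
    for sp, up in zip(seen_probs, unseen_probs):
        if sp.max() >= threshold:
            # confident: treat the sample as a seen class
            preds.append(("seen", int(sp.argmax())))
        else:
            # uncertain: defer to the zero-shot (unseen-class) classifier
            preds.append(("unseen", int(up.argmax())))
    return preds

# Toy example: the first sample is confidently a seen class, the second is not.
seen = np.array([[5.0, 0.1, 0.2], [1.0, 1.1, 0.9]])
unseen = np.array([[0.3, 2.0], [2.5, 0.1]])
print(gated_predict(seen, unseen))  # [('seen', 0), ('unseen', 0)]
```

The gate makes the seen/unseen decision explicit, so the seen-class and zero-shot classifiers can each operate in their own label space instead of competing in a joint softmax, which is the usual motivation for gating in GZSL.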
We move from unimodal approaches to a multimodal setup by incorporating language in a unique, syntactically aware fashion, with the hope of applying similar ideas to more challenging problems such as skeleton action generation.
Year of completion:
Ravi Kiran Sarvadevabhatla