Skeleton-based Action Recognition in Non-contextual, In-the-wild and Dense Joint Scenarios

Neel Trivedi

Abstract

Human action recognition has use cases across surveillance, robotics, human-object interaction analysis and many other fields, and has consequently gained critical importance and attention in computer vision. Traditionally based entirely on RGB sequences, the action recognition domain has in recent years shifted its focus towards skeleton sequences, owing to the easy availability of skeleton-capturing apparatus and the release of large-scale datasets. Skeleton-based human action recognition, which offers advantages over RGB-based action recognition in terms of privacy, robustness and computational efficiency, is the primary focus of this thesis. Ever since the release of the large-scale skeleton action datasets NTU RGB+D and NTU RGB+D 120, the community has focused solely on developing increasingly complex approaches, ranging from CNNs to GCNs and, more recently, transformers, to achieve the best classification accuracy on these datasets. However, in this race for state-of-the-art performance, the community has turned a blind eye to a major drawback at the data level which bottlenecks even the most sophisticated approaches. This drawback is where we start our explorations in this thesis.
The pose tree provided in the NTU RGB+D datasets contains only 25 joints, of which only 6 (3 for each hand) are finger joints. This is a major drawback, since 3 finger-level joints per hand are not sufficient to distinguish between action categories such as "Thumbs up" and "Thumbs down" or "Make ok sign" and "Make victory sign". To specifically address this bottleneck, we introduce two new pose-based human action datasets, NTU60-X and NTU120-X. Our datasets extend the largest existing action recognition dataset, NTU RGB+D. In addition to the 25 body joints per skeleton in NTU RGB+D, the NTU60-X and NTU120-X datasets include finger and facial joints, enabling a richer skeleton representation. We appropriately modify state-of-the-art approaches to enable training on the introduced datasets. Our results demonstrate the effectiveness of the NTU-X datasets in overcoming the aforementioned bottleneck and improving state-of-the-art performance, both overall and on the previously worst-performing action categories.
Pose-based action recognition is predominantly tackled by approaches that treat the input skeleton in a monolithic fashion, i.e. the joints in the pose tree are processed as a whole. However, such approaches ignore the fact that action categories are often characterized by localized action dynamics involving only small subsets of joints, such as those of the hands (e.g. "Thumbs up") or the legs (e.g. "Kicking"). Although approaches based on part grouping exist, they do not consider each part group within the global pose frame, which causes them to fall short. Further, conventional approaches employ independent modality streams (e.g. joint, bone, joint velocity, bone velocity) and train their network separately on each stream, which massively increases the number of trainable parameters. To address these issues, we introduce PSUMNet, a novel approach for scalable and efficient pose-based action recognition. At the representation level, we propose a global-frame-based part stream approach as opposed to conventional modality-based streams. Within each part stream, the associated data from multiple modalities is unified and consumed by the processing pipeline. Experimentally, PSUMNet achieves state-of-the-art performance on the widely used NTU RGB+D 60/120 datasets and on the dense joint skeleton datasets NTU60-X/120-X. PSUMNet is highly efficient and outperforms competing methods which use 100%-400% more parameters. PSUMNet also generalizes to the SHREC hand gesture dataset with competitive performance. Overall, PSUMNet's scalability, performance and efficiency make it an attractive choice for action recognition and for deployment on compute-restricted embedded and edge devices.
Finally, we conclude this thesis by exploring new and more challenging frontiers under the umbrella of skeleton action recognition, namely "in the wild" skeleton action recognition and "non-contextual" skeleton action recognition. To explore in-the-wild skeleton action recognition, we introduce Skeletics-152, a curated 3D pose dataset derived from the RGB videos included in the larger Kinetics-700 dataset. We further introduce Skeleton-mimetics, a 3D pose dataset derived from the recently introduced non-contextual action dataset Mimetics. By benchmarking and analysing various approaches on these two new datasets, we lay the groundwork for future exploration of these two challenging problems within skeleton action recognition. Overall, in this thesis we draw attention to prevailing drawbacks in the existing skeleton action datasets and introduce extensions of these datasets to counter their shortcomings. We also introduce a novel, efficient and highly reliable skeleton action recognition approach dubbed PSUMNet. Finally, we explore the more challenging tasks of in-the-wild and non-contextual action recognition.
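As an illustration of the modality streams named above (joint, bone, joint velocity, bone velocity), the sketch below derives all four from a single joint sequence and stacks them for one part group. It is a minimal sketch under stated assumptions, not the PSUMNet implementation: the 5-joint skeleton, the PARENTS array, the function names and the choice of channel-wise concatenation as the way to "unify" modalities are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy skeleton with 5 joints. PARENTS[v] is the parent of
# joint v in the pose tree; the root (joint 0) is its own parent.
PARENTS = np.array([0, 0, 1, 1, 3])


def temporal_difference(x):
    """Frame-to-frame difference along time; the first frame is left as zeros."""
    v = np.zeros_like(x)
    v[1:] = x[1:] - x[:-1]
    return v


def modalities(joints):
    """joints: array of shape (T, V, C) = (frames, joints, coordinates)."""
    # Bone vector: each joint minus its parent joint (zero for the root).
    bone = joints - joints[:, PARENTS, :]
    return {
        "joint": joints,
        "bone": bone,
        "joint_velocity": temporal_difference(joints),
        "bone_velocity": temporal_difference(bone),
    }


def unify_part(mods, part_joint_ids):
    """Assumed unification: select a part's joints, concatenate modalities on the channel axis."""
    return np.concatenate(
        [m[:, part_joint_ids, :] for m in mods.values()], axis=-1
    )


rng = np.random.default_rng(0)
seq = rng.normal(size=(4, 5, 3))       # 4 frames, 5 joints, 3D coordinates
mods = modalities(seq)
hand_part = unify_part(mods, [3, 4])   # hypothetical "hand" part group
print(hand_part.shape)                 # (4, 2, 12)
```

In this sketch a single array per part carries all four modalities, in contrast to training one network per modality stream; how the thesis actually combines modalities and defines part groups is not specified in the abstract.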

Year of completion:	September 2022
Advisor:	Ravi Kiran Sarvadevabhatla