Robust Human Action Modelling

Lei Wang

Research output: Thesis › Doctoral thesis

Abstract

Human action recognition is currently one of the most active research areas in computer vision. Various studies indicate that recognition performance depends heavily on the type of features extracted and on how actions are represented. We revive old-fashioned handcrafted video representations for action recognition, e.g., IDT-based BoW/FV representations, and put new life into these techniques via a CNN-based hallucination step. We also design and hallucinate two costly but powerful descriptors: one leveraging four popular object detectors applied to training videos, and the other leveraging image- and video-level saliency detectors. These hallucination-based models build on the concept of self-supervision: they take RGB frames as input and learn to predict both action concepts and the auxiliary descriptors, which leads to state-of-the-art performance for video-based action recognition.

For skeleton-based action recognition, inspired by Dynamic Time Warping (DTW) and its differentiable variant soft-DTW for matching pairs of sequences, we (i) introduce uncertainty-DTW, dubbed uDTW, which takes into account the uncertainty of frame-wise (or block-wise) features by selecting the path that maximizes the likelihood, i.e., a Maximum Likelihood Estimation (MLE) criterion, and (ii) propose an advanced variant of DTW that jointly models each smooth path between the query and support frames of human skeleton sequences, achieving the best alignment simultaneously in the temporal and simulated camera-viewpoint spaces for end-to-end learning under limited few-shot training data. Both alignment methods are applied to few-shot skeletal action recognition.

As human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics, we (i) propose tensor representations that compactly capture such higher-order relationships between visual features for action recognition, and (ii) form hypergraphs whose hyper-edges between graph nodes capture higher-order motion patterns of groups of body joints; embeddings of hyper-edges of different orders are then fused through our Multi-order Multi-mode Transformer (3Mformer), which achieves joint-mode attention on joint-mode tokens. These two models yield state-of-the-art results against GCN-, transformer- and existing hypergraph-based counterparts for skeletal action recognition.
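As a concrete illustration of the hallucination idea, the sketch below is a minimal PyTorch toy, not the thesis implementation; all layer sizes and names are assumptions. It shows the core training signal: a shared backbone over RGB inputs feeds one head that predicts action logits and another that regresses a precomputed auxiliary descriptor such as a BoW/FV vector, and the two losses are summed.

```python
# Minimal sketch of hallucination-style self-supervision (illustrative only).
import torch
import torch.nn as nn

class HallucinationNet(nn.Module):
    def __init__(self, feat_dim=512, num_classes=60, desc_dim=1024):
        super().__init__()
        # Stand-in for a CNN video backbone over RGB frames.
        self.backbone = nn.Sequential(nn.Linear(2048, feat_dim), nn.ReLU())
        self.cls_head = nn.Linear(feat_dim, num_classes)   # action concepts
        self.hall_head = nn.Linear(feat_dim, desc_dim)     # hallucinated descriptor

    def forward(self, rgb_feat):
        h = self.backbone(rgb_feat)
        return self.cls_head(h), self.hall_head(h)

model = HallucinationNet()
rgb_feat = torch.randn(8, 2048)            # pooled RGB features (dummy batch)
labels = torch.randint(0, 60, (8,))        # action labels
bow_fv = torch.randn(8, 1024)              # precomputed BoW/FV targets (dummy)
logits, hall = model(rgb_feat)
# Joint supervision: classify the action AND regress the costly descriptor.
loss = nn.functional.cross_entropy(logits, labels) + nn.functional.mse_loss(hall, bow_fv)
loss.backward()
```

At test time only the RGB stream is needed; the costly descriptor is "hallucinated" by the regression head rather than recomputed.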
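To make the uDTW idea concrete, here is a minimal sketch (assuming, purely for illustration, that per-pair variances sigma2 come from some auxiliary network and are given as input): the per-step cost is a Gaussian negative log-likelihood, so the minimum-cost warping path is exactly the maximum-likelihood alignment.

```python
# Minimal sketch of an uncertainty-aware DTW (illustrative, not the thesis code).
import numpy as np

def udtw_sketch(X, Y, sigma2):
    """X: (n, d) query features, Y: (m, d) support features,
    sigma2: (n, m) predicted per-pair variances (assumed given)."""
    n, m, d = len(X), len(Y), X.shape[1]
    # Gaussian NLL cost (constants dropped): squared distance scaled by the
    # variance plus a log-variance penalty that discourages inflating sigma.
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)        # (n, m)
    cost = sq / (2.0 * sigma2) + 0.5 * d * np.log(sigma2)
    # Standard DTW dynamic program over the NLL costs.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]   # total NLL of the best (maximum-likelihood) path
```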
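The tensor-representation idea can be illustrated with a toy second-order example (the function names and the eigenvalue power-normalization step are illustrative assumptions, not the thesis pipeline): averaging outer products of per-frame features yields a compact matrix capturing co-occurrences of feature dimensions, and power-normalizing its spectrum is a common way to keep dominant patterns from overwhelming rarer ones.

```python
# Toy second-order tensor representation with spectral power normalization.
import numpy as np

def second_order_rep(feats):
    """feats: (T, d) per-frame features -> (d, d) average of outer products."""
    return feats.T @ feats / feats.shape[0]

def epn(M, gamma=0.5):
    """Eigenvalue power normalization: raise the spectrum of the symmetric
    matrix M to a power gamma in (0, 1]."""
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0) ** gamma) @ V.T

feats = np.random.randn(30, 64)       # 30 frames, 64-dim features (dummy)
rep = epn(second_order_rep(feats))    # compact higher-order descriptor
```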
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution:
  • The Australian National University
Supervisors/Advisors:
  • Koniusz, Piotr, Supervisor, External person
Award date: 12 Dec 2023
Publication status: Published - Nov 2023
