Self-Supervised Spatiotemporal Representation Learning for Skeleton-Based Human Action Recognition

Skeleton-based human action recognition (HAR) plays an important role in video analytics and recognition systems, with the goal of accurately identifying human actions in videos. However, large-scale action annotation is costly, which has led to the growing interest in HAR research using self-superv...

Bibliographic Details
Container / Database:	IEEE Access
Main Authors:	Jinhyeok Park, Seoung Bum Kim
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Subjects:	Skeleton-based human action recognition; self-supervised learning; skeleton-specific transformation; graph representation learning; non-contrastive learning
Online Access:	https://ieeexplore.ieee.org/document/10945847/

Description
Abstract:	Skeleton-based human action recognition (HAR) plays an important role in video analytics and recognition systems, with the goal of accurately identifying human actions in videos. However, large-scale action annotation is costly, which has led to growing interest in HAR research using self-supervised learning (SSL). While existing SSL studies have focused on extracting global information from skeleton sequences, they often overlook local information that captures the relationships between joints and their subtle movements over time. In this study, we propose an SSL-based HAR framework called coarse-to-fine spatiotemporal representation masking (CFSEM) that effectively learns global, local, and temporal information within skeletal data. CFSEM captures not only global information in the skeleton using body- and part-level masking but also fine-grained movements using hand masking. In addition, temporal-axis shuffling is introduced into the proposed framework to account for temporal patterns inherent in skeleton sequences. To further enhance the learning process, the loss function is redefined using a cross-correlation matrix, introducing a non-contrastive SSL approach. Experiments on various datasets were conducted to evaluate the proposed framework against baseline methods. Experimental results showed the superior performance of CFSEM and highlighted the possibility of training HAR models using less labeled data, offering the potential to effectively develop HAR models for various industries.
ISSN:	2169-3536
