MSTFormer: multi-granularity spatial-temporal transformers for 3D human pose estimation


Bibliographic Details
Published in: Journal of King Saud University: Computer and Information Sciences
Main Authors: Hao Lin, Sheng Xu, Chengyue Su
Format: Article
Language: English
Published: Springer, 2025-04-01
Online Access: https://doi.org/10.1007/s44443-025-00023-4
Description
Summary: The 2D-to-3D lifting approach based on multi-granularity methods effectively captures spatial-temporal features at various scales. Existing multi-granularity methods primarily focus on extracting joint features through graph-based approaches, which abstract information at different levels. However, these methods often overlook the structured information inherent in skeleton sequences, such as global connectivity, continuous motion trajectories, and temporal context relationships. To address these limitations, we propose a novel method, Multi-granularity Spatial-Temporal Transformers (MSTFormer), for spatial-temporal feature extraction and fusion that leverages the structured information in skeleton sequences. First, the Multi-granularity Spatial Transformer Module constructs hierarchical feature representations of joints, bones, and limbs using Spatial-Temporal Pooling. The Multi-level Spatial Transformer Encoder then extracts the spatial structured information of the skeleton. Next, the Multi-granularity Temporal Transformer Module utilizes the Attention-Enhanced Temporal Transformer Encoder to model the temporal context relationships of the multi-granularity spatial features. Finally, the Multi-granularity Feature Fusion Module integrates these spatial-temporal features, generating accurate 3D pose representations. We introduce a new Multi-granularity Loss Function to align and balance the multi-granularity representations across different levels. Experimental results demonstrate that MSTFormer outperforms state-of-the-art methods on the Human3.6M and HumanEva-I datasets, achieving superior performance with fewer parameters.
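To make the multi-granularity idea concrete, the sketch below pools a sequence of 3D joint positions into coarser bone- and limb-level representations and computes a weighted loss across the three granularities. This is a minimal illustration, not the paper's implementation: the joint-to-bone/limb groupings (loosely following a 17-joint Human3.6M-style skeleton) and the loss weights are assumptions, and the paper's transformer encoders are omitted entirely.

```python
# Illustrative sketch of multi-granularity pooling and a multi-granularity
# loss. Groupings and weights are assumptions, not taken from the paper.
import numpy as np

# Assumed bone groups: pairs of connected joints on a 17-joint skeleton.
BONE_GROUPS = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6),
               (0, 7), (7, 8), (8, 9), (9, 10), (8, 11), (11, 12),
               (12, 13), (8, 14), (14, 15), (15, 16)]
# Assumed limb groups: larger body parts (legs, arms, torso/head).
LIMB_GROUPS = [(1, 2, 3), (4, 5, 6), (11, 12, 13), (14, 15, 16),
               (0, 7, 8, 9, 10)]

def pool_granularity(joints, groups):
    """Average joint coordinates within each group (a simple stand-in for
    the paper's Spatial-Temporal Pooling).

    joints: array of shape (T, J, 3); returns (T, len(groups), 3).
    """
    return np.stack([joints[:, list(g), :].mean(axis=1) for g in groups],
                    axis=1)

def mpjpe(pred, gt):
    """Mean per-joint position error: L2 distance per joint, averaged."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def multi_granularity_loss(pred, gt, weights=(1.0, 0.5, 0.25)):
    """Weighted MPJPE at joint, bone, and limb granularity.

    The weights balance the levels; the values here are illustrative.
    """
    w_joint, w_bone, w_limb = weights
    loss = w_joint * mpjpe(pred, gt)
    loss += w_bone * mpjpe(pool_granularity(pred, BONE_GROUPS),
                           pool_granularity(gt, BONE_GROUPS))
    loss += w_limb * mpjpe(pool_granularity(pred, LIMB_GROUPS),
                           pool_granularity(gt, LIMB_GROUPS))
    return loss
```

Because coarser granularities average out per-joint noise, the bone and limb terms emphasize agreement in overall limb placement rather than individual joint jitter, which is one plausible reading of why the paper balances levels in its loss.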
ISSN: 1319-1578, 2213-1248