MSTFormer: multi-granularity spatial-temporal transformers for 3D human pose estimation