Skeleton Feature Fusion Based on Multi-Stream LSTM for Action Recognition

Human action recognition from skeleton sequences has attracted considerable attention in the computer vision community. Long short-term memory (LSTM) networks have shown promising performance on this problem, owing to their strength in modeling the dependencies and temporal dynamics of sequential data...

Bibliographic Details
Main Authors: Lei Wang, Xu Zhao, Yuncai Liu
Format: Article
Language: English
Published: IEEE, 2018-01-01
Series: IEEE Access
Subjects: Action recognition; long short term memory network; skeleton feature fusion
Online Access: https://ieeexplore.ieee.org/document/8463451/
DOI: 10.1109/ACCESS.2018.2869751
ISSN: 2169-3536
Published in: IEEE Access, vol. 6, pp. 50788-50800, 2018 (article number 8463451)
Author affiliations:
Lei Wang (ORCID: 0000-0001-5873-569X), School of Information Science and Engineering, Shandong University, Jinan, China
Xu Zhao (ORCID: 0000-0002-8176-623X), Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
Yuncai Liu, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
Keywords: Action recognition; long short term memory network; skeleton feature fusion
Full description:
Human action recognition from skeleton sequences has attracted considerable attention in the computer vision community. Long short-term memory (LSTM) networks have shown promising performance on this problem, owing to their strength in modeling the dependencies and temporal dynamics of sequential data. However, the original LSTM struggles to capture the dynamics of an entire sequence when the input feature at each time step is only a simple combination of the raw skeleton data. In this paper, we present a fusion model that makes full use of the skeleton data through a multi-stream LSTM for action recognition. In each stream of the model, the skeleton features fed to the LSTM are extracted over a different time duration, yielding the single-frame feature, the short-term feature, and the long-term feature, respectively. The single-frame feature represents the static pose and is converted directly from the joint coordinates. The short-term feature represents the skeleton kinematics and is extracted from a short time window. The long-term feature represents the mutuality of the joints over the course of the action and is extracted from a longer time window. All of these features are modeled by LSTMs, and the final states of the LSTM streams are fused to predict the underlying action. The proposed model makes better use of the skeleton dynamics than the standard LSTM model. Experimental results on two benchmark skeleton data sets, the NTU RGB+D data set and the SBU Interaction data set, show that the proposed approach achieves strong performance.
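The description above outlines three per-frame feature streams (single-frame, short-term, long-term) that feed parallel LSTMs. As a rough illustration only, the following NumPy sketch shows one way such streams could be computed; the concrete definitions (displacement over a short window for the short-term feature, windowed pairwise joint distances for the long-term feature), the window sizes, and the function name are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def extract_streams(skeleton, short_w=2, long_w=8):
    """Illustrative (assumed) definitions of the three feature streams:
      - single-frame: raw joint coordinates of the current frame (static pose)
      - short-term:   joint displacement w.r.t. a frame short_w steps back
                      (skeleton kinematics)
      - long-term:    pairwise joint distances averaged over a longer window
                      (joint mutuality)
    skeleton: array of shape (T, J, 3) -- T frames, J joints, xyz coordinates.
    Returns one feature matrix per stream, each with T rows.
    """
    T, J, _ = skeleton.shape

    # single-frame stream: flatten the joint coordinates of each frame
    single = skeleton.reshape(T, J * 3)                        # (T, 3J)

    # short-term stream: displacement over a short window; the first
    # short_w frames are compared against frame 0 (zero padding at the start)
    prev = np.concatenate([skeleton[:1].repeat(short_w, 0),
                           skeleton[:-short_w]])
    short = (skeleton - prev).reshape(T, J * 3)                # (T, 3J)

    # long-term stream: upper-triangular pairwise joint distances,
    # averaged over the preceding long_w frames
    diffs = skeleton[:, :, None, :] - skeleton[:, None, :, :]  # (T, J, J, 3)
    dists = np.linalg.norm(diffs, axis=-1)                     # (T, J, J)
    iu = np.triu_indices(J, k=1)
    pair = dists[:, iu[0], iu[1]]                              # (T, J*(J-1)/2)
    longf = np.stack([pair[max(0, t - long_w + 1): t + 1].mean(axis=0)
                      for t in range(T)])
    return single, short, longf
```

In the model described above, each stream would then be fed to its own LSTM, and the final hidden states of the streams would be fused (e.g. concatenated) to predict the action class.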