TBRNet: Two-Stream BiLSTM Residual Network for Video Action Recognition

Modeling spatiotemporal representations is one of the most essential yet challenging issues in video action recognition. Existing methods fail to accurately model either the correlations between spatial and temporal features or the global temporal dependencies. Inspired by the two-stream network for video action recognition, we propose an encoder–decoder framework named the Two-Stream Bidirectional Long Short-Term Memory (LSTM) Residual Network (TBRNet), which exploits the interaction between spatiotemporal representations and global temporal dependencies. In the encoding phase, a two-stream architecture based on the proposed Residual Convolutional 3D (Res-C3D) network extracts features, with residual connections inserted between the two pathways; the features are then fused into the encoder's short-term spatiotemporal features. In the decoding phase, these short-term spatiotemporal features are first fed into a temporal attention-based bidirectional LSTM (BiLSTM) network to obtain long-term bidirectional attention-pooling dependencies. These temporal dependencies are then integrated with the short-term spatiotemporal features to obtain global spatiotemporal relationships. On two benchmark datasets, UCF101 and HMDB51, a series of experiments verified the effectiveness of TBRNet, which achieved results competitive with or better than existing state-of-the-art approaches.
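To make the two stages described in the abstract concrete, here is a minimal numpy sketch of the overall idea: fusing two feature pathways, then summarizing the fused sequence with temporal attention pooling. All names, sizes, and the additive fusion are simplifications for illustration; the paper's actual Res-C3D streams, BiLSTM, and learned parameters are replaced by random stand-ins.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_streams(spatial_feats, motion_feats):
    """Simplified stand-in for the encoder's fusion step: merge the
    appearance and motion pathways additively (a residual-style merge)."""
    return spatial_feats + motion_feats

def attention_pool(features, w):
    """Temporal attention pooling: score each timestep, normalize the
    scores with softmax, and take the weighted sum over time."""
    scores = softmax(features @ w)   # (T,) attention weights over timesteps
    return scores @ features         # (d,) pooled sequence representation

rng = np.random.default_rng(0)
T, d = 8, 16                           # 8 timesteps, 16-dim features (toy sizes)
spatial = rng.standard_normal((T, d))  # stand-in for RGB-stream features
motion = rng.standard_normal((T, d))   # stand-in for motion-stream features
w = rng.standard_normal(d)             # stand-in for a learned attention vector

H = fuse_streams(spatial, motion)      # short-term spatiotemporal features
pooled = attention_pool(H, w)          # long-term attention-pooled summary
print(pooled.shape)                    # (16,)
```

In the actual model the pooled vector would come from BiLSTM hidden states rather than raw fused features, and it is further integrated with the short-term features before classification; this sketch only shows the attention-pooling mechanism itself.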

Bibliographic Details
Main Authors: Xiao Wu, Qingge Ji
Format: Article
Language: English
Published: MDPI AG 2020-07-01
Series: Algorithms
Subjects: action recognition; bidirectional long short-term memory; residual connection; temporal attention mechanism; two-stream networks
Online Access: https://www.mdpi.com/1999-4893/13/7/169
id doaj-86f1e0f1164c4a7984b5b06b7aef645f
record_format Article
doi 10.3390/a13070169
affiliation School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China (Xiao Wu; Qingge Ji)
collection DOAJ
language English
format Article
sources DOAJ
author Xiao Wu
Qingge Ji
title TBRNet: Two-Stream BiLSTM Residual Network for Video Action Recognition
publisher MDPI AG
series Algorithms
issn 1999-4893
publishDate 2020-07-01
description Modeling spatiotemporal representations is one of the most essential yet challenging issues in video action recognition. Existing methods fail to accurately model either the correlations between spatial and temporal features or the global temporal dependencies. Inspired by the two-stream network for video action recognition, we propose an encoder–decoder framework named the Two-Stream Bidirectional Long Short-Term Memory (LSTM) Residual Network (TBRNet), which exploits the interaction between spatiotemporal representations and global temporal dependencies. In the encoding phase, a two-stream architecture based on the proposed Residual Convolutional 3D (Res-C3D) network extracts features, with residual connections inserted between the two pathways; the features are then fused into the encoder's short-term spatiotemporal features. In the decoding phase, these short-term spatiotemporal features are first fed into a temporal attention-based bidirectional LSTM (BiLSTM) network to obtain long-term bidirectional attention-pooling dependencies. These temporal dependencies are then integrated with the short-term spatiotemporal features to obtain global spatiotemporal relationships. On two benchmark datasets, UCF101 and HMDB51, a series of experiments verified the effectiveness of TBRNet, which achieved results competitive with or better than existing state-of-the-art approaches.
topic action recognition
bidirectional long short-term memory
residual connection
temporal attention mechanism
two-stream networks
url https://www.mdpi.com/1999-4893/13/7/169