Semi-CNN Architecture for Effective Spatio-Temporal Learning in Action Recognition

This paper introduces a fusion convolutional architecture for efficient learning of spatio-temporal features in video action recognition. Unlike 2D convolutional neural networks (CNNs), 3D CNNs can be applied directly to consecutive frames to extract spatio-temporal features. The aim of this work is to fuse the convolution layers from 2D and 3D CNNs to allow temporal encoding with fewer parameters than 3D CNNs. We adopt transfer learning from pre-trained 2D CNNs for spatial feature extraction, followed by temporal encoding, before connecting to 3D convolution layers at the top of the architecture. We construct our fusion architecture, semi-CNN, based on three popular models: VGG-16, ResNets and DenseNets, and compare the performance with their corresponding 3D models. Our empirical results, evaluated on the action recognition dataset UCF-101, demonstrate that our fusion of 1D, 2D and 3D convolutions outperforms the 3D model of the same depth while using fewer parameters and reducing overfitting. Our semi-CNN architecture achieves an average boost of 16–30% in top-1 accuracy when evaluated on 16-frame input videos.
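To make the 2D-to-1D-to-3D fusion concrete, below is a minimal PyTorch sketch of the kind of architecture the abstract describes: a pre-trained 2D backbone extracts per-frame spatial features (transfer learning), a 1D convolution encodes features across frames, and 3D convolution layers at the top learn joint spatio-temporal features. The backbone choice (ResNet-18), layer widths, and the split point between stages are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of a 2D -> 1D -> 3D fusion ("semi-CNN") for video clips.
# ResNet-18 as the 2D backbone, the layer widths, and the stage split are
# illustrative assumptions, not the configuration from the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SemiCNN(nn.Module):
    def __init__(self, num_classes=101):  # UCF-101 has 101 action classes
        super().__init__()
        # Transfer learning: reuse a 2D CNN pre-trained on ImageNet.
        backbone = resnet18(weights="IMAGENET1K_V1")
        # 2D spatial stage: all layers up to (not including) pooling/fc,
        # applied to every frame independently.
        self.spatial = nn.Sequential(*list(backbone.children())[:-2])
        # 1D temporal encoding: a (3, 1, 1) kernel convolves over time only,
        # at each spatial location.
        self.temporal = nn.Conv3d(512, 512, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # 3D stage at the top for joint spatio-temporal features.
        self.spatiotemporal = nn.Sequential(
            nn.Conv3d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm3d(512),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                 # x: (B, T, 3, H, W), e.g. T = 16 frames
        b, t = x.shape[:2]
        f = self.spatial(x.flatten(0, 1))                       # (B*T, 512, H', W')
        f = f.view(b, t, *f.shape[1:]).permute(0, 2, 1, 3, 4)   # (B, 512, T, H', W')
        f = self.temporal(f)                                    # 1D encoding over time
        f = self.spatiotemporal(f).flatten(1)                   # (B, 512)
        return self.fc(f)

clip = torch.randn(2, 16, 3, 112, 112)   # two 16-frame clips
logits = SemiCNN()(clip)                  # (2, 101)
```

The (3, 1, 1) kernel in the temporal stage acts as a 1D convolution over time at each spatial location, which is one plausible reading of the paper's "fusion of 1D, 2D and 3D convolutions"; because only the temporal and top 3D stages add new weights, the model stays far smaller than a full 3D CNN of the same depth.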


Bibliographic Details
Main Authors: Mei Chee Leong, Dilip K. Prasad, Yong Tsui Lee, Feng Lin
Format: Article
Language: English
Published: MDPI AG, 2020-01-01
Series: Applied Sciences
ISSN: 2076-3417
DOI: 10.3390/app10020557
Subjects: action recognition; spatio-temporal features; convolution network; transfer learning
Online Access: https://www.mdpi.com/2076-3417/10/2/557
Author Affiliations:
Mei Chee Leong: Institute for Media Innovation, Interdisciplinary Graduate School, Nanyang Technological University, Singapore 639798, Singapore
Dilip K. Prasad: Department of Computer Science, UiT The Arctic University of Norway, 9019 Tromsø, Norway
Yong Tsui Lee: School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore 639798, Singapore
Feng Lin: School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore