Semi-CNN Architecture for Effective Spatio-Temporal Learning in Action Recognition

This paper introduces a fusion convolutional architecture for efficient learning of spatio-temporal features in video action recognition. Unlike 2D convolutional neural networks (CNNs), 3D CNNs can be applied directly to consecutive frames to extract spatio-temporal features. The aim of this work is to fuse the convolution layers from 2D and 3D CNNs to allow temporal encoding with fewer parameters than 3D CNNs. We adopt transfer learning from pre-trained 2D CNNs for spatial feature extraction, followed by temporal encoding, before connecting to 3D convolution layers at the top of the architecture. We construct our fusion architecture, semi-CNN, based on three popular models: VGG-16, ResNets and DenseNets, and compare the performance with their corresponding 3D models. Our empirical results, evaluated on the action recognition dataset UCF-101, demonstrate that our fusion of 1D, 2D and 3D convolutions outperforms the 3D model of the same depth while using fewer parameters and reducing overfitting. Our semi-CNN architecture achieves an average boost of 16–30% in top-1 accuracy when evaluated on 16-frame input videos.
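To make the 2D-to-1D-to-3D fusion concrete, below is a minimal PyTorch sketch of the kind of architecture the abstract describes: a pre-trained 2D backbone extracts per-frame spatial features (transfer learning), a 1D convolution encodes features across frames, and 3D convolution layers at the top learn joint spatio-temporal features. The backbone choice (ResNet-18), layer widths, and the split point between stages are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of a 2D -> 1D -> 3D fusion ("semi-CNN") for video clips.
# ResNet-18 as the 2D backbone, the layer widths, and the stage split are
# illustrative assumptions, not the configuration from the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SemiCNN(nn.Module):
    def __init__(self, num_classes=101):  # UCF-101 has 101 action classes
        super().__init__()
        # Transfer learning: reuse a 2D CNN pre-trained on ImageNet.
        backbone = resnet18(weights="IMAGENET1K_V1")
        # 2D spatial stage: all layers up to (not including) pooling/fc,
        # applied to every frame independently.
        self.spatial = nn.Sequential(*list(backbone.children())[:-2])
        # 1D temporal encoding: a (3, 1, 1) kernel convolves over time only,
        # at each spatial location.
        self.temporal = nn.Conv3d(512, 512, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # 3D stage at the top for joint spatio-temporal features.
        self.spatiotemporal = nn.Sequential(
            nn.Conv3d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm3d(512),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                 # x: (B, T, 3, H, W), e.g. T = 16 frames
        b, t = x.shape[:2]
        f = self.spatial(x.flatten(0, 1))                       # (B*T, 512, H', W')
        f = f.view(b, t, *f.shape[1:]).permute(0, 2, 1, 3, 4)   # (B, 512, T, H', W')
        f = self.temporal(f)                                    # 1D encoding over time
        f = self.spatiotemporal(f).flatten(1)                   # (B, 512)
        return self.fc(f)

clip = torch.randn(2, 16, 3, 112, 112)   # two 16-frame clips
logits = SemiCNN()(clip)                  # (2, 101)
```

The (3, 1, 1) kernel in the temporal stage acts as a 1D convolution over time at each spatial location, which is one plausible reading of the paper's "fusion of 1D, 2D and 3D convolutions"; because only the temporal and top 3D stages add new weights, the model stays far smaller than a full 3D CNN of the same depth.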


Bibliographic Details
Main Authors: Mei Chee Leong, Dilip K. Prasad, Yong Tsui Lee, Feng Lin
Format: Article
Language: English
Published: MDPI AG, 2020-01-01
Series: Applied Sciences
ISSN: 2076-3417
DOI: 10.3390/app10020557
Subjects: action recognition; spatio-temporal features; convolution network; transfer learning
Online Access: https://www.mdpi.com/2076-3417/10/2/557
Author Affiliations:
Mei Chee Leong: Institute for Media Innovation, Interdisciplinary Graduate School, Nanyang Technological University, Singapore 639798, Singapore
Dilip K. Prasad: Department of Computer Science, UiT The Arctic University of Norway, 9019 Tromsø, Norway
Yong Tsui Lee: School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore 639798, Singapore
Feng Lin: School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore