Activity Recognition with Combination of Deeply Learned Visual Attention and Pose Estimation

While human activity recognition and pose estimation are closely related, these two issues are usually treated as separate tasks. In this thesis, two-dimensional and three-dimensional pose estimation is performed for human activity recognition in a video sequence, and the final activity is determined by comb...

Bibliographic Details
Main Authors: Jisu Kim, Deokwoo Lee
Format: Article
Language: English
Published: MDPI AG 2021-05-01
Series: Applied Sciences
Subjects: activity recognition; deep neural network; visual attention; pose estimation
Online Access: https://www.mdpi.com/2076-3417/11/9/4153
id doaj-1980bbef3838436ba29b6287af4921d9
record_format Article
spelling doaj-1980bbef3838436ba29b6287af4921d9; 2021-05-31T23:02:25Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2021-05-01; 11; 4153; 4153; 10.3390/app11094153; Activity Recognition with Combination of Deeply Learned Visual Attention and Pose Estimation; Jisu Kim (0); Deokwoo Lee (1); Department of Computer Engineering, Keimyung University, Daegu 42601, Korea; Department of Computer Engineering, Keimyung University, Daegu 42601, Korea; https://www.mdpi.com/2076-3417/11/9/4153; activity recognition; deep neural network; visual attention; pose estimation
collection DOAJ
language English
format Article
sources DOAJ
author Jisu Kim
Deokwoo Lee
title Activity Recognition with Combination of Deeply Learned Visual Attention and Pose Estimation
publisher MDPI AG
series Applied Sciences
issn 2076-3417
publishDate 2021-05-01
description While human activity recognition and pose estimation are closely related, these two issues are usually treated as separate tasks. In this thesis, two-dimensional and three-dimensional pose estimation is performed for human activity recognition in a video sequence, and the final activity is determined by combining the result with an activity recognition algorithm that uses visual attention. Both problems can be solved efficiently with a single architecture. It is also shown that end-to-end optimization leads to much higher accuracy than separate learning. The proposed architecture can be trained seamlessly with different categories of data. Soft visual attention is used, implemented with a multilayer recurrent neural network built on long short-term memory (LSTM) that can operate both temporally and spatially. To increase reliability, the image, the pose-estimated skeleton, and the RGB-based activity recognition data are all combined to determine the final activity. The visual attention model is evaluated on the UCF-11 (YouTube Action), HMDB-51, and Hollywood2 data sets, with an analysis of where the model focuses depending on the scene and the task it is performing. Pose estimation and activity recognition are tested and analyzed on the MPII, Human3.6M, Penn Action, and NTU data sets. Test results are 98.9% on Penn Action, 87.9% on NTU, and 88.6% on NW-UCLA.
topic activity recognition
deep neural network
visual attention
pose estimation
url https://www.mdpi.com/2076-3417/11/9/4153