Activity Recognition with Combination of Deeply Learned Visual Attention and Pose Estimation
While human activity recognition and pose estimation are closely related, the two problems are usually treated as separate tasks. In this work, two-dimensional and three-dimensional pose estimates are obtained for human activity recognition in a video sequence, and the final activity is determined by comb...
Main Authors: | Jisu Kim, Deokwoo Lee |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-05-01 |
Series: | Applied Sciences |
Subjects: | activity recognition, deep neural network, visual attention, pose estimation |
Online Access: | https://www.mdpi.com/2076-3417/11/9/4153 |
id |
doaj-1980bbef3838436ba29b6287af4921d9 |
---|---|
record_format |
Article |
spelling |
doaj-1980bbef3838436ba29b6287af4921d9; 2021-05-31T23:02:25Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2021-05-01; 11; 4153; 4153; 10.3390/app11094153; Activity Recognition with Combination of Deeply Learned Visual Attention and Pose Estimation; Jisu Kim (0), Deokwoo Lee (1); Department of Computer Engineering, Keimyung University, Daegu 42601, Korea; Department of Computer Engineering, Keimyung University, Daegu 42601, Korea; While human activity recognition and pose estimation are closely related, the two problems are usually treated as separate tasks. In this work, two-dimensional and three-dimensional pose estimates are obtained for human activity recognition in a video sequence, and the final activity is determined by combining them with an activity recognition algorithm that uses visual attention. Both problems can be solved efficiently with a single architecture. It is also shown that end-to-end optimization leads to much higher accuracy than training the tasks separately. The proposed architecture can be trained seamlessly with different categories of data. Visual attention is implemented as soft attention with a multilayer recurrent neural network based on long short-term memory (LSTM) that can operate both temporally and spatially. To increase reliability, the image, the skeleton from pose estimation, and the RGB-based activity recognition output are combined to determine the final activity. The visual attention model is evaluated on the UCF-11 (YouTube Action), HMDB-51, and Hollywood2 data sets, with an analysis of where the model focuses depending on the scene and the task it is performing. Pose estimation and activity recognition are tested and analyzed on the MPII, Human3.6M, Penn Action, and NTU data sets. Test results are 98.9% on Penn Action, 87.9% on NTU, and 88.6% on NW-UCLA.; https://www.mdpi.com/2076-3417/11/9/4153; activity recognition; deep neural network; visual attention; pose estimation |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Jisu Kim Deokwoo Lee |
spellingShingle |
Jisu Kim Deokwoo Lee Activity Recognition with Combination of Deeply Learned Visual Attention and Pose Estimation Applied Sciences activity recognition deep neural network visual attention pose estimation |
author_facet |
Jisu Kim Deokwoo Lee |
author_sort |
Jisu Kim |
title |
Activity Recognition with Combination of Deeply Learned Visual Attention and Pose Estimation |
title_short |
Activity Recognition with Combination of Deeply Learned Visual Attention and Pose Estimation |
title_full |
Activity Recognition with Combination of Deeply Learned Visual Attention and Pose Estimation |
title_fullStr |
Activity Recognition with Combination of Deeply Learned Visual Attention and Pose Estimation |
title_full_unstemmed |
Activity Recognition with Combination of Deeply Learned Visual Attention and Pose Estimation |
title_sort |
activity recognition with combination of deeply learned visual attention and pose estimation |
publisher |
MDPI AG |
series |
Applied Sciences |
issn |
2076-3417 |
publishDate |
2021-05-01 |
description |
While human activity recognition and pose estimation are closely related, the two problems are usually treated as separate tasks. In this work, two-dimensional and three-dimensional pose estimates are obtained for human activity recognition in a video sequence, and the final activity is determined by combining them with an activity recognition algorithm that uses visual attention. Both problems can be solved efficiently with a single architecture. It is also shown that end-to-end optimization leads to much higher accuracy than training the tasks separately. The proposed architecture can be trained seamlessly with different categories of data. Visual attention is implemented as soft attention with a multilayer recurrent neural network based on long short-term memory (LSTM) that can operate both temporally and spatially. To increase reliability, the image, the skeleton from pose estimation, and the RGB-based activity recognition output are combined to determine the final activity. The visual attention model is evaluated on the UCF-11 (YouTube Action), HMDB-51, and Hollywood2 data sets, with an analysis of where the model focuses depending on the scene and the task it is performing. Pose estimation and activity recognition are tested and analyzed on the MPII, Human3.6M, Penn Action, and NTU data sets. Test results are 98.9% on Penn Action, 87.9% on NTU, and 88.6% on NW-UCLA. |
topic |
activity recognition deep neural network visual attention pose estimation |
url |
https://www.mdpi.com/2076-3417/11/9/4153 |
work_keys_str_mv |
AT jisukim activityrecognitionwithcombinationofdeeplylearnedvisualattentionandposeestimation AT deokwoolee activityrecognitionwithcombinationofdeeplylearnedvisualattentionandposeestimation |
_version_ |
1721418472321187840 |
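
Illustrative note on the description field: the abstract states that soft visual attention is used, but this record gives no implementation details. The snippet below is only a minimal sketch of the general soft-attention idea (a softmax over region scores followed by a weighted sum of region features), not the authors' method. The function names `softmax` and `soft_attention`, the toy feature vectors, and the scores are all hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, scores):
    """Weighted average of region features (hypothetical helper).

    features: list of K region feature vectors, each of length D
    scores:   list of K relevance scores, one per region
    Returns the attention weights and the resulting context vector.
    """
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context


# Toy input (assumed values): three image regions with 2-D features
# and equal scores, so every region receives the same weight.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]
weights, context = soft_attention(features, scores)
print(weights, context)
```

Because the weights are a smooth function of the scores, this kind of attention is differentiable, which is what would allow it to be trained end to end together with a recurrent network; how the scores are produced in the article itself is not stated in this record.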