Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation

Inspired by the recent spatio-temporal action localization efforts with tubelets (sequences of bounding boxes), we present a new spatio-temporal action localization detector Segment-tube, which consists of sequences of per-frame segmentation masks. The proposed Segment-tube detector can temporally p...

Bibliographic Details
Main Authors: Le Wang, Xuhuan Duan, Qilin Zhang, Zhenxing Niu, Gang Hua, Nanning Zheng
Format: Article
Language: English
Published: MDPI AG 2018-05-01
Series: Sensors
Subjects: action localization, action segmentation, 3D ConvNets, LSTM
Online Access: http://www.mdpi.com/1424-8220/18/5/1657
id doaj-57fd4fb741dd405abdb18b01dd8cdfd6
record_format Article
spelling doaj-57fd4fb741dd405abdb18b01dd8cdfd6 | 2020-11-24T21:18:58Z | eng | MDPI AG | Sensors | 1424-8220 | 2018-05-01 | 18 | 5 | 1657 | 10.3390/s18051657 | s18051657 | Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation | Le Wang (0) | Xuhuan Duan (1) | Qilin Zhang (2) | Zhenxing Niu (3) | Gang Hua (4) | Nanning Zheng (5) | Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China | Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China | HERE Technologies, Chicago, IL 60606, USA | Alibaba Group, Hangzhou 311121, China | Microsoft Research, Redmond, WA 98052, USA | Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China | Inspired by the recent spatio-temporal action localization efforts with tubelets (sequences of bounding boxes), we present a new spatio-temporal action localization detector Segment-tube, which consists of sequences of per-frame segmentation masks. The proposed Segment-tube detector can temporally pinpoint the starting/ending frame of each action category in the presence of preceding/subsequent interference actions in untrimmed videos. Simultaneously, the Segment-tube detector produces per-frame segmentation masks instead of bounding boxes, offering superior spatial accuracy to tubelets. This is achieved by alternating iterative optimization between temporal action localization and spatial action segmentation. Experimental results on three datasets validated the efficacy of the proposed method, including (1) temporal action localization on the THUMOS 2014 dataset; (2) spatial action segmentation on the Segtrack dataset; and (3) joint spatio-temporal action localization on the newly proposed ActSeg dataset. It is shown that our method compares favorably with existing state-of-the-art methods. | http://www.mdpi.com/1424-8220/18/5/1657 | action localization | action segmentation | 3D ConvNets | LSTM
collection DOAJ
language English
format Article
sources DOAJ
author Le Wang
Xuhuan Duan
Qilin Zhang
Zhenxing Niu
Gang Hua
Nanning Zheng
spellingShingle Le Wang
Xuhuan Duan
Qilin Zhang
Zhenxing Niu
Gang Hua
Nanning Zheng
Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation
Sensors
action localization
action segmentation
3D ConvNets
LSTM
author_facet Le Wang
Xuhuan Duan
Qilin Zhang
Zhenxing Niu
Gang Hua
Nanning Zheng
author_sort Le Wang
title Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation
title_short Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation
title_full Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation
title_fullStr Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation
title_full_unstemmed Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation
title_sort segment-tube: spatio-temporal action localization in untrimmed videos with per-frame segmentation
publisher MDPI AG
series Sensors
issn 1424-8220
publishDate 2018-05-01
description Inspired by the recent spatio-temporal action localization efforts with tubelets (sequences of bounding boxes), we present a new spatio-temporal action localization detector Segment-tube, which consists of sequences of per-frame segmentation masks. The proposed Segment-tube detector can temporally pinpoint the starting/ending frame of each action category in the presence of preceding/subsequent interference actions in untrimmed videos. Simultaneously, the Segment-tube detector produces per-frame segmentation masks instead of bounding boxes, offering superior spatial accuracy to tubelets. This is achieved by alternating iterative optimization between temporal action localization and spatial action segmentation. Experimental results on three datasets validated the efficacy of the proposed method, including (1) temporal action localization on the THUMOS 2014 dataset; (2) spatial action segmentation on the Segtrack dataset; and (3) joint spatio-temporal action localization on the newly proposed ActSeg dataset. It is shown that our method compares favorably with existing state-of-the-art methods.
topic action localization
action segmentation
3D ConvNets
LSTM
url http://www.mdpi.com/1424-8220/18/5/1657
work_keys_str_mv AT lewang segmenttubespatiotemporalactionlocalizationinuntrimmedvideoswithperframesegmentation
AT xuhuanduan segmenttubespatiotemporalactionlocalizationinuntrimmedvideoswithperframesegmentation
AT qilinzhang segmenttubespatiotemporalactionlocalizationinuntrimmedvideoswithperframesegmentation
AT zhenxingniu segmenttubespatiotemporalactionlocalizationinuntrimmedvideoswithperframesegmentation
AT ganghua segmenttubespatiotemporalactionlocalizationinuntrimmedvideoswithperframesegmentation
AT nanningzheng segmenttubespatiotemporalactionlocalizationinuntrimmedvideoswithperframesegmentation
_version_ 1726007475563397120