Real-Time Video Object Detection with Temporal Feature Aggregation

In recent years, various high-performance networks have been proposed for single-image object detection. An obvious choice is to design a video detection network based on state-of-the-art single-image detectors. However, video object detection is still challenging due to the lower quality of individ...


Bibliographic Details
Main Author:	Chen, Meihong
Other Authors:	Lang, Jochen
Format:	Others
Language:	en
Published:	Université d'Ottawa / University of Ottawa 2021
Subjects:	Attention Mechanism; AP3D; CNN; Octave Convolution; One-Stage Detection; Video Object Detection
Online Access:	http://hdl.handle.net/10393/42790 http://dx.doi.org/10.20381/ruor-27007

id	ndltd-uottawa.ca-oai-ruor.uottawa.ca-10393-42790
record_format	oai_dc
spelling	ndltd-uottawa.ca-oai-ruor.uottawa.ca-10393-42790 2021-10-07T05:23:30Z Real-Time Video Object Detection with Temporal Feature Aggregation Chen, Meihong Lang, Jochen Attention Mechanism AP3D CNN Octave Convolution One-Stage Detection Video Object Detection 2021-10-05T18:00:57Z 2021-10-05T18:00:57Z 2021-10-05 Thesis http://hdl.handle.net/10393/42790 http://dx.doi.org/10.20381/ruor-27007 en application/pdf Université d'Ottawa / University of Ottawa
collection	NDLTD
language	en
format	Others
sources	NDLTD
topic	Attention Mechanism AP3D CNN Octave Convolution One-Stage Detection Video Object Detection
spellingShingle	Attention Mechanism AP3D CNN Octave Convolution One-Stage Detection Video Object Detection Chen, Meihong Real-Time Video Object Detection with Temporal Feature Aggregation
description	In recent years, various high-performance networks have been proposed for single-image object detection. An obvious choice is to design a video detection network based on state-of-the-art single-image detectors. However, video object detection remains challenging because individual video frames are of lower quality, so temporal information is needed for high-quality detection results. In this thesis, we design a novel interleaved architecture combining a 2D convolutional network and a 3D temporal network. We use YOLOv3 as the base detector. To exploit inter-frame information, we propose feature aggregation based on a temporal network. Our temporal network uses Appearance-Preserving 3D Convolution (AP3D) to extract aligned features in the temporal dimension. Our multi-scale detector and multi-scale temporal network communicate at each scale and also across scales. In this thesis, the temporal network takes 4, 8, or 16 input frames, and we name the corresponding variants TemporalNet-4, TemporalNet-8, and TemporalNet-16. Our approach achieves 77.1% mAP (mean Average Precision) on the ImageNet VID 2017 dataset with TemporalNet-4, while TemporalNet-16 achieves 80.9% mAP, a competitive result on this video object detection benchmark. Our network is also real-time, with a running time of 35 ms/frame.
author2	Lang, Jochen
author_facet	Lang, Jochen Chen, Meihong
author	Chen, Meihong
author_sort	Chen, Meihong
title	Real-Time Video Object Detection with Temporal Feature Aggregation
title_sort	real-time video object detection with temporal feature aggregation
publisher	Université d'Ottawa / University of Ottawa
publishDate	2021
url	http://hdl.handle.net/10393/42790 http://dx.doi.org/10.20381/ruor-27007
work_keys_str_mv	AT chenmeihong realtimevideoobjectdetectionwithtemporalfeatureaggregation
_version_	1719487880849522688

Similar Items