Summary: | Ph.D. === National Taiwan University of Science and Technology === Department of Electronic Engineering === 105 === This dissertation presents a unified framework for video anomaly detection and localization based on hierarchical feature representations, kernel-based statistical models, and tree-based search algorithms. Most research on this topic has focused on detecting local anomalies, that is, video events with unusual appearance or motion. We are more interested in global anomalies, in which multiple video events interact in an unusual manner even though each individual event may be normal. To detect local and global anomalies simultaneously, we first introduce a hierarchical feature structure for video event representation. A statistical model is then built to capture the normal events in an anomaly-free training set, and a tree-based inference algorithm uses this model to detect and locate abnormal events in previously unseen test videos.
Within this common structure, we progressively enrich the feature structures, statistical models, and inference algorithms, with each step improving on the previous method. In this dissertation, we investigate two hierarchical feature representations: 1) the bag-of-words (BOW) histogram and 2) the ensemble of nearby spatio-temporal interest points (STIP); two kernel-based statistical models: 1) the one-class support vector machine (SVM) and 2) Gaussian process regression (GPR); and two inference algorithms: 1) single-instance path search and 2) multiple-instance path search (MiPS). Simulations on five popular benchmarks show that the proposed methods significantly outperform the main state-of-the-art methods while requiring less computation time.
We also demonstrate that this framework can be applied to improve many convolutional neural network (CNN) based object recognition methods. This is achieved with an iterative localization refinement (ILR) algorithm, a post-processing scheme that iteratively refines the object detection results so that they match as much of the ground truth as possible. Simulations show that the proposed method improves on the main state-of-the-art works on the large-scale PASCAL VOC 2007, PASCAL VOC 2012, and YouTube-Objects datasets.
|