Multi-teacher Knowledge Distillation for Compressed Video Action Recognition on Deep Neural Networks

Bibliographic Details
Main Authors: Wu, Meng-Chieh, 吳孟潔
Other Authors: Chiu, Ching-Te
Format: Others
Language: en_US
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/nwqj9a
Description
Summary: Master's === National Tsing Hua University === Department of Computer Science === 106 === Human action recognition has been an active research topic in computer vision due to its wide range of applications, such as smart surveillance, smart homes, and health-care monitoring. Implementing these applications on VLSI or embedded computing systems imposes low-power and real-time requirements.

Recently, convolutional networks (ConvNets) have made great progress in image classification, and they have also been applied to action recognition. However, action recognition differs from still-image classification: video data contains temporal information, which plays an important role in video understanding. Most current approaches use multiple CNNs to learn spatial and temporal features separately and then fuse their results for the final prediction, which greatly increases the number of parameters. Moreover, most of these methods take dense optical flow as the motion representation, so the computational cost is excessive, making the whole model cumbersome and slow. Other approaches learn spatio-temporal features by stacking multiple consecutive frames into a single 3D ConvNet, but consecutive frames are highly redundant, and 3D convolution causes an explosion in parameters and computation time. None of the above methods performs action recognition efficiently. The most efficient method to date trains a deep network directly on the compressed video, whose built-in motion information replaces the computationally expensive optical flow field. However, this model still has a large number of parameters and requires about 300 MB of storage.

We propose a multi-teacher knowledge distillation framework for compressed video action recognition to compress this model. Within this framework, the model is compressed by transferring the knowledge of multiple teachers to a single small student model. We integrate the knowledge from teachers trained on different input types and teach the student with this comprehensive knowledge. With multi-teacher knowledge distillation, the student learns better than with single-teacher knowledge distillation. Experiments show that we reach a 2.4× compression rate (about 125 MB of storage) and a 1.2× computation reduction with about 2.14% loss of accuracy on the UCF-101 dataset.
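
The abstract does not spell out the distillation objective, but a common way to realize "integrating the knowledge from different teachers" is to average the teachers' softened output distributions and train the student against that ensemble target alongside the ground-truth labels. The PyTorch-style sketch below illustrates this idea under that assumption; the function name, temperature, and loss weighting are hypothetical choices for illustration, not the thesis's reported settings.

    # A minimal sketch of a multi-teacher distillation objective, assuming the
    # standard softened-logit formulation (Hinton et al., 2015). Averaging the
    # teachers, the temperature, and the loss weighting are illustrative
    # assumptions; the thesis's actual integration scheme is not given here.
    import torch
    import torch.nn.functional as F

    def multi_teacher_distillation_loss(student_logits, teacher_logits_list,
                                        labels, temperature=4.0, alpha=0.7):
        """Blend soft targets from several teachers with the hard-label loss.

        student_logits:      (batch, classes) output of the small student model
        teacher_logits_list: list of (batch, classes) outputs, one per teacher
                             (e.g., teachers fed different compressed-video
                             input types such as I-frames or motion vectors)
        """
        # Integrate the teachers' knowledge: average their softened distributions.
        teacher_probs = torch.stack(
            [F.softmax(t / temperature, dim=1) for t in teacher_logits_list]
        ).mean(dim=0)

        # KL divergence between the student's softened prediction and the
        # averaged teacher distribution, scaled by T^2 as is conventional.
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=1),
            teacher_probs,
            reduction="batchmean",
        ) * temperature ** 2

        # Standard cross-entropy against the ground-truth action labels.
        hard_loss = F.cross_entropy(student_logits, labels)

        return alpha * soft_loss + (1 - alpha) * hard_loss

Averaging is the simplest integration scheme; a weighted combination per teacher (for example, by input type or validation accuracy) would be an equally plausible reading of the abstract.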