Learning Representative Feature Expression Algorithm and Architecture for Image and Action Recognition

Bibliographic Details
Main Authors: Kuo-Wei Tseng, 曾國維
Other Authors: Liang-Gee Chen
Format: Others
Language: en_US
Published: 2015
Online Access: http://ndltd.ncl.edu.tw/handle/75257012226183461576
Description
Summary: Master's === National Taiwan University === Graduate Institute of Electronics Engineering === Academic Year 103 === In the past decade, computer vision has made great progress and has had a significant impact on our daily lives. Various intelligent devices have been developed based on big-data analysis and machine-learning algorithms. Take Google Glass, for example: this wearable device can capture pictures of the people around you and analyze the images to recognize who they are. In some parking lots, a vehicle license plate recognition system handles check-in automatically, so no parking token is needed to pass through the gate. These applications show the possibility of a future lifestyle that combines computer vision and machine learning.

Among visual tasks, action recognition is a top-priority problem to be solved. In the near future, intelligent robots will be built that can interact with human beings and do the most dangerous jobs for us. To do so, machines must learn the meanings of images and actions, just as we do. Visual recognition tasks on videos are much more complex than those on images: a video sequence contains not only intensity and spatial information but also temporal features that describe the transformation between frames. Machine vision in the video domain, which lets robots learn about our world, is therefore a vital issue, and action recognition is central to it. Several algorithms for video tasks have been proposed in recent years, but their training procedures are too complex.

In this thesis, we first introduce some applications and the commonly used recognition pipeline in the field of computer vision. A general visual recognition pipeline consists of three parts: (i) image/video pre-processing, (ii) feature extraction, and (iii) classification. Our approach focuses on the pre-processing and feature extraction parts, using a simple algorithm to achieve high performance. K-means clustering is broadly used for codebook generation in the Bag of Visual Words (BOVW) [6][7] method and is known for its computational speed. The idea in [8] is to use k-means clustering not to learn a codebook from high-level features but to learn representative patches directly from raw pixel values. In contrast to constructing hierarchical, deep architectures to learn complex features, this method needs only tens of minutes to train and achieves good performance on the CIFAR-10 dataset. We extend the method from the image domain to the video domain, where k-means clusters representative volumes of frames instead of patches. However, the dimensionality of a volume is much larger than that of a patch, and a video dataset usually contains less training data than an image dataset, so it is not large enough to train a good k-means model. We therefore propose a method that learns volumes across different datasets to solve this problem.

To sum up, we design an action recognition system based on the k-means clustering method that can learn and extract features from different datasets. Furthermore, we propose a hardware architecture for this algorithm; with slight parameter changes, the same architecture can be used for both image and action recognition.
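To make the described pipeline concrete, below is a minimal sketch in Python of k-means feature learning extended from image patches to spatio-temporal volumes, following the three-stage pipeline the abstract outlines (pre-processing, feature extraction, classification). It is not the thesis implementation: the volume size, the number of clusters, the histogram encoding, and helper names such as extract_volumes are illustrative assumptions, and scikit-learn's MiniBatchKMeans and LinearSVC stand in for whatever clustering routine and classifier the thesis actually uses.

    # Sketch: k-means feature learning on spatio-temporal volumes.
    # All sizes, names, and the encoding scheme are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.svm import LinearSVC

    def extract_volumes(video, size=8, depth=4, n_samples=200):
        """Sample small (depth x size x size) volumes from a (T, H, W) video."""
        T, H, W = video.shape
        rng = np.random.default_rng(0)
        vols = []
        for _ in range(n_samples):
            t = rng.integers(0, T - depth + 1)
            y = rng.integers(0, H - size + 1)
            x = rng.integers(0, W - size + 1)
            v = video[t:t + depth, y:y + size, x:x + size].astype(np.float64).ravel()
            v = (v - v.mean()) / (v.std() + 1e-8)  # pre-processing: per-volume normalization
            vols.append(v)
        return np.stack(vols)

    # 1) Learn a codebook of representative volumes with k-means. Training
    #    volumes can be pooled from several datasets, which is the abstract's
    #    remedy for video datasets being too small on their own.
    train_videos = [np.random.rand(32, 64, 64) for _ in range(10)]  # stand-in data
    codebook = MiniBatchKMeans(n_clusters=128, random_state=0)
    codebook.fit(np.vstack([extract_volumes(v) for v in train_videos]))

    # 2) Feature extraction: encode each video as a histogram of
    #    nearest-centroid assignments (a BOVW descriptor over volumes
    #    instead of patches).
    def encode(video):
        labels = codebook.predict(extract_volumes(video))
        hist = np.bincount(labels, minlength=codebook.n_clusters).astype(np.float64)
        return hist / hist.sum()

    # 3) Classification: train a linear SVM on the fixed-length descriptors.
    X = np.stack([encode(v) for v in train_videos])
    y = np.arange(10) % 2  # stand-in action labels
    clf = LinearSVC().fit(X, y)

Because the descriptor is just a fixed-length histogram, swapping image patches (depth = 1) for volumes only changes the sampling step, which mirrors the abstract's claim that the same architecture serves image and action recognition with slight parameter changes.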