Learning Representative Feature Expression Algorithm and Architecture for Image and Action Recognition
Master's === National Taiwan University === Graduate Institute of Electronics Engineering === 103 === In the past decade, computer vision has made great progress and has had a significant impact on our daily life. Various intelligent devices have been developed based on big-data analysis and machine learning algorithms. Take Google Glass for example: this wearable device can...
Main Authors: | Kuo-Wei Tseng, 曾國維 |
---|---|
Other Authors: | Liang-Gee Chen |
Format: | Others |
Language: | en_US |
Published: |
2015
|
Online Access: | http://ndltd.ncl.edu.tw/handle/75257012226183461576 |
id |
ndltd-TW-103NTU05428112 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-103NTU054281122016-11-19T04:09:56Z http://ndltd.ncl.edu.tw/handle/75257012226183461576 Learning Representative Feature Expression Algorithm and Architecture for Image and Action Recognition 學習影像和動作辨識之代表性特徵值之演算法與架構設計 Kuo-Wei Tseng 曾國維 Master's National Taiwan University Graduate Institute of Electronics Engineering 103 In the past decade, computer vision has made great progress and has had a significant impact on our daily life. Various intelligent devices have been developed based on big-data analysis and machine learning algorithms. Take Google Glass for example: this wearable device can capture pictures of the people around you and analyze the images to recognize who they are. In some parking lots, vehicle license plate recognition systems are used for automatic check-in, so no parking token is needed to get through the gate. These applications show the possibility of achieving a future lifestyle by combining computer vision and machine learning. Looking further at visual tasks, action recognition is a top-priority problem to be solved. In the near future, intelligent robots will be built that can interact with human beings and do the most dangerous jobs for us. To do so, machines must learn the meanings of images and actions, just as we do.
Visual recognition tasks on videos are much more complex than those on images. A video sequence contains not only intensity and spatial information but also temporal features that capture the transformation between frames. With the advancement of technology, intelligent robots will be built in the near future, so machine vision in the video domain, which lets robots learn about our world, is a vital issue, and action recognition is a key part of it. Several algorithms for video tasks have been proposed in recent years, but their training procedures are too complex. In this thesis, we first introduce some applications and the commonly used recognition pipeline in the field of computer vision. A general visual recognition pipeline consists of three parts: (i) image/video pre-processing, (ii) feature extraction, and (iii) classification. In our approach, we focus on the pre-processing and feature extraction parts, using simple algorithms to achieve high performance. K-means clustering is broadly used for codebook generation in the Bag of Visual Words (BoVW) [6][7] method and is known for its computational speed. The idea in [8] is to use K-means clustering not to learn a codebook from high-level features but to learn representative patches from raw pixel values. In contrast to constructing hierarchical, deep architectures to learn complex features, this method needs only tens of minutes to train and achieves good performance on the CIFAR-10 dataset. In our approach, we extend the method from the image domain to the video domain, where K-means clusters representative volumes of frames instead of patches. However, the dimensionality of volumes is much larger than that of patches, and the training set of a video dataset is usually smaller than that of an image dataset, so it is not large enough to train a good K-means model. Therefore, we propose a method that learns volumes across different datasets to solve this problem. To sum up, we design an action recognition system based on K-means clustering that can learn and extract features across different datasets. Furthermore, we propose a hardware architecture for this algorithm, which can be used for both image and action recognition with slight parameter changes. Liang-Gee Chen 陳良基 2015 學位論文 ; thesis 55 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others
|
sources |
NDLTD |
description |
Master's === National Taiwan University === Graduate Institute of Electronics Engineering === 103 === In the past decade, computer vision has made great progress and has had a significant impact on our daily life. Various intelligent devices have been developed based on big-data analysis and machine learning algorithms. Take Google Glass for example: this wearable device can capture pictures of the people around you and analyze the images to recognize who they are. In some parking lots, vehicle license plate recognition systems are used for automatic check-in, so no parking token is needed to get through the gate. These applications show the possibility of achieving a future lifestyle by combining computer vision and machine learning. Looking further at visual tasks, action recognition is a top-priority problem to be solved. In the near future, intelligent robots will be built that can interact with human beings and do the most dangerous jobs for us. To do so, machines must learn the meanings of images and actions, just as we do.
Visual recognition tasks on videos are much more complex than those on images. A video sequence contains not only intensity and spatial information but also temporal features that capture the transformation between frames. With the advancement of technology, intelligent robots will be built in the near future, so machine vision in the video domain, which lets robots learn about our world, is a vital issue, and action recognition is a key part of it. Several algorithms for video tasks have been proposed in recent years, but their training procedures are too complex.
In this thesis, we first introduce some applications and the commonly used recognition pipeline in the field of computer vision. A general visual recognition pipeline consists of three parts: (i) image/video pre-processing, (ii) feature extraction, and (iii) classification. In our approach, we focus on the pre-processing and feature extraction parts, using simple algorithms to achieve high performance.
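The three-stage pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function names, the patch size, and the stand-in nearest-centroid classifier are all assumptions chosen only to make the stages concrete.

```python
import numpy as np

def preprocess(image):
    """Stage (i): normalize pixel intensities to zero mean, unit variance."""
    x = image.astype(np.float64)
    return (x - x.mean()) / (x.std() + 1e-8)

def extract_features(image, patch_size=4):
    """Stage (ii): flatten non-overlapping patches into feature vectors."""
    h, w = image.shape
    patches = [
        image[i:i + patch_size, j:j + patch_size].ravel()
        for i in range(0, h - patch_size + 1, patch_size)
        for j in range(0, w - patch_size + 1, patch_size)
    ]
    return np.stack(patches)

def classify(features, centroids):
    """Stage (iii): a stand-in nearest-centroid classifier on the
    mean feature vector (any classifier could be plugged in here)."""
    dists = np.linalg.norm(features.mean(axis=0) - centroids, axis=1)
    return int(np.argmin(dists))

# Toy run: one 8x8 "image", two random class centroids.
rng = np.random.default_rng(0)
img = rng.random((8, 8))
feats = extract_features(preprocess(img))        # (4, 16): four 4x4 patches
label = classify(feats, rng.random((2, feats.shape[1])))
```

The point of the sketch is the separation of concerns: each stage consumes the previous stage's output, so any one stage can be swapped out independently.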
K-means clustering is broadly used for codebook generation in the Bag of Visual Words (BoVW) [6][7] method and is known for its computational speed. The idea in [8] is to use K-means clustering not to learn a codebook from high-level features but to learn representative patches from raw pixel values. In contrast to constructing hierarchical, deep architectures to learn complex features, this method needs only tens of minutes to train and achieves good performance on the CIFAR-10 dataset. In our approach, we extend the method from the image domain to the video domain, where K-means clusters representative volumes of frames instead of patches. However, the dimensionality of volumes is much larger than that of patches, and the training set of a video dataset is usually smaller than that of an image dataset, so it is not large enough to train a good K-means model. Therefore, we propose a method that learns volumes across different datasets to solve this problem.
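The core step described above, clustering raw spatiotemporal volumes into representative centroids, can be sketched with a plain Lloyd's-iteration K-means. This is an illustrative toy, not the thesis system: the volume size (3 frames of 4x4 pixels), the random-noise video, and the helper names are all assumptions.

```python
import numpy as np

def kmeans(data, k, iters=10, seed=0):
    """Plain Lloyd's k-means: returns k centroids of `data` (n_samples, dim)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        for c in range(k):
            members = data[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

def sample_volumes(video, n, size=(3, 4, 4), seed=0):
    """Randomly crop n spatiotemporal volumes of shape (t, h, w) from a
    video of shape (frames, height, width), flattened to vectors."""
    rng = np.random.default_rng(seed)
    T, H, W = video.shape
    t, h, w = size
    out = []
    for _ in range(n):
        ti = rng.integers(0, T - t + 1)
        hi = rng.integers(0, H - h + 1)
        wi = rng.integers(0, W - w + 1)
        out.append(video[ti:ti + t, hi:hi + h, wi:wi + w].ravel())
    return np.stack(out)

# Toy video: 16 frames of 32x32 noise; learn 8 representative volumes.
rng = np.random.default_rng(1)
video = rng.random((16, 32, 32))
vols = sample_volumes(video, n=200)   # (200, 48): each volume is 3*4*4 pixels
codebook = kmeans(vols, k=8)          # (8, 48): the learned representatives
```

Note how quickly the dimensionality grows: a 4x4 patch is 16-dimensional, but even a shallow 3-frame volume of the same spatial size is 48-dimensional, which is why far more training samples (here drawn across datasets, per the proposed method) are needed for a good model.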
To sum up, we design an action recognition system based on K-means clustering that can learn and extract features across different datasets. Furthermore, we propose a hardware architecture for this algorithm, which can be used for both image and action recognition with slight parameter changes.
|
author2 |
Liang-Gee Chen |
author_facet |
Liang-Gee Chen Kuo-Wei Tseng 曾國維 |
author |
Kuo-Wei Tseng 曾國維 |
spellingShingle |
Kuo-Wei Tseng 曾國維 Learning Representative Feature Expression Algorithm and Architecture for Image and Action Recognition |
author_sort |
Kuo-Wei Tseng |
title |
Learning Representative Feature Expression Algorithm and Architecture for Image and Action Recognition |
title_short |
Learning Representative Feature Expression Algorithm and Architecture for Image and Action Recognition |
title_full |
Learning Representative Feature Expression Algorithm and Architecture for Image and Action Recognition |
title_fullStr |
Learning Representative Feature Expression Algorithm and Architecture for Image and Action Recognition |
title_full_unstemmed |
Learning Representative Feature Expression Algorithm and Architecture for Image and Action Recognition |
title_sort |
learning representative feature expression algorithm and architecture for image and action recognition |
publishDate |
2015 |
url |
http://ndltd.ncl.edu.tw/handle/75257012226183461576 |
work_keys_str_mv |
AT kuoweitseng learningrepresentativefeatureexpressionalgorithmandarchitectureforimageandactionrecognition AT céngguówéi learningrepresentativefeatureexpressionalgorithmandarchitectureforimageandactionrecognition AT kuoweitseng xuéxíyǐngxiànghédòngzuòbiànshízhīdàibiǎoxìngtèzhēngzhízhīyǎnsuànfǎyǔjiàgòushèjì AT céngguówéi xuéxíyǐngxiànghédòngzuòbiànshízhīdàibiǎoxìngtèzhēngzhízhīyǎnsuànfǎyǔjiàgòushèjì |
_version_ |
1718395074757263360 |