Summary: | Master's === National Taiwan University === Graduate Institute of Electronics Engineering === 103 === In the past decade, computer vision has made great progress and has had a significant impact on our daily life. Various intelligent devices have been developed based on big-data analysis and machine-learning algorithms. Take Google Glass for example: this wearable device can capture pictures of the people around you and analyze the images to recognize who they are. In some parking lots, a vehicle license-plate recognition system handles check-in automatically, so no parking token is needed to get through the gate. These applications show the possibility of achieving a future lifestyle by combining computer vision and machine learning. Looking further at visual tasks, action recognition must be a top-priority problem to solve. In the near future, intelligent robots will be invented that can interact with human beings and do the most dangerous jobs for us. To do so, machines must learn the meanings of images and actions, just as we do.
Visual recognition tasks on videos are much more complex than those on images. A video sequence contains not only intensity and spatial information but also temporal features, which capture the transformation between frames. As technology advances, intelligent robots will be invented in the near future; therefore, machine vision in the video domain, which lets robots learn about our world, is a vital issue, and action recognition is part of it. Several algorithms for video tasks have been proposed in recent years, but their training procedures are too complex.
In this thesis, we first introduce some applications and commonly used recognition pipelines in the field of computer vision. A general visual recognition pipeline consists of three parts: (i) image/video pre-processing, (ii) feature extraction, and (iii) classification. In our approach, we focus on the pre-processing and feature-extraction parts, using a simple algorithm to achieve high performance.
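The three-stage pipeline above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the normalization, the raw-patch feature extractor, and the linear classifier are all stand-ins for whatever fills each slot in practice, and the function names are assumptions.

```python
import numpy as np

def preprocess(image):
    # Stage (i): normalize pixel intensities to zero mean, unit variance
    # (a common pre-processing choice; the exact scheme may differ).
    x = image.astype(np.float64)
    return (x - x.mean()) / (x.std() + 1e-8)

def extract_features(image, patch_size=6):
    # Stage (ii): collect non-overlapping patches and flatten each to a
    # vector (illustrative only; many feature extractors fit this slot).
    h, w = image.shape
    patches = [image[i:i + patch_size, j:j + patch_size].ravel()
               for i in range(0, h - patch_size + 1, patch_size)
               for j in range(0, w - patch_size + 1, patch_size)]
    return np.vstack(patches)

def classify(feature, weights, bias):
    # Stage (iii): a linear classifier stands in for the final stage.
    scores = feature @ weights + bias
    return int(np.argmax(scores))
```

Each stage is independent, so a better feature extractor or classifier can be swapped in without touching the rest of the pipeline.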
K-means clustering is widely used for codebook generation in the Bag of Visual Words (BOVW) [6][7] method and is known for its computational speed. The idea in [8] is to use K-means clustering not to learn a codebook from high-level features but to learn representative patches directly from raw pixel values. In contrast to constructing hierarchical, deep architectures to learn complex features, this method needs only tens of minutes to train and achieves good performance on the CIFAR-10 dataset. In our approach, we extend the method from the image domain to the video domain, where K-means clusters representative volumes of frames instead of patches. However, the dimensionality of a volume is much larger than that of a patch, and the training set of a video dataset is usually smaller than that of an image dataset, so it is not large enough to train a good K-means model. Therefore, we propose a method that learns volumes from a different dataset to solve this problem.
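The core idea, clustering flattened raw-pixel volumes into a codebook and then encoding new volumes against it, can be sketched as below. This is a minimal Lloyd's-algorithm sketch with hard-assignment encoding, assuming each spatio-temporal volume has already been flattened to a vector; it is not the thesis's exact training procedure, and the function names are hypothetical.

```python
import numpy as np

def kmeans_codebook(volumes, k=64, iters=10, seed=0):
    # Cluster raw-pixel volumes (each flattened to a 1-D vector) into k
    # centroids; the centroids form the codebook.
    rng = np.random.default_rng(seed)
    data = np.asarray(volumes, dtype=np.float64)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its members;
        # keep the old centroid if a cluster becomes empty.
        for c in range(k):
            members = data[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

def encode_hard(volume, centroids):
    # BOVW hard assignment: a one-hot histogram over the codebook.
    d = ((centroids - volume) ** 2).sum(axis=1)
    hist = np.zeros(len(centroids))
    hist[d.argmin()] = 1.0
    return hist
```

Because the clustering runs directly on raw pixel values, the whole training step is a single pass of K-means, which is what keeps the training time to tens of minutes rather than the hours or days a deep architecture would need.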
To sum up, we design an action recognition system based on K-means clustering that can learn and extract features from different datasets. Furthermore, we propose a hardware architecture for this algorithm, which can be used for both image and action recognition with slight parameter changes.
|