A Classification Method with Taxonomy and Feature Selection for Gene Sequence Data

Bibliographic Details
Main Authors: Hung-Yu Chen, 陳泓宇
Other Authors: Tzu-Tsung Weng
Format: Others
Language: zh-TW
Published: 2012
Online Access: http://ndltd.ncl.edu.tw/handle/45947591403813350639
Description
Summary: Master's thesis === National Cheng Kung University === Institute of Information Management === 100 === To study the microbes in an ecological environment, biologists have traditionally cultured them in the laboratory. However, fewer than one percent of all microbes can be cultured this way. Metagenomics, a popular topic in biology, addresses this limitation by letting researchers extract samples and sequences directly from ecological environments. Classification techniques then organize the resulting gene sequence data into databases for those environments. Because gene sequence data have large numbers of both class values and features, methods that improve the accuracy and computational efficiency of classifying such data are worth developing. This study combines the concept of taxonomy with feature selection for that purpose. A taxonomy records the relations between class values at different levels, so it can be used to exclude unnecessary class values during class prediction; feature selection reduces the dimensionality of a data set. Experimental results on two gene sequence data sets show that our classification method outperforms the method that uses taxonomy alone in both prediction accuracy and computational efficiency. Compared with the method that uses feature selection alone, our method greatly improves the computational efficiency of the naïve Bayesian classifier, although its prediction accuracy is slightly lower at both the family and genus levels.
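The idea in the summary (use the taxonomy to discard class values before prediction, and use feature selection to shrink the feature space fed to a naïve Bayesian classifier) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say which features or selection criterion were used, so the k-mer features, the presence-rate score in `select_features`, the toy sequences, and every name below (`TAXONOMY`, `TRAIN`, `train`, `classify`) are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k=2):
    """Split a sequence into overlapping k-mers (assumed feature type)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def select_features(docs, labels, top_n):
    """Keep the top_n k-mers whose presence rate differs most across classes.

    Placeholder criterion for illustration; ties are broken alphabetically.
    """
    n_docs = Counter(labels)
    present = defaultdict(Counter)  # k-mer -> class -> docs containing it
    for doc, label in zip(docs, labels):
        for km in set(doc):
            present[km][label] += 1

    def score(km):
        rates = [present[km][c] / n_docs[c] for c in n_docs]
        return max(rates) - min(rates)

    ranked = sorted(present, key=lambda km: (-score(km), km))
    return set(ranked[:top_n])


class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing over a fixed vocabulary."""

    def fit(self, docs, labels, vocab):
        self.vocab = vocab
        self.class_docs = Counter(labels)
        self.counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.counts[label].update(km for km in doc if km in vocab)
        return self

    def log_posterior(self, doc, label):
        counts = self.counts[label]
        total = sum(counts.values()) + len(self.vocab)
        lp = math.log(self.class_docs[label] / sum(self.class_docs.values()))
        for km in doc:
            if km in self.vocab:  # k-mers dropped by feature selection are ignored
                lp += math.log((counts[km] + 1) / total)
        return lp

    def predict(self, doc, candidates=None):
        """Score only the candidate classes (all classes if none are given)."""
        candidates = list(self.class_docs) if candidates is None else candidates
        return max(candidates, key=lambda c: self.log_posterior(doc, c))


# Hypothetical two-level taxonomy (genus -> family) and toy training data.
TAXONOMY = {"GenA1": "FamA", "GenA2": "FamA", "GenB1": "FamB", "GenB2": "FamB"}
TRAIN = [
    ("AAAAAACCAAAA", "GenA1"), ("AAAACCAAAAAA", "GenA1"),
    ("AAAAAAGGAAAA", "GenA2"), ("AAGGAAAAAAAA", "GenA2"),
    ("TTTTTTCCTTTT", "GenB1"), ("TTCCTTTTTTTT", "GenB1"),
    ("TTTTTTGGTTTT", "GenB2"), ("TTTTGGTTTTTT", "GenB2"),
]


def train(data, taxonomy, k=2, top_n=8):
    """Select features once, then fit one classifier per taxonomy level."""
    docs = [kmers(seq, k) for seq, _ in data]
    genera = [genus for _, genus in data]
    families = [taxonomy[g] for g in genera]
    vocab = select_features(docs, genera, top_n)
    family_nb = NaiveBayes().fit(docs, families, vocab)
    genus_nb = NaiveBayes().fit(docs, genera, vocab)
    return family_nb, genus_nb, vocab


def classify(seq, family_nb, genus_nb, taxonomy, k=2):
    """Predict the family first, then score only the genera under that family."""
    doc = kmers(seq, k)
    family = family_nb.predict(doc)
    children = [g for g, f in taxonomy.items() if f == family]
    return family, genus_nb.predict(doc, children)


if __name__ == "__main__":
    family_nb, genus_nb, vocab = train(TRAIN, TAXONOMY)
    print(sorted(vocab))
    print(classify("AAAGGAAAAA", family_nb, genus_nb, TAXONOMY))
```

In this sketch the saving comes from two places: `classify` evaluates genus posteriors only for the children of the predicted family, and both classifiers iterate over the reduced vocabulary rather than every observed k-mer. A family-level mistake cannot be recovered at the genus level, which is one plausible reason a combined method could trade a little accuracy for speed.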