Round Robin Bag-of-words Generation for Text Classification

Bibliographic Details
Main Authors: Wei-han Chen, 陳威翰
Other Authors: Yuh-jye Lee
Format: Others
Language: en_US
Published: 2007
Online Access: http://ndltd.ncl.edu.tw/handle/hzz4y4
Summary: Master's === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 95 === In text classification, the bag-of-words representation is a widely used way to represent documents, and it leads to a major characteristic: high-dimensional, sparse data. How to reduce the dimensionality of the data efficiently to achieve better performance in a learning task is an ongoing research problem. There are two different approaches to reducing the data dimension: feature extraction and feature selection. Feature extraction (dimension reduction) algorithms find an efficient subspace of the original space and then project the data onto that subspace; feature selection methods select informative features according to different feature evaluation criteria. In this thesis, we propose "Round Robin Bag-of-words Generation (RRBWG)" for text classification, which combines the ideas of dimension reduction and feature selection. The RRBWG method has two stages. In the first stage, we extract the main concept of each category with the well-known dimension reduction method Latent Semantic Indexing (LSI); we then rank the features of each category by their corresponding coefficients in the extracted concepts, since these concepts are linear combinations of the features. In the second stage, we select features from each category's ranking with a round-robin selection strategy to generate the bag-of-words. Comparison results show that RRBWG consistently performs better than classical feature selection with two feature scoring criteria, mutual information (MI) and the chi-square test, and better than using no feature selection, on four benchmark text collections. Furthermore, we combine RRBWG with these two feature scoring criteria. Experimental results show that RRBWG can cooperate with various feature scoring criteria, with different strengths on the benchmark datasets.
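
The abstract describes the two-stage procedure in enough detail to sketch it. Below is a minimal illustrative Python sketch, using scikit-learn's TruncatedSVD as the LSI step; the function name rrbwg_select, its signature, the use of one concept per category, and the handling of duplicate features are assumptions for illustration and are not taken from the thesis itself.

    # A minimal sketch of the RRBWG two-stage procedure as described in the
    # abstract. All names (rrbwg_select, n_words, ...) are illustrative.
    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    def rrbwg_select(X, y, n_words):
        """Select n_words features by round-robin over per-category LSI rankings.

        X : (n_docs, n_features) term-document matrix (e.g. tf-idf)
        y : (n_docs,) category labels
        """
        rankings = []
        for label in np.unique(y):
            # Stage 1: extract the main concept of each category with LSI
            # (here, a one-component truncated SVD on that category's docs).
            svd = TruncatedSVD(n_components=1)
            svd.fit(X[y == label])
            concept = svd.components_[0]  # a linear combination of features
            # Rank features by the magnitude of their coefficient in the concept.
            rankings.append(np.argsort(-np.abs(concept)))

        # Stage 2: round-robin over the per-category rankings to build the
        # bag-of-words, skipping features that were already selected.
        selected, seen, depth = [], set(), 0
        while len(selected) < n_words and depth < X.shape[1]:
            for ranking in rankings:
                f = ranking[depth]
                if f not in seen:
                    seen.add(f)
                    selected.append(f)
                    if len(selected) == n_words:
                        break
            depth += 1
        return np.array(selected)

    # Example usage (hypothetical data): keep the top 500 words.
    # idx = rrbwg_select(X_tfidf, labels, n_words=500)
    # X_reduced = X_tfidf[:, idx]

The round-robin pass gives every category an equal chance to contribute its highest-ranked words, which is what lets the resulting bag-of-words cover minority categories that a single global ranking might crowd out.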