Distance Variance Score: An Efficient Feature Selection Method in Text Classification
With the rapid development of web applications such as social networks, a large amount of electronic text data has accumulated and become available on the Internet, fueling growing interest in text mining. Text classification is one of the most important subfields of text mining. Text documents are typically represented as a high-dimensional, sparse document-term matrix (DTM) before classification, so feature selection is essential: an efficient feature selection method both reduces the dimensionality of the DTM and selects discriminative features for classification. Laplacian Score (LS) is an unsupervised feature selection method that has been used successfully in areas such as face recognition. However, LS fails to select discriminative features for text classification and does not effectively reduce the sparsity of the DTM. To address this, this paper proposes an unsupervised feature selection method named Distance Variance Score (DVS). DVS ranks the importance of features for text documents by their feature distance contribution (a ratio) so as to select discriminative features. Experimental results indicate that DVS selects discriminative features and reduces the sparsity of the DTM, making it considerably more efficient than LS.
Main Authors: | Heyong Wang, Ming Hong |
---|---|
Format: | Article |
Language: | English |
Published: | Hindawi Limited, 2015-01-01 |
Series: | Mathematical Problems in Engineering |
Online Access: | http://dx.doi.org/10.1155/2015/695720 |
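The abstract describes DVS as an unsupervised method that scores the terms of a sparse document-term matrix so that high-variance, discriminative terms can be kept and the rest dropped. The record does not reproduce the actual DVS formula, so the sketch below is only an illustration of the general idea using plain per-term variance on a hypothetical toy matrix; it is not the authors' method.

```python
from statistics import pvariance

# Hypothetical toy document-term matrix: rows are documents, columns are terms.
dtm = [
    [3, 0, 1, 0, 2],
    [2, 0, 0, 0, 3],
    [0, 4, 0, 1, 0],
    [0, 3, 0, 2, 0],
]

def variance_scores(matrix):
    """Score each term (column) by its population variance across documents.

    A term with higher variance tends to separate documents better than a
    term that is nearly constant. This is a stand-in for the paper's
    feature-distance-contribution ratio, which this record does not give.
    """
    return [pvariance(col) for col in zip(*matrix)]

scores = variance_scores(dtm)

# Unsupervised selection: keep the k highest-scoring term indices.
k = 3
top_terms = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
print(top_terms)  # → [1, 0, 4]
```

On a real corpus the DTM is large and sparse, so one would compute such scores from a sparse matrix representation rather than dense lists; the selection step (rank all terms, keep the top k) is the same either way.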
id: | doaj-d7068e267d7f4897b1c5ac640dfb0337 |
ISSN: | 1024-123X, 1563-5147 |
Author Affiliations: | Heyong Wang and Ming Hong, Department of E-Business, South China University of Technology, Guangzhou 510006, China |