Distance Variance Score: An Efficient Feature Selection Method in Text Classification

With the rapid development of web applications such as social networks, a large amount of electronic text data has accumulated and become available on the Internet, causing increasing interest in text mining. Text classification is one of the most important subfields of text mining. Text documents are typically represented as a high-dimensional, sparse document-term matrix (DTM) before classification, so feature selection is essential for text classification. An efficient feature selection method both reduces the dimensionality of the DTM and selects discriminative features for text classification. Laplacian Score (LS) is an unsupervised feature selection method that has been used successfully in areas such as face recognition. However, LS is unable to select discriminative features for text classification or to effectively reduce the sparsity of the DTM. To address this, this paper proposes an unsupervised feature selection method named Distance Variance Score (DVS). DVS uses feature distance contribution (a ratio) to rank the importance of features in text documents and thereby select discriminative features. Experimental results indicate that DVS selects discriminative features and reduces the sparsity of the DTM, and is therefore much more efficient than LS.
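The abstract describes DVS only at a high level (a ratio-based "feature distance contribution" used to rank the terms of a sparse DTM), and this record does not include the formula itself. The Python sketch below is therefore a hypothetical illustration of the general workflow the abstract describes: building a sparse document-term matrix and ranking its columns with a simple unsupervised score before keeping the top-k terms. The toy documents, the variance-based score, and the value of k are assumptions for illustration, not the authors' method.

```python
# Illustrative sketch only: the DVS formula is not given in this record, so the
# score below (variance of each term's column in the DTM) is a hypothetical
# stand-in for the paper's "feature distance contribution" ratio.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "feature selection reduces the dimensionality of text data",
    "text classification assigns documents to predefined categories",
    "sparse document term matrices make text mining expensive",
]

# Build the high-dimensional sparse DTM described in the abstract.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)      # scipy.sparse matrix, shape (n_docs, n_terms)
X = dtm.toarray().astype(float)

# Hypothetical unsupervised score: variance of each term across documents.
scores = X.var(axis=0)

# Keep the k highest-scoring terms, shrinking and densifying the DTM.
k = 5
top_idx = np.argsort(scores)[::-1][:k]
terms = np.array(vectorizer.get_feature_names_out())
print("selected terms:", terms[top_idx])
reduced_dtm = X[:, top_idx]               # reduced document-term matrix
```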


Bibliographic Details
Main Authors: Heyong Wang, Ming Hong (Department of E-Business, South China University of Technology, Guangzhou 510006, China)
Format: Article
Language: English
Published: Hindawi Limited, 2015-01-01
Series: Mathematical Problems in Engineering
ISSN: 1024-123X, 1563-5147
Online Access: http://dx.doi.org/10.1155/2015/695720