Distance Variance Score: An Efficient Feature Selection Method in Text Classification

With the rapid development of web applications such as social networks, a large amount of electronic text data has accumulated and become available on the Internet, causing increasing interest in text mining. Text classification is one of the most important subfields of text mining. Text documents are typically represented as a high-dimensional, sparse document-term matrix (DTM) before classification, so feature selection is essential for text classification. An efficient feature selection method both reduces the dimensionality of the DTM and selects discriminative features for text classification. Laplacian Score (LS) is an unsupervised feature selection method that has been used successfully in areas such as face recognition. However, LS is unable to select discriminative features for text classification or to effectively reduce the sparsity of the DTM. To address this, this paper proposes an unsupervised feature selection method named Distance Variance Score (DVS). DVS uses feature distance contribution (a ratio) to rank the importance of features in text documents and thereby select discriminative features. Experimental results indicate that DVS selects discriminative features and reduces the sparsity of the DTM, and is therefore much more efficient than LS.
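The abstract describes DVS only at a high level (a ratio-based "feature distance contribution" used to rank the terms of a sparse DTM), and this record does not include the formula itself. The Python sketch below is therefore a hypothetical illustration of the general workflow the abstract describes: building a sparse document-term matrix and ranking its columns with a simple unsupervised score before keeping the top-k terms. The toy documents, the variance-based score, and the value of k are assumptions for illustration, not the authors' method.

```python
# Illustrative sketch only: the DVS formula is not given in this record, so the
# score below (variance of each term's column in the DTM) is a hypothetical
# stand-in for the paper's "feature distance contribution" ratio.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "feature selection reduces the dimensionality of text data",
    "text classification assigns documents to predefined categories",
    "sparse document term matrices make text mining expensive",
]

# Build the high-dimensional sparse DTM described in the abstract.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)      # scipy.sparse matrix, shape (n_docs, n_terms)
X = dtm.toarray().astype(float)

# Hypothetical unsupervised score: variance of each term across documents.
scores = X.var(axis=0)

# Keep the k highest-scoring terms, shrinking and densifying the DTM.
k = 5
top_idx = np.argsort(scores)[::-1][:k]
terms = np.array(vectorizer.get_feature_names_out())
print("selected terms:", terms[top_idx])
reduced_dtm = X[:, top_idx]               # reduced document-term matrix
```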


Bibliographic Details
Main Authors: Heyong Wang, Ming Hong (Department of E-Business, South China University of Technology, Guangzhou 510006, China)
Format: Article
Language: English
Published: Hindawi Limited, 2015-01-01
Series: Mathematical Problems in Engineering
ISSN: 1024-123X, 1563-5147
Online Access: http://dx.doi.org/10.1155/2015/695720