A feature selection method based on synonym merging in text classification system
Abstract: As an important step in natural language processing (NLP), text classification has been widely used in many fields, such as spam filtering, news classification, and web page detection. The vector space model (VSM) is generally used to extract feature vectors for representing texts, which is...
Main Authors: | Haipeng Yao, Chong Liu, Peiying Zhang, Luyao Wang |
---|---|
Format: | Article |
Language: | English |
Published: | SpringerOpen, 2017-10-01 |
Series: | EURASIP Journal on Wireless Communications and Networking |
Subjects: | Text classification; Feature selection; Synonym merging; Feature weights calculation |
Online Access: | http://link.springer.com/article/10.1186/s13638-017-0950-z |
id |
doaj-0625f7043b8d48bf8ce0ebd59de8c69f |
---|---|
record_format |
Article |
spelling |
doaj-0625f7043b8d48bf8ce0ebd59de8c69f; 2020-11-25T01:03:11Z; eng; SpringerOpen; EURASIP Journal on Wireless Communications and Networking; 1687-1499; 2017-10-01; 2017118; 10.1186/s13638-017-0950-z; A feature selection method based on synonym merging in text classification system; Haipeng Yao (0), Chong Liu (1), Peiying Zhang (2), Luyao Wang (3); (0) State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; (1) State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; (2) State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications; (3) Advanced Innovation Center for Future Internet Technology, Beijing University of Technology; Abstract: As an important step in natural language processing (NLP), text classification has been widely used in many fields, such as spam filtering, news classification, and web page detection. The vector space model (VSM) is generally used to extract feature vectors for representing texts, which is very important for text classification. In this paper, a feature selection algorithm based on synonym merging, named SM-CHI, is proposed. The improved CHI formula and synonym merging are used to select feature words so that classification accuracy can be improved and the feature dimension can be reduced. In addition, for the feature words selected by SM-CHI, this paper presents three weight calculation algorithms to explore the best feature weight update method. Finally, we designed three comparative experiments and showed that classification accuracy is highest when the improved CHI formula 2 is chosen, the threshold α is set to 0.8, and the largest weight among the synonyms is used to update the feature weight.; http://link.springer.com/article/10.1186/s13638-017-0950-z; Text classification; Feature selection; Synonym merging; Feature weights calculation |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Haipeng Yao, Chong Liu, Peiying Zhang, Luyao Wang |
author_facet |
Haipeng Yao, Chong Liu, Peiying Zhang, Luyao Wang |
author_sort |
Haipeng Yao |
title |
A feature selection method based on synonym merging in text classification system |
publisher |
SpringerOpen |
series |
EURASIP Journal on Wireless Communications and Networking |
issn |
1687-1499 |
publishDate |
2017-10-01 |
description |
Abstract: As an important step in natural language processing (NLP), text classification has been widely used in many fields, such as spam filtering, news classification, and web page detection. The vector space model (VSM) is generally used to extract feature vectors for representing texts, which is very important for text classification. In this paper, a feature selection algorithm based on synonym merging, named SM-CHI, is proposed. The improved CHI formula and synonym merging are used to select feature words so that classification accuracy can be improved and the feature dimension can be reduced. In addition, for the feature words selected by SM-CHI, this paper presents three weight calculation algorithms to explore the best feature weight update method. Finally, we designed three comparative experiments and showed that classification accuracy is highest when the improved CHI formula 2 is chosen, the threshold α is set to 0.8, and the largest weight among the synonyms is used to update the feature weight. |
topic |
Text classification; Feature selection; Synonym merging; Feature weights calculation |
url |
http://link.springer.com/article/10.1186/s13638-017-0950-z |
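
The abstract describes three ingredients: a CHI score for ranking feature words, synonym merging controlled by a threshold α, and a weight update that takes the largest weight among the merged synonyms. The record does not reproduce the paper's improved CHI formulas or its similarity measure, so the sketch below is only an illustration under stated assumptions: it uses the standard chi-square statistic, treats α as a minimum similarity for merging, and relies on a hypothetical `toy_similarity` lookup and made-up weights. The function names are illustrative and are not taken from the paper.

```python
ALPHA = 0.8  # threshold value reported in the abstract; its role as a similarity cutoff is an assumption


def chi_square(a, b, c, d):
    """Standard chi-square score of a term for one class (not the paper's improved formula).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def merge_synonyms(terms, similarity, alpha=ALPHA):
    """Greedily group terms: a term joins the first group whose representative
    (the group's first member) has similarity >= alpha, else it starts a new group."""
    groups = []
    for term in terms:
        for group in groups:
            if similarity(term, group[0]) >= alpha:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups


def merged_weights(groups, weights):
    """Assign each group the largest weight among its members, keyed by the representative."""
    return {group[0]: max(weights[t] for t in group) for group in groups}


# Hypothetical similarity table and weights, for illustration only.
TOY_SIMILARITY = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "bus")): 0.5,
}


def toy_similarity(x, y):
    if x == y:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((x, y)), 0.0)


weights = {"car": 0.4, "automobile": 0.7, "bus": 0.2}
groups = merge_synonyms(["car", "automobile", "bus"], toy_similarity)
features = merged_weights(groups, weights)
```

With these toy inputs, "car" and "automobile" fall into one group and share the larger of their two weights, while "bus" stays a separate feature, which is how merging reduces the feature dimension. A real system would replace `toy_similarity` with a thesaurus or embedding-based measure.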