A feature selection method based on synonym merging in text classification system

Abstract As an important step in natural language processing (NLP), text classification has been widely used in many fields, such as spam filtering, news classification, and web page detection. The vector space model (VSM) is generally used to extract feature vectors for representing texts, which is...


Bibliographic Details
Main Authors: Haipeng Yao, Chong Liu, Peiying Zhang, Luyao Wang
Format: Article
Language: English
Published: SpringerOpen 2017-10-01
Series: EURASIP Journal on Wireless Communications and Networking
Subjects: Text classification; Feature selection; Synonym merging; Feature weights calculation
Online Access: http://link.springer.com/article/10.1186/s13638-017-0950-z
id doaj-0625f7043b8d48bf8ce0ebd59de8c69f
record_format Article
spelling doaj-0625f7043b8d48bf8ce0ebd59de8c69f
2020-11-25T01:03:11Z
eng
SpringerOpen
EURASIP Journal on Wireless Communications and Networking
1687-1499
2017-10-01
2017118
10.1186/s13638-017-0950-z
A feature selection method based on synonym merging in text classification system
Haipeng Yao (0)
Chong Liu (1)
Peiying Zhang (2)
Luyao Wang (3)
(0) State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
(1) State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
(2) State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
(3) Advanced Innovation Center for Future Internet Technology, Beijing University of Technology
collection DOAJ
language English
format Article
sources DOAJ
author Haipeng Yao
Chong Liu
Peiying Zhang
Luyao Wang
title A feature selection method based on synonym merging in text classification system
publisher SpringerOpen
series EURASIP Journal on Wireless Communications and Networking
issn 1687-1499
publishDate 2017-10-01
description Abstract As an important step in natural language processing (NLP), text classification has been widely used in many fields, such as spam filtering, news classification, and web page detection. The vector space model (VSM) is generally used to extract feature vectors that represent texts, a step that is very important for text classification. This paper proposes SM-CHI, a feature selection algorithm based on synonym merging. An improved CHI formula and synonym merging are used to select feature words, so that classification accuracy improves and the feature dimension is reduced. In addition, for the feature words selected by SM-CHI, the paper presents three weight calculation algorithms to explore the best method for updating feature weights. Finally, three comparative experiments show that classification accuracy is highest when the improved CHI formula 2 is chosen, the threshold α is set to 0.8, and the largest weight among the synonyms is used to update the feature weight.
topic Text classification
Feature selection
Synonym merging
Feature weights calculation
url http://link.springer.com/article/10.1186/s13638-017-0950-z
work_keys_str_mv AT haipengyao afeatureselectionmethodbasedonsynonymmergingintextclassificationsystem
AT chongliu afeatureselectionmethodbasedonsynonymmergingintextclassificationsystem
AT peiyingzhang afeatureselectionmethodbasedonsynonymmergingintextclassificationsystem
AT luyaowang afeatureselectionmethodbasedonsynonymmergingintextclassificationsystem
AT haipengyao featureselectionmethodbasedonsynonymmergingintextclassificationsystem
AT chongliu featureselectionmethodbasedonsynonymmergingintextclassificationsystem
AT peiyingzhang featureselectionmethodbasedonsynonymmergingintextclassificationsystem
AT luyaowang featureselectionmethodbasedonsynonymmergingintextclassificationsystem
_version_ 1725201918773428224
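
The abstract's pipeline (CHI scoring, merging synonyms above a threshold α, and keeping the largest weight among merged synonyms) can be illustrated with a rough sketch. This record does not reproduce the paper's improved CHI formulas, so the code below falls back on the classic chi-square statistic. The greedy grouping strategy, the function names, and the toy similarity scores are all assumptions made for illustration, not the authors' implementation.

```python
def chi_square(a, b, c, d):
    """Classic chi-square statistic for one (term, class) pair.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    NOTE: this is the textbook formula, not the paper's improved CHI.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def merge_synonyms(weights, similarity, alpha=0.8):
    """Greedily merge words whose similarity to a kept word is >= alpha.

    Words are visited in descending weight order, so each group's
    representative carries the largest weight among its synonyms.
    Returns (merged_weights, groups).
    """
    merged, groups = {}, {}
    for word in sorted(weights, key=weights.get, reverse=True):
        for rep in merged:
            if similarity(word, rep) >= alpha:
                groups[rep].append(word)  # absorbed into an existing feature
                break
        else:
            merged[word] = weights[word]  # becomes a new feature
            groups[word] = [word]
    return merged, groups


# Hypothetical similarity scores; a real system would use a thesaurus
# or a word-similarity measure instead of this lookup table.
TOY_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "bus")): 0.5,
}


def toy_similarity(x, y):
    return 1.0 if x == y else TOY_SIM.get(frozenset((x, y)), 0.0)


# Made-up document counts for three candidate feature words.
weights = {
    "car": chi_square(40, 10, 10, 40),
    "automobile": chi_square(30, 10, 20, 40),
    "bus": chi_square(10, 30, 40, 20),
}
merged, groups = merge_synonyms(weights, toy_similarity, alpha=0.8)
print(merged)
print(groups)
```

In this toy run, "automobile" is absorbed into "car" because their assumed similarity (0.9) exceeds α = 0.8, and the merged feature keeps the larger of the two scores; "bus" stays a separate feature. The feature dimension drops from three to two, which is the effect the abstract attributes to synonym merging.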