Chinese text classification based on improved <i>K</i> Nearest Neighbor algorithm

This paper focuses on the high-dimensional text problems encountered in text classification. A document frequency (DF)-chi-square statistic feature extraction method is proposed to reduce the number of feature items and the dimension of the text. Based on the <i>K</i> Nearest Neighbor (KNN) algorithm...
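
As a rough illustration of the DF-chi-square idea named in the abstract, the following minimal sketch first drops rare terms by document frequency and then ranks the survivors by their chi-square score. The function name `df_chi2_select`, the thresholds `min_df` and `top_k`, the toy corpus, and the choice to take the maximum score over classes are all assumptions made for this example; the abstract does not specify how the authors combine the two statistics.

```python
from collections import Counter


def df_chi2_select(docs, labels, min_df=2, top_k=3):
    """Hypothetical helper: DF filtering followed by chi-square ranking.

    docs   -- list of token lists (already segmented text)
    labels -- class name for each document, in the same order
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    df = Counter(t for s in doc_sets for t in s)

    # Step 1 (DF): keep only terms that occur in at least min_df documents.
    candidates = [t for t, count in df.items() if count >= min_df]

    class_sizes = Counter(labels)
    scores = {}
    for term in candidates:
        best = 0.0
        for cls, n_cls in class_sizes.items():
            # 2x2 contingency table for (term present?, document in cls?)
            a = sum(1 for s, y in zip(doc_sets, labels) if y == cls and term in s)
            b = df[term] - a        # term present, other classes
            c = n_cls - a           # term absent, this class
            d = n - n_cls - b       # term absent, other classes
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
            best = max(best, chi2)  # assumption: score a term by its best class
        scores[term] = best

    # Step 2 (chi-square): highest score first, ties broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k], scores


# Toy corpus, invented for illustration only.
docs = [
    ["football", "match", "win"],
    ["football", "team", "win"],
    ["stock", "market", "win"],
    ["stock", "bank", "market"],
]
labels = ["sports", "sports", "finance", "finance"]
selected, scores = df_chi2_select(docs, labels)
```

In this toy corpus, "win" passes the DF filter but scores low because it appears in both classes, while single-occurrence terms such as "match" are removed before any chi-square score is computed.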


Bibliographic Details
Main Authors: HUANG Chao, CHEN Junhua
Format: Article
Language: English
Published: Academic Journals Center of Shanghai Normal University 2019-02-01
Series: Journal of Shanghai Normal University (Natural Sciences)
Subjects: text classification; <i>K</i> Nearest Neighbor (KNN) algorithm; feature extraction; similarity
Online Access: http://qktg.shnu.edu.cn/zrb/shsfqkszrb/ch/reader/view_abstract.aspx?file_no=20190117
id doaj-c002127def1546bd97b6579e91db6a9a
record_format Article
spelling doaj-c002127def1546bd97b6579e91db6a9a; 2020-11-25T01:19:30Z; eng; Academic Journals Center of Shanghai Normal University; Journal of Shanghai Normal University (Natural Sciences); ISSN 1000-5137; 2019-02-01; vol. 48, no. 1, pp. 96-101; DOI 10.3969/J.ISSN.1000-5137.2019.01.017; 201901000017; HUANG Chao, CHEN Junhua (both: College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China)
collection DOAJ
language English
format Article
sources DOAJ
author HUANG Chao
CHEN Junhua
spellingShingle HUANG Chao
CHEN Junhua
Chinese text classification based on improved <i>K</i> Nearest Neighbor algorithm
Journal of Shanghai Normal University (Natural Sciences)
text classification; <i>K</i> Nearest Neighbor (KNN) algorithm; feature extraction; similarity
author_facet HUANG Chao
CHEN Junhua
author_sort HUANG Chao
title Chinese text classification based on improved <i>K</i> Nearest Neighbor algorithm
title_short Chinese text classification based on improved <i>K</i> Nearest Neighbor algorithm
title_full Chinese text classification based on improved <i>K</i> Nearest Neighbor algorithm
title_fullStr Chinese text classification based on improved <i>K</i> Nearest Neighbor algorithm
title_full_unstemmed Chinese text classification based on improved <i>K</i> Nearest Neighbor algorithm
title_sort chinese text classification based on improved <i>k</i> nearest neighbor algorithm
publisher Academic Journals Center of Shanghai Normal University
series Journal of Shanghai Normal University (Natural Sciences)
issn 1000-5137
1000-5137
publishDate 2019-02-01
description This paper focuses on the high-dimensional text problems encountered in text classification. A document frequency (DF)-chi-square statistic feature extraction method is proposed to reduce the number of feature items and the dimension of the text. A standard <i>K</i> Nearest Neighbor (KNN) classifier must compute the similarity between the text to be classified and a large number of training samples; to address this, a KNN algorithm based on grouping center vectors is proposed. The samples within each category are divided into groups and a center vector is computed for each group, which improves the classification performance of the algorithm. Experiments show that the improved algorithm achieves higher precision, recall and <i>F</i>-measure than the traditional KNN algorithm, and that it also has advantages over other classification algorithms.
topic text classification; <i>K</i> Nearest Neighbor (KNN) algorithm; feature extraction; similarity
url http://qktg.shnu.edu.cn/zrb/shsfqkszrb/ch/reader/view_abstract.aspx?file_no=20190117
work_keys_str_mv AT huangchao chinesetextclassificationbasedonimprovedikinearestneighboralgorithm
AT chenjunhua chinesetextclassificationbasedonimprovedikinearestneighboralgorithm
_version_ 1725137918571315200
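
The grouping-center-vector idea summarized in the description field can be sketched as follows: the training vectors of each category are split into groups, each group is replaced by its mean vector, and a new text is compared only against these centers rather than against every training sample. This is a minimal sketch under stated assumptions, not the authors' implementation. The record does not say how groups are formed, so fixed-size consecutive chunks are assumed here; cosine similarity, similarity-weighted voting, the names `group_centers` and `classify`, and the toy vectors are likewise hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_centers(vectors, labels, group_size=2):
    """Replace each group of same-class vectors by its mean (center) vector.

    Assumption: groups are consecutive chunks of group_size samples per class.
    """
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)

    centers = []
    for label, vecs in by_class.items():
        for start in range(0, len(vecs), group_size):
            group = vecs[start:start + group_size]
            center = [sum(column) / len(group) for column in zip(*group)]
            centers.append((center, label))
    return centers


def classify(x, centers, k=3):
    """Vote among the k most similar centers, weighted by similarity."""
    nearest = sorted(centers, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for center, label in nearest:
        votes[label] += cosine(x, center)
    return max(votes, key=votes.get)


# Toy feature vectors, invented for illustration only.
train = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.0, 0.2], [1.0, 0.2, 0.0],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, 0.8, 0.2], [0.2, 1.0, 0.0],
]
train_labels = ["sports"] * 4 + ["finance"] * 4
centers = group_centers(train, train_labels)
prediction = classify([0.7, 0.1, 0.1], centers)
```

With a group size of 2, the eight training vectors collapse to four centers, so each query needs four similarity computations instead of eight; the saving grows with the size of the training set.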