Text Classification Using Novel Term Weighting Scheme-Based Improved TF-IDF for Internet Media Reports

With the rapid development of internet technology, large amounts of internet text data have become available. Text classification (TC) plays an important role in processing massive text data, but classification accuracy is directly affected by the performance of term weighti...

Bibliographic Details
Main Authors: Zhiying Jiang, Bo Gao, Yanlin He, Yongming Han, Paul Doyle, Qunxiong Zhu
Format: Article
Language: English
Published: Hindawi Limited 2021-01-01
Series: Mathematical Problems in Engineering
Online Access: http://dx.doi.org/10.1155/2021/6619088
id doaj-922f40a7018a44f2adf65a150e186cc1
record_format Article
spelling doaj-922f40a7018a44f2adf65a150e186cc1
2021-03-15T00:01:20Z
eng
Hindawi Limited
Mathematical Problems in Engineering
1563-5147
2021-01-01
2021
10.1155/2021/6619088
Text Classification Using Novel Term Weighting Scheme-Based Improved TF-IDF for Internet Media Reports
Zhiying Jiang, College of Information Science & Technology
Bo Gao, College of Information Science & Technology
Yanlin He, College of Information Science & Technology
Yongming Han, College of Information Science & Technology
Paul Doyle, School of Computer Science within the College of Science and Health
Qunxiong Zhu, College of Information Science & Technology
http://dx.doi.org/10.1155/2021/6619088
collection DOAJ
language English
format Article
sources DOAJ
author Zhiying Jiang
Bo Gao
Yanlin He
Yongming Han
Paul Doyle
Qunxiong Zhu
publisher Hindawi Limited
series Mathematical Problems in Engineering
issn 1563-5147
publishDate 2021-01-01
description With the rapid development of internet technology, large amounts of internet text data have become available. Text classification (TC) plays an important role in processing massive text data, but classification accuracy is directly affected by the performance of the term weighting used in TC. Because it was originally designed for information retrieval (IR), term frequency-inverse document frequency (TF-IDF) is not effective enough for TC, especially for text data with unbalanced distributions, as found in internet media reports. Therefore, the variance between the DF value of a particular term and the average of all DFs, $\overline{DF}$, namely the document frequency variance (ADF), is proposed to improve the handling of text data with unbalanced distributions. The standard TF-IDF is then modified by the proposed ADF in four different ways for processing unbalanced text collections, namely TF-IADF, TF-IADF+, TF-IADFnorm, and TF-IADF+norm. As a result, an effective model can be established for the TC task on internet media reports. A series of simulations was carried out to evaluate the proposed methods. Compared with TF-IDF across state-of-the-art classification algorithms, the simulation results confirm the effectiveness and feasibility of the proposed methods.
url http://dx.doi.org/10.1155/2021/6619088
_version_ 1714785231340306432
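
The description in this record names two quantities: the standard TF-IDF weight and the deviation of a term's document frequency (DF) from the average DF over all terms. The record does not give the paper's ADF formula or the definitions of the four TF-IADF variants, so the sketch below is only an illustration under stated assumptions: it uses the common TF-IDF form (raw count times log(N / DF)) and a plain squared deviation from the mean DF. The function names and the toy corpus are hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def tf_idf(docs):
    """Common TF-IDF form: raw term count times log(N / DF)."""
    n_docs = len(docs)
    df = document_frequencies(docs)
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def df_squared_deviation(docs):
    """Assumption, not the paper's ADF: squared deviation of DF from mean DF."""
    df = document_frequencies(docs)
    mean_df = sum(df.values()) / len(df)
    return {t: (df[t] - mean_df) ** 2 for t in df}


# Hypothetical toy corpus of tokenized media reports.
docs = [
    ["market", "stocks", "rise"],
    ["market", "election", "vote"],
    ["market", "stocks", "fall"],
]
weights = tf_idf(docs)
deviation = df_squared_deviation(docs)
```

In this toy corpus "market" occurs in every document, so its TF-IDF weight is zero, while its DF lies furthest from the mean DF. How the article combines such a deviation with TF is not stated in this record.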