Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet

Master's thesis === Shu-Te University === Master's Program, Department of Information Management === 103 === Text mining often uses the TF-IDF (term frequency-inverse document frequency) technique to preprocess documents because it is commonly held that “if a word has a high frequency in a document and doesn't appear in too many other documents, then the word has a...
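
The TF and TF-IDF weightings that the abstract compares can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the unsmoothed log(N / df) form of IDF, and the three-review toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf(doc_tokens):
    """Relative term frequency of each term within one document."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: count / total for term, count in counts.items()}

def idf(corpus):
    """Inverse document frequency, log(N / df), for every term in the corpus."""
    n_docs = len(corpus)
    # Count each term once per document in which it occurs.
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / d) for term, d in df.items()}

def tf_idf(corpus):
    """TF-IDF weight of each term in each document."""
    weights = idf(corpus)
    return [{t: f * weights[t] for t, f in tf(doc).items()} for doc in corpus]

# Hypothetical toy corpus of very short reviews (not data from the thesis).
corpus = [
    "great movie great cast".split(),
    "boring movie".split(),
    "great soundtrack".split(),
]

tf_weights = [tf(doc) for doc in corpus]
tfidf_weights = tf_idf(corpus)
```

In this toy corpus every term that occurs in a single review receives the largest possible IDF, log 3, regardless of whether it carries sentiment. That is the short-text effect the abstract questions.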

Bibliographic Details
Main Authors:	Tsong-ming Chen, 陳聰敏
Other Authors:	Shing-Hwang Tung
Format:	Others
Language:	zh-TW
Published:	2015
Online Access:	http://ndltd.ncl.edu.tw/handle/14506487016496821137

id	ndltd-TW-103STU05396006
record_format	oai_dc
spelling	ndltd-TW-103STU05396006 2016-09-25T04:05:00Z http://ndltd.ncl.edu.tw/handle/14506487016496821137 Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet 以SentiWordNet為基礎比較TF與TF-IDF在電影評論分類結果的差異 Tsong-ming Chen 陳聰敏 Master's thesis, Shu-Te University, Master's Program, Department of Information Management, 103 Text mining often uses the TF-IDF (term frequency-inverse document frequency) technique to preprocess documents because it is commonly held that “if a word has a high frequency in a document and doesn't appear in too many other documents, then the word has good discriminatory power for text classification”. Common words can therefore be filtered out during text mining. Blog articles are of substantial length, so a word that rarely appears in other articles does distinguish text of different categories well. Today, many reviews are posted on microblog sites such as Twitter or Plurk, which often restrict a post to a maximum of 140 characters. In such short texts, most words naturally appear in only a few posts, which gives them a high IDF. Is it still reasonable to emphasize the IDF factor when mining a corpus of short texts? To answer this question, we set up experiments to detect the sentiment of movie reviews using SentiWordNet. Two classification algorithms (naïve Bayes and decision tree) were applied to learn and predict the polarity of a movie review. TF-IDF features were found to perform no better than TF features. Shing-Hwang Tung 董信煌 2015 學位論文 ; thesis 41 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's thesis === Shu-Te University === Master's Program, Department of Information Management === 103 === Text mining often uses the TF-IDF (term frequency-inverse document frequency) technique to preprocess documents because it is commonly held that “if a word has a high frequency in a document and doesn't appear in too many other documents, then the word has good discriminatory power for text classification”. Common words can therefore be filtered out during text mining. Blog articles are of substantial length, so a word that rarely appears in other articles does distinguish text of different categories well. Today, many reviews are posted on microblog sites such as Twitter or Plurk, which often restrict a post to a maximum of 140 characters. In such short texts, most words naturally appear in only a few posts, which gives them a high IDF. Is it still reasonable to emphasize the IDF factor when mining a corpus of short texts? To answer this question, we set up experiments to detect the sentiment of movie reviews using SentiWordNet. Two classification algorithms (naïve Bayes and decision tree) were applied to learn and predict the polarity of a movie review. TF-IDF features were found to perform no better than TF features.
author2	Shing-Hwang Tung
author_facet	Shing-Hwang Tung Tsong-ming Chen 陳聰敏
author	Tsong-ming Chen 陳聰敏
spellingShingle	Tsong-ming Chen 陳聰敏 Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet
author_sort	Tsong-ming Chen
title	Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet
title_sort	movie opinion classification: an empirical comparison between using tf and using tf-idf base on sentiwordnet
publishDate	2015
url	http://ndltd.ncl.edu.tw/handle/14506487016496821137

Similar Items