Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet

Master's === Shu-Te University === Master's Program, Department of Information Management === 103 === Text mining often uses the TF-IDF (term frequency-inverse document frequency) technique to preprocess documents because it is commonly held that “if a word has a high frequency in a document and doesn’t appear in too many other documents, then the word has a...

Bibliographic Details
Main Authors: Tsong-ming Chen, 陳聰敏
Other Authors: Shing-Hwang Tung
Format: Others
Language: zh-TW
Published: 2015
Online Access: http://ndltd.ncl.edu.tw/handle/14506487016496821137
id ndltd-TW-103STU05396006
record_format oai_dc
spelling ndltd-TW-103STU05396006 2016-09-25T04:05:00Z http://ndltd.ncl.edu.tw/handle/14506487016496821137 Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet 以SentiWordNet為基礎比較TF與TF-IDF在電影評論分類結果的差異 Tsong-ming Chen 陳聰敏 Master's Shu-Te University Master's Program, Department of Information Management 103 Text mining often uses the TF-IDF (term frequency-inverse document frequency) technique to preprocess documents because it is commonly held that “if a word has a high frequency in a document and doesn’t appear in too many other documents, then the word has good discriminatory power for text classification”. Common words can therefore be filtered out during text mining. Blog articles contain text of substantial length, so a word that rarely appears in other articles does have a good capability to distinguish text of different categories. Today, however, many reviews are written on microblog sites such as Twitter or Plurk, which often restrict a post to a maximum of 140 characters. In such short posts, most words naturally appear in only a few documents, so nearly every word has a high IDF. Is it still reasonable to emphasize the IDF factor when mining a corpus of short texts? To answer this question, we set up experiments to detect the sentiment of movie reviews using SentiWordNet. Two classification algorithms (naïve Bayes and decision tree) were applied to learn and predict the polarity of a movie review. We found that TF-IDF features performed no better than TF features. Shing-Hwang Tung 董信煌 2015 thesis 41 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === Shu-Te University === Master's Program, Department of Information Management === 103 === Text mining often uses the TF-IDF (term frequency-inverse document frequency) technique to preprocess documents because it is commonly held that “if a word has a high frequency in a document and doesn’t appear in too many other documents, then the word has good discriminatory power for text classification”. Common words can therefore be filtered out during text mining. Blog articles contain text of substantial length, so a word that rarely appears in other articles does have a good capability to distinguish text of different categories. Today, however, many reviews are written on microblog sites such as Twitter or Plurk, which often restrict a post to a maximum of 140 characters. In such short posts, most words naturally appear in only a few documents, so nearly every word has a high IDF. Is it still reasonable to emphasize the IDF factor when mining a corpus of short texts? To answer this question, we set up experiments to detect the sentiment of movie reviews using SentiWordNet. Two classification algorithms (naïve Bayes and decision tree) were applied to learn and predict the polarity of a movie review. We found that TF-IDF features performed no better than TF features.
author2 Shing-Hwang Tung
author_facet Shing-Hwang Tung
Tsong-ming Chen
陳聰敏
author Tsong-ming Chen
陳聰敏
spellingShingle Tsong-ming Chen
陳聰敏
Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet
author_sort Tsong-ming Chen
title Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet
publishDate 2015
url http://ndltd.ncl.edu.tw/handle/14506487016496821137
_version_ 1718385542886850560
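
The abstract's point about short texts can be illustrated with a minimal sketch. It assumes whitespace tokenization, length-normalized term frequency, and the common IDF variant log(N / df); the record does not state which weighting the thesis actually used. The four sample reviews and all function names are hypothetical, and the sketch covers only the feature weighting, not SentiWordNet scoring or the classifiers.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def term_frequencies(doc_tokens):
    """TF: count of each term divided by the document length."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequencies(corpus_tokens):
    """IDF: log(N / df), where df is the number of documents containing the term."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(doc_tokens, idf):
    """TF-IDF: term frequency scaled by the corpus-level IDF."""
    tf = term_frequencies(doc_tokens)
    return {term: weight * idf[term] for term, weight in tf.items()}


# Hypothetical microblog-length reviews, invented for illustration only.
reviews = [
    "great movie great cast",
    "boring plot weak cast",
    "great soundtrack",
    "weak ending",
]
corpus = [tokenize(r) for r in reviews]
idf = inverse_document_frequencies(corpus)
tf_first = term_frequencies(corpus[0])
tfidf_first = tf_idf(corpus[0], idf)

for term in sorted(tf_first):
    print(f"{term:10s} tf={tf_first[term]:.3f} idf={idf[term]:.3f} "
          f"tfidf={tfidf_first[term]:.3f}")
```

In this toy corpus every term occurs in at most two of the four documents, so no term receives a low IDF. That is the situation the abstract describes for short texts, where IDF offers little basis for separating common words from distinctive ones.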