Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet
Master's thesis === Shu-Te University === Master's Program, Department of Information Management === 103 === Text mining often uses the TF-IDF (term frequency-inverse document frequency) technique to preprocess documents because of the common understanding that “if a word has a high frequency in a document and doesn't appear in too many other documents, then the word has a...
Main Authors: | Tsong-ming Chen, 陳聰敏 |
---|---|
Other Authors: | Shing-Hwang Tung |
Format: | Others |
Language: | zh-TW |
Published: | 2015 |
Online Access: | http://ndltd.ncl.edu.tw/handle/14506487016496821137 |
id |
ndltd-TW-103STU05396006 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-103STU05396006 2016-09-25T04:05:00Z http://ndltd.ncl.edu.tw/handle/14506487016496821137 Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet 以SentiWordNet為基礎比較TF與TF-IDF在電影評論分類結果的差異 Tsong-ming Chen 陳聰敏 Master's thesis Shu-Te University Master's Program, Department of Information Management 103 Text mining often uses the TF-IDF (term frequency-inverse document frequency) technique to preprocess documents because of the common understanding that “if a word has a high frequency in a document and doesn't appear in too many other documents, then the word has good discriminatory power for text classification”. Therefore, common words can be filtered out during text mining. Blog articles are of substantial length. When a word rarely appears in other articles, it does have a good ability to distinguish text of different categories. Today, many reviews are posted on microblog sites such as Twitter or Plurk, which often restrict a post to a maximum of 140 characters. With such short texts, most words naturally appear in only a few posts, producing a high IDF. Is it still reasonable to emphasize the IDF factor when mining a corpus of short texts? To answer this question, we set up experiments to detect the sentiment of movie reviews using SentiWordNet. Two classification algorithms (naïve Bayes and decision tree) were applied to learn and predict the polarity of a movie review. TF-IDF features were found to perform no better than TF features. Shing-Hwang Tung 董信煌 2015 thesis 41 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's thesis === Shu-Te University === Master's Program, Department of Information Management === 103 === Text mining often uses the TF-IDF (term frequency-inverse document frequency) technique to preprocess documents because of the common understanding that “if a word has a high frequency in a document and doesn't appear in too many other documents, then the word has good discriminatory power for text classification”. Therefore, common words can be filtered out during text mining.
Blog articles are of substantial length. When a word rarely appears in other articles, it does have a good ability to distinguish text of different categories. Today, many reviews are posted on microblog sites such as Twitter or Plurk, which often restrict a post to a maximum of 140 characters. With such short texts, most words naturally appear in only a few posts, producing a high IDF. Is it still reasonable to emphasize the IDF factor when mining a corpus of short texts?
To answer this question, we set up experiments to detect the sentiment of movie reviews using SentiWordNet. Two classification algorithms (naïve Bayes and decision tree) were applied to learn and predict the polarity of a movie review. TF-IDF features were found to perform no better than TF features.
|
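The abstract's concern about short texts can be illustrated with a small sketch. The record does not state the exact weighting the thesis used, so the formulas here are assumptions: TF is the relative frequency of a term in a document, and IDF is the unsmoothed natural log of (number of documents / number of documents containing the term). The three reviews and all function names are hypothetical, made up only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def term_frequencies(doc_tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequencies(corpus_tokens):
    """Unsmoothed IDF: ln(N / df), where df counts documents containing the term."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(doc_tokens, idf):
    """Multiply each term's TF by its corpus-level IDF."""
    tf = term_frequencies(doc_tokens)
    return {term: weight * idf[term] for term, weight in tf.items()}


# Hypothetical microblog-length reviews, not data from the thesis.
reviews = [
    "great movie great cast",
    "boring movie weak plot",
    "great plot",
]
corpus = [tokenize(r) for r in reviews]
idf = inverse_document_frequencies(corpus)

tf_first = term_frequencies(corpus[0])
tfidf_first = tf_idf(corpus[0], idf)

for term in sorted(tf_first):
    print(f"{term:6s} tf={tf_first[term]:.3f} tfidf={tfidf_first[term]:.3f}")
```

In this toy corpus the opinion word "great" has the highest TF in the first review, yet under TF-IDF it is outweighed by "cast", which simply happens to occur in only one short text. This is the kind of effect the abstract questions; it says nothing about the thesis's actual measurements.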
author2 |
Shing-Hwang Tung |
author_facet |
Shing-Hwang Tung Tsong-ming Chen 陳聰敏 |
author |
Tsong-ming Chen 陳聰敏 |
spellingShingle |
Tsong-ming Chen 陳聰敏 Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet |
author_sort |
Tsong-ming Chen |
title |
Movie Opinion Classification: An empirical comparison between using TF and using TF-IDF base on SentiWordNet |
publishDate |
2015 |
url |
http://ndltd.ncl.edu.tw/handle/14506487016496821137 |