Extraction of semantic annotation document using text mining techniques in cloud computing environment

Master's === National Chengchi University === Graduate Institute of Management Information Systems === 98 === Businesses that perform data mining and text mining today need to handle large-scale datasets. The computational resources of their servers are often limited and too inefficient for analytical jobs. But if they could run their data mining jobs on cloud comput...

Bibliographic Details
Main Author: 黃孝文
Other Authors: 楊建民
Format: Others
Language: zh-TW
Published: 2010
Online Access: http://ndltd.ncl.edu.tw/handle/02722318586696816814
id ndltd-TW-098NCCU5396017
record_format oai_dc
spelling ndltd-TW-098NCCU5396017 2015-10-13T18:16:14Z http://ndltd.ncl.edu.tw/handle/02722318586696816814 Extraction of semantic annotation document using text mining techniques in cloud computing environment 雲端運算服務環境下運用文字探勘於語意註解網頁文件分析之研究 黃孝文 Master's National Chengchi University Graduate Institute of Management Information Systems 98 楊建民 2010 學位論文 ; thesis 72 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Chengchi University === Graduate Institute of Management Information Systems === 98 === Businesses that perform data mining and text mining today need to handle large-scale datasets. The computational resources of their servers are often limited and too inefficient for analytical jobs. If they run their data mining jobs on cloud computing clusters, however, they can get results quickly on a large dataset without "out of memory" problems. In this study, a series of experiments is conducted to measure and analyze the accuracy of classification algorithms implemented on Hadoop, using the Reuters-21578 dataset. The text mining process consists of four stages: (1) data preprocessing, (2) semantic annotation, (3) classifier, and (4) evaluator. Reuters-21578 was divided into a training set and a test set according to the ModApte split, processed with stopword removal, augmented with semantic annotations as metadata, and split into several subsets by document size. The experiments outline several issues that need to be considered when conducting text mining. According to the experimental results, stopword removal, semantic annotation, the choice of classification algorithm, and document size all affect classification accuracy. First, stopword removal keeps common words from becoming noise that harms the classification result. Second, semantic annotation, as extra information, can improve the result. Third, the complementary naive Bayes algorithm can solve the decision boundary problem that naive Bayes cannot handle. Fourth, long documents can dominate the classification results. Finally, the class imbalance problem can cause a drop in classification accuracy. Text mining results can be improved by adjusting the factors identified above.
author2 楊建民
author_facet 楊建民
黃孝文
author 黃孝文
spellingShingle 黃孝文
Extraction of semantic annotation document using text mining techniques in cloud computing environment
author_sort 黃孝文
title Extraction of semantic annotation document using text mining techniques in cloud computing environment
title_sort extraction of semantic annotation document using text mining techniques in cloud computing environment
publishDate 2010
url http://ndltd.ncl.edu.tw/handle/02722318586696816814
work_keys_str_mv AT huángxiàowén extractionofsemanticannotationdocumentusingtextminingtechniquesincloudcomputingenvironment
AT huángxiàowén yúnduānyùnsuànfúwùhuánjìngxiàyùnyòngwénzìtànkānyúyǔyìzhùjiěwǎngyèwénjiànfēnxīzhīyánjiū
_version_ 1718029402862780416
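
The abstract names stopword removal and complementary naive Bayes as two of the factors studied. As a rough illustration of those two ideas only, the following is a minimal single-machine sketch in Python; it is not the Hadoop implementation the thesis evaluates. The stopword list, the toy training sentences, the labels, and every function name here are hypothetical and chosen purely for illustration. The complement rule follows the common formulation in which each class is scored by the word statistics of all other classes and the lowest-scoring class is chosen.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stopword list; the record does not say which list the thesis used.
STOPWORDS = {"the", "a", "of", "to", "and", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def train(docs):
    """Collect per-class word counts and the vocabulary from (text, label) pairs."""
    counts = defaultdict(Counter)
    for text, label in docs:
        counts[label].update(tokenize(text))
    vocab = {w for class_counts in counts.values() for w in class_counts}
    return counts, vocab


def complement_log_weights(counts, vocab, alpha=1.0):
    """For each class c, smoothed word log-probabilities estimated from every class except c."""
    weights = {}
    for c in counts:
        complement = Counter()
        for other, class_counts in counts.items():
            if other != c:
                complement.update(class_counts)
        total = sum(complement.values()) + alpha * len(vocab)
        weights[c] = {w: math.log((complement[w] + alpha) / total) for w in vocab}
    return weights


def classify(text, weights, vocab):
    """Choose the class whose complement explains the document worst (lowest score)."""
    tokens = [w for w in tokenize(text) if w in vocab]  # ignore unseen words
    scores = {c: sum(lw[w] for w in tokens) for c, lw in weights.items()}
    return min(scores, key=scores.get)


# Toy training data (hypothetical, not taken from Reuters-21578).
TRAIN_DOCS = [
    ("the price of crude oil rose", "crude"),
    ("oil exports and crude output", "crude"),
    ("wheat harvest in the plains", "grain"),
    ("grain and wheat shipments", "grain"),
]

counts, vocab = train(TRAIN_DOCS)
weights = complement_log_weights(counts, vocab)
print(classify("crude oil output", weights, vocab))
print(classify("the wheat harvest", weights, vocab))
```

Because each class is scored from the pooled counts of the other classes, a class with few training documents still gets estimates backed by a larger sample, which is the usual motivation for the complement variant when classes are imbalanced.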