Extraction of semantic annotation document using text mining techniques in cloud computing environment

Master's thesis === National Chengchi University === Graduate Institute of Management Information Systems === 98 === Businesses that perform data mining and text mining now need to handle large-scale datasets. The computational resources of their servers are often limited and too inefficient for analytical jobs. But if they could run their data mining jobs under cloud comput...


Bibliographic Details
Main Author:	黃孝文
Other Authors:	楊建民
Format:	Others
Language:	zh-TW
Published:	2010
Online Access:	http://ndltd.ncl.edu.tw/handle/02722318586696816814

id	ndltd-TW-098NCCU5396017
record_format	oai_dc
spelling	ndltd-TW-098NCCU5396017 2015-10-13T18:16:14Z http://ndltd.ncl.edu.tw/handle/02722318586696816814 Extraction of semantic annotation document using text mining techniques in cloud computing environment 雲端運算服務環境下運用文字探勘於語意註解網頁文件分析之研究 黃孝文 Master's thesis, National Chengchi University, Graduate Institute of Management Information Systems, 98 楊建民 2010 thesis 72 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's thesis === National Chengchi University === Graduate Institute of Management Information Systems === 98 === Businesses that perform data mining and text mining now need to handle large-scale datasets. The computational resources of their servers are often limited and too inefficient for analytical jobs. But if they could run their data mining jobs on cloud computing clusters, they would be able to get results very quickly on a large dataset without "out of memory" problems. In this paper, a series of experiments is conducted to measure and analyze the accuracy of classification algorithms implemented on Hadoop using the Reuters-21578 dataset. The text mining process consisted of four stages: (1) data preprocessing, (2) semantic annotation, (3) classifier, (4) evaluator. Reuters-21578 was divided into a training set and a testing set based on the ModApte split, processed by stopword removal, given appended semantic annotations as metadata, and split into several subsets according to document size. The experiments outlined several issues that need to be considered when conducting text mining. According to the experimental results, the researcher found that stopword removal, semantic annotation, the choice of classification algorithm, and document size all affect classification accuracy. First, stopword removal keeps common words from becoming noise that harms the classification result. Second, semantic annotation, as extra information, could improve the result. Third, the complementary naive Bayes algorithm could solve the decision boundary problem that naive Bayes cannot handle. Fourth, long documents could dominate the classification results. Fifth, the class imbalance problem could cause a drop in classification accuracy. Text mining results could be improved by adjusting the parameters identified above.
author2	楊建民
author_facet	楊建民 黃孝文
author	黃孝文
author_sort	黃孝文
title	Extraction of semantic annotation document using text mining techniques in cloud computing environment
publishDate	2010
url	http://ndltd.ncl.edu.tw/handle/02722318586696816814
_version_	1718029402862780416
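
The abstract in this record describes a pipeline of preprocessing, classification, and evaluation, and contrasts naive Bayes with complementary naive Bayes. The following is a minimal single-process Python sketch of those three stages, not the thesis's Hadoop implementation. The stopword list, the toy documents, and all function names are assumptions made for illustration. The semantic annotation stage is left out because the record does not say how annotations are produced, and the complement classifier here omits the weight normalization used in some published variants.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for"}

def tokenize(text, remove_stopwords=True):
    """Preprocessing: lowercase, split on whitespace, optionally drop stopwords."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

def train(labeled_docs):
    """Count word occurrences per class and documents per class."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def log_likelihood(tokens, counts, vocab):
    """Sum of Laplace-smoothed log word probabilities under one count table."""
    total = sum(counts.values())
    return sum(math.log((counts[t] + 1) / (total + len(vocab)))
               for t in tokens if t in vocab)

def predict_nb(text, model):
    """Multinomial naive Bayes: class prior plus the class's own word likelihood."""
    word_counts, doc_counts, vocab = model
    tokens = tokenize(text)
    n_docs = sum(doc_counts.values())

    def score(label):
        prior = math.log(doc_counts[label] / n_docs)
        return prior + log_likelihood(tokens, word_counts[label], vocab)

    return max(sorted(word_counts), key=score)

def predict_cnb(text, model):
    """Complement naive Bayes: pick the class whose complement fits the text worst."""
    word_counts, _, vocab = model
    tokens = tokenize(text)

    def score(label):
        complement = Counter()
        for other, counts in word_counts.items():
            if other != label:
                complement.update(counts)
        return -log_likelihood(tokens, complement, vocab)

    return max(sorted(word_counts), key=score)

def accuracy(predict, model, test_docs):
    """Evaluation: fraction of test documents labeled correctly."""
    hits = sum(predict(text, model) == label for text, label in test_docs)
    return hits / len(test_docs)

# Toy, deliberately imbalanced training data (three "crude" documents, one "grain").
TRAIN = [
    ("wheat and corn harvest in the midwest", "grain"),
    ("oil prices rise", "crude"),
    ("crude oil exports fall", "crude"),
    ("opec cuts oil output", "crude"),
]
TEST = [("wheat harvest", "grain"), ("oil output", "crude")]

model = train(TRAIN)
print("naive Bayes accuracy:", accuracy(predict_nb, model, TEST))
print("complement naive Bayes accuracy:", accuracy(predict_cnb, model, TEST))
```

The complement variant estimates each class's word weights from every other class's documents, so a class with few training documents still gets estimates based on a large pool of text. That is the usual motivation for using it on imbalanced collections such as the one the abstract describes.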


Similar Items