Extraction of semantic annotation document using text mining techniques in cloud computing environment

Bibliographic Details
Main Author: 黃孝文
Other Authors: 楊建民
Format: Others
Language: zh-TW
Published: 2010
Online Access: http://ndltd.ncl.edu.tw/handle/02722318586696816814
Description
Summary: Master's thesis === National Chengchi University === Graduate Institute of Information Management === 98 === Businesses that perform data mining and text mining now need to handle large-scale datasets. The computational resources of their servers are often limited and too inefficient for analytical jobs. If they run their data mining jobs on cloud computing clusters instead, they can obtain results on a large dataset quickly and without "out of memory" problems. In this study, a series of experiments is conducted to measure and analyze the accuracy of classification algorithms implemented on Hadoop, using the Reuters-21578 dataset. The text mining process consists of four stages: (1) data preprocessing, (2) semantic annotation, (3) classification, and (4) evaluation. Reuters-21578 was divided into a training set and a testing set according to the ModApte split, processed by stopword removal, extended with semantic annotations as metadata, and split into several subsets by document size. The experiments highlight several issues that need to be considered when conducting text mining. According to the experimental results, stopword removal, semantic annotation, the choice of classification algorithm, and document size can each improve classification accuracy. First, stopword removal keeps common words from becoming noise that harms the classification result. Second, semantic annotation, as extra information, can improve the result. Third, the Complement Naive Bayes algorithm can solve the decision boundary problem that Naive Bayes cannot handle. Fourth, long documents can dominate the classification results. Fifth, the class imbalance problem can cause a drop in classification accuracy. Text mining results can be improved by adjusting the parameters identified above.
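As a rough illustration of two steps the summary names, stopword removal and Complement Naive Bayes, the following is a minimal single-machine sketch in Python. It is not the thesis's Hadoop implementation. The stopword list, the function names, the smoothing parameter `alpha`, and the omission of weight normalization are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def train_complement_nb(docs, labels, alpha=1.0):
    """Estimate per-class log-weights from the documents NOT in each class."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    counts = {c: Counter() for c in set(labels)}
    for doc, label in zip(docs, labels):
        counts[label].update(tokenize(doc))

    weights = {}
    for c in counts:
        # Pool term counts over every class except c (the "complement").
        complement = Counter()
        for other, term_counts in counts.items():
            if other != c:
                complement.update(term_counts)
        # Additive smoothing so unseen terms do not produce log(0).
        total = sum(complement.values()) + alpha * len(vocab)
        weights[c] = {w: math.log((complement[w] + alpha) / total) for w in vocab}
    return weights


def predict(weights, text):
    """Pick the class whose complement explains the document worst."""
    term_freq = Counter(tokenize(text))

    def complement_score(c):
        # Out-of-vocabulary terms contribute 0 to every class.
        return sum(n * weights[c].get(w, 0.0) for w, n in term_freq.items())

    return min(weights, key=complement_score)
```

Because each class is scored against the pooled counts of all other classes, a class with few training documents still gets estimates based on a large sample, which is why this variant is often preferred when classes are imbalanced.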