A Study of Web Spam Page Using Machine Learning

Master's === 輔仁大學 (Fu Jen Catholic University) === Department of Information Management === 99 === Web search has become very important today; users rely on search engines to find the web pages they need. The rank of a web page in the search results is important because it determines the page's exposure. There are many ways to improve a page's rank in the search results, but...

Bibliographic Details
Main Authors:	Lee, Shaw-fu, 李紹甫
Other Authors:	Tsai, Ming-jyh
Format:	Others
Language:	zh-TW
Published:	2011
Online Access:	http://ndltd.ncl.edu.tw/handle/33029180122271238396

id	ndltd-TW-099FJU00396065
record_format	oai_dc
spelling	ndltd-TW-099FJU00396065 2015-10-13T20:04:05Z http://ndltd.ncl.edu.tw/handle/33029180122271238396 A Study of Web Spam Page Using Machine Learning 應用機器學習有效偵測垃圾網頁之研究 Lee, Shaw-fu 李紹甫 碩士 輔仁大學 資訊管理學系 99 Tsai, Ming-jyh 蔡明志 2011 學位論文 ; thesis 66 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === 輔仁大學 (Fu Jen Catholic University) === Department of Information Management === 99 === Web search has become very important today; users rely on search engines to find the web pages they need. The rank of a web page in the search results is important because it determines the page's exposure. There are many ways to improve a page's rank, but when these techniques are abused, irrelevant web pages are boosted in the results and users cannot find the important pages they queried for. We propose a machine learning approach to detecting spam pages that uses an SVM and a neural network as classifiers and uses a genetic algorithm to optimize the parameter set and the feature set; we classify normal pages and spam pages and then compare the classifiers' performance. Finally, we also apply AdaBoost to try to improve classification performance. However, the numbers of normal pages and spam pages in web spam data sets are usually imbalanced, which degrades classifier performance. We therefore apply clustering-based data re-sampling to produce a training set with equal numbers of spam pages and normal pages, aiming to improve classification performance. We also propose using different feature selection approaches for clustering and for classification, which helps web spam detection performance. The experiments show that (1) using clustering-based re-sampling to produce a balanced training set solves the problem caused by the imbalanced web spam data set; (2) the parameter and feature sets evolved by the genetic algorithm improve overall classification performance; (3) using different feature selection approaches for clustering and for classification is better than using only one approach; and (4) the proposed approach detects web spam pages effectively.
author2	Tsai, Ming-jyh
author_facet	Tsai, Ming-jyh Lee, Shaw-fu 李紹甫
author	Lee, Shaw-fu 李紹甫
spellingShingle	Lee, Shaw-fu 李紹甫 A Study of Web Spam Page Using Machine Learning
author_sort	Lee, Shaw-fu
title	A Study of Web Spam Page Using Machine Learning
title_short	A Study of Web Spam Page Using Machine Learning
title_full	A Study of Web Spam Page Using Machine Learning
title_fullStr	A Study of Web Spam Page Using Machine Learning
title_full_unstemmed	A Study of Web Spam Page Using Machine Learning
title_sort	study of web spam page using machine learning
publishDate	2011
url	http://ndltd.ncl.edu.tw/handle/33029180122271238396
work_keys_str_mv	AT leeshawfu astudyofwebspampageusingmachinelearning AT lǐshàofǔ astudyofwebspampageusingmachinelearning AT leeshawfu yīngyòngjīqìxuéxíyǒuxiàozhēncèlājīwǎngyèzhīyánjiū AT lǐshàofǔ yīngyòngjīqìxuéxíyǒuxiàozhēncèlājīwǎngyèzhīyánjiū AT leeshawfu studyofwebspampageusingmachinelearning AT lǐshàofǔ studyofwebspampageusingmachinelearning
_version_	1718043455090851840
