A Study of Web Spam Page Using Machine Learning

Master's === Fu Jen Catholic University === Department of Information Management === 99 === Web search has become essential: users rely on search engines to find the pages they need. A page's rank in the search results matters because it determines the page's exposure. There are many ways to improve a page's rank, but...


Bibliographic Details
Main Authors: Lee, Shaw-fu, 李紹甫
Other Authors: Tsai, Ming-jyh
Format: Others
Language: zh-TW
Published: 2011
Online Access: http://ndltd.ncl.edu.tw/handle/33029180122271238396
id ndltd-TW-099FJU00396065
record_format oai_dc
spelling ndltd-TW-099FJU00396065 2015-10-13T20:04:05Z http://ndltd.ncl.edu.tw/handle/33029180122271238396 A Study of Web Spam Page Using Machine Learning 應用機器學習有效偵測垃圾網頁之研究 Lee, Shaw-fu 李紹甫 Master's, Fu Jen Catholic University, Department of Information Management, 99 Tsai, Ming-jyh 蔡明志 2011 thesis 66 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === Fu Jen Catholic University === Department of Information Management === 99 === Web search has become essential: users rely on search engines to find the pages they need. A page's rank in the search results matters because it determines the page's exposure. There are many ways to improve a page's rank, but when these techniques are abused, irrelevant pages are boosted in the results and users cannot find the important pages they were looking for. We propose a machine learning approach to detecting spam pages that uses an SVM and a neural network as classifiers and a genetic algorithm to optimize the parameter set and the feature set; we classify normal and spam pages and compare the performance of the classifiers. Finally, we also apply AdaBoost to improve classification performance. However, the numbers of normal and spam pages in web spam data sets are usually imbalanced, which degrades classifier performance. We therefore apply clustering-based data re-sampling to produce a training set with equal numbers of spam and normal pages, aiming to improve classification performance. We also propose using different feature selection approaches for clustering and for classification, which improves web spam detection results. The experiments show that (1) using clustering-based re-sampling to produce a balanced training set solves the problem caused by the imbalanced web spam data set; (2) the parameters and features evolved by the genetic algorithm improve overall classification performance; (3) using different feature selection approaches for clustering and classification is better than using a single approach; and (4) the proposed approach detects web spam pages effectively.
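The clustering-based re-sampling step mentioned in the description can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm, cluster count, or sampling rule the thesis uses, so everything here is an assumption made for illustration: plain k-means over the majority (normal) class, followed by round-robin draws from the clusters until the sample matches the minority (spam) class in size. The names `kmeans` and `cluster_undersample` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iterations: int = 20,
           seed: int = 0) -> List[List[Vector]]:
    """Plain Lloyd's k-means (an assumed choice); returns the non-empty clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters: List[List[Vector]] = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return [c for c in clusters if c]


def cluster_undersample(majority: List[Vector], minority: List[Vector],
                        k: int = 3, seed: int = 0) -> List[Vector]:
    """Draw len(minority) majority samples, spread across k-means clusters."""
    rng = random.Random(seed)
    # Shuffled copy of each cluster so draws within a cluster are random.
    pools = [rng.sample(c, len(c)) for c in kmeans(majority, k, seed=seed)]
    picked: List[Vector] = []
    # Round-robin over clusters so every region of the majority class is represented.
    while len(picked) < len(minority) and any(pools):
        for pool in pools:
            if pool and len(picked) < len(minority):
                picked.append(pool.pop())
    return picked


# Toy data (made up): 12 "normal" pages and 3 "spam" pages with two features each.
normal = [(float(i), float(i % 3)) for i in range(12)]
spam = [(100.0, 100.0), (101.0, 99.0), (99.0, 101.0)]
balanced_normal = cluster_undersample(normal, spam, k=3)
print(len(balanced_normal), len(spam))
```

Drawing from every cluster, rather than sampling the majority class uniformly, is one way to keep the reduced set representative of the different kinds of normal pages; the balanced set would then be passed to the classifier for training.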
author2 Tsai, Ming-jyh
author_facet Tsai, Ming-jyh
Lee, Shaw-fu
李紹甫
author Lee, Shaw-fu
李紹甫
spellingShingle Lee, Shaw-fu
李紹甫
A Study of Web Spam Page Using Machine Learning
author_sort Lee, Shaw-fu
title A Study of Web Spam Page Using Machine Learning
title_short A Study of Web Spam Page Using Machine Learning
title_full A Study of Web Spam Page Using Machine Learning
title_fullStr A Study of Web Spam Page Using Machine Learning
title_full_unstemmed A Study of Web Spam Page Using Machine Learning
title_sort study of web spam page using machine learning
publishDate 2011
url http://ndltd.ncl.edu.tw/handle/33029180122271238396
work_keys_str_mv AT leeshawfu astudyofwebspampageusingmachinelearning
AT lǐshàofǔ astudyofwebspampageusingmachinelearning
AT leeshawfu yīngyòngjīqìxuéxíyǒuxiàozhēncèlājīwǎngyèzhīyánjiū
AT lǐshàofǔ yīngyòngjīqìxuéxíyǒuxiàozhēncèlājīwǎngyèzhīyánjiū
AT leeshawfu studyofwebspampageusingmachinelearning
AT lǐshàofǔ studyofwebspampageusingmachinelearning
_version_ 1718043455090851840