Url Filtering for Data Crawler

Master's === National Chung Cheng University === Graduate Institute of Computer Science and Information Engineering === 97 === A URL (Uniform Resource Locator) is an address used to locate resources and web pages on the Internet. When a search engine crawls data at large scale, it starts with a seed list. After the web pages are crawled, it extracts new URLs from them to add to the...


Bibliographic Details
Main Authors:	Hsien-Chang Lin, 林献章
Other Authors:	Sun Wu
Format:	Others
Language:	zh-TW
Published:	2009
Online Access:	http://ndltd.ncl.edu.tw/handle/54822156580202331575

id	ndltd-TW-097CCU05392044
record_format	oai_dc
spelling	ndltd-TW-097CCU05392044 2016-05-04T04:26:08Z http://ndltd.ncl.edu.tw/handle/54822156580202331575 Url Filtering for Data Crawler 資料蒐集器的連結過濾方法 Hsien-Chang Lin 林献章 Master's, National Chung Cheng University, Graduate Institute of Computer Science and Information Engineering, 97 Sun Wu 吳昇 2009 學位論文 (degree thesis) ; thesis 39 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Chung Cheng University === Graduate Institute of Computer Science and Information Engineering === 97 === A URL (Uniform Resource Locator) is an address used to locate resources and web pages on the Internet. When a search engine crawls data at large scale, it starts with a seed list. After the web pages are crawled, it extracts new URLs from them and adds these to the crawl queue. This process is repeated until there are almost no new URLs left to crawl. Although this lets us crawl a very large amount of data, if URL filtering is not applied during the crawl, the server's disk space can be exhausted and the quality of the search engine degrades because of the large number of garbage web pages. In this thesis we study URL filtering through URL analysis and content analysis. We classify the URLs, analyze them, and use statistical methods to filter out unwanted ones. We describe the architecture of our process and discuss issues that need attention in URL filtering. With our system we expect to effectively filter out URLs that a search engine does not need. These include invalid URLs, URLs with repeated content, redirection URLs, pornographic content, spam, etc. With URL filtering, we can save time, space, and bandwidth, and improve the quality of search engines.
author2	Sun Wu
author_facet	Sun Wu Hsien-Chang Lin 林献章
author	Hsien-Chang Lin 林献章
spellingShingle	Hsien-Chang Lin 林献章 Url Filtering for Data Crawler
author_sort	Hsien-Chang Lin
title	Url Filtering for Data Crawler
title_short	Url Filtering for Data Crawler
title_full	Url Filtering for Data Crawler
title_fullStr	Url Filtering for Data Crawler
title_full_unstemmed	Url Filtering for Data Crawler
title_sort	url filtering for data crawler
publishDate	2009
url	http://ndltd.ncl.edu.tw/handle/54822156580202331575
work_keys_str_mv	AT hsienchanglin urlfilteringfordatacrawler AT línxiànzhāng urlfilteringfordatacrawler AT hsienchanglin zīliàosōujíqìdeliánjiéguòlǜfāngfǎ AT línxiànzhāng zīliàosōujíqìdeliánjiéguòlǜfāngfǎ
_version_	1718258374467911680
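
The crawl loop summarized in the description field (seed list, crawl queue, link extraction, URL filtering) can be sketched roughly as below. This is only an illustrative sketch, not the thesis's system: the names `normalize_url`, `is_valid`, `crawl`, `fetch_links`, and `TOY_WEB` are hypothetical, the filter is a simple rule-based check rather than the statistical and content-based analysis the abstract mentions, and page fetching is replaced by a small in-memory link table so the example runs without network access.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host and drop the fragment so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def is_valid(url):
    """Assumed rule-based filter: keep only http(s) URLs that have a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: start from seeds, enqueue only new URLs that pass the filter."""
    queue = deque()
    seen = set()
    for seed in seeds:
        url = normalize_url(seed)
        if is_valid(url) and url not in seen:
            seen.add(url)
            queue.append(url)

    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        # fetch_links stands in for downloading the page and extracting its links.
        for href in fetch_links(url):
            candidate = normalize_url(urljoin(url, href))
            if is_valid(candidate) and candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
    return crawled


# Hypothetical link table standing in for real web pages.
TOY_WEB = {
    "http://example.com/": ["/a", "/b#top", "mailto:someone@example.com", "HTTP://EXAMPLE.COM/a"],
    "http://example.com/a": ["/b", "javascript:void(0)"],
    "http://example.com/b": ["/"],
}


def toy_fetch_links(url):
    return TOY_WEB.get(url, [])


if __name__ == "__main__":
    for page in crawl(["http://example.com"], toy_fetch_links):
        print(page)
```

Filtering at enqueue time, rather than after download, is what saves bandwidth and disk space in this sketch: URLs that are rejected or already seen are never fetched. Detecting repeated content, redirections, or spam would need the content analysis the abstract refers to, which this sketch does not attempt.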


Similar Items