Url Filtering for Data Crawler

Bibliographic Details
Main Authors: Hsien-Chang Lin, 林献章
Other Authors: Sun Wu
Format: Others
Language: zh-TW
Published: 2009
Online Access: http://ndltd.ncl.edu.tw/handle/54822156580202331575
Description
Summary: Master's === National Chung Cheng University === Institute of Computer Science and Information Engineering === 97 === A URL (Uniform Resource Locator) is an address used to locate resources and web pages on the Internet. When a search engine crawls data at a large scale, it starts with a seed list. After the web pages are crawled, it extracts new URLs from them and adds them to the crawling queue. This process is repeated until there are almost no new URLs left to crawl. Although we can crawl a very large amount of data this way, if URL filtering is not applied during the crawling process, our server's disk space may be exhausted and the quality of the search engine will degrade due to the large number of garbage web pages. In this thesis we will study URL filtering through URL analysis and content analysis. We will classify the URLs, analyze them, and use statistical methods to filter out unwanted URLs. We will describe the architecture of our process and discuss some issues that need to be noted for URL filtering. Through our system we expect to effectively filter out URLs that are not needed by the search engine, including invalid URLs, URLs with repeated content, redirection URLs, pornographic content, spam, etc. With URL filtering, we can save time, space, and bandwidth, and improve the quality of the search engine.
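
To make the crawl-and-filter process described above concrete, the following is a minimal Python sketch of a URL filter applied before new URLs enter the crawling queue. It is not the thesis's actual system: the specific rules (scheme check, extension blacklist, keyword-based spam/porn heuristic, fragment-stripping duplicate detection) are assumed examples of the kinds of filters the abstract mentions, and the names is_wanted, BAD_EXTENSIONS, and SPAM_KEYWORDS are hypothetical.

    # Illustrative sketch only; the filtering rules below are assumptions,
    # not the statistical methods developed in the thesis.
    from urllib.parse import urlparse

    BAD_EXTENSIONS = {".exe", ".zip", ".jpg", ".png", ".mp3"}   # assumed non-HTML resources
    SPAM_KEYWORDS = {"casino", "viagra", "porn"}                 # assumed spam/porn indicators

    def is_wanted(url, seen):
        """Return True if the URL should be kept in the crawling queue."""
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):        # invalid or unsupported URLs
            return False
        if not parts.netloc:                              # malformed URLs with no host
            return False
        path = parts.path.lower()
        if any(path.endswith(ext) for ext in BAD_EXTENSIONS):
            return False
        if any(k in url.lower() for k in SPAM_KEYWORDS):  # crude content-keyword heuristic
            return False
        # Normalize away the fragment to catch trivially repeated content.
        canonical = parts._replace(fragment="").geturl()
        if canonical in seen:
            return False
        seen.add(canonical)
        return True

    if __name__ == "__main__":
        seen = set()
        candidates = [
            "http://example.com/page.html",
            "http://example.com/page.html#section2",   # duplicate after normalization
            "ftp://example.com/file",                  # unsupported scheme
            "http://example.com/photo.jpg",            # non-HTML resource
        ]
        queue = [u for u in candidates if is_wanted(u, seen)]
        print(queue)   # only the first URL survives

In a real crawler the duplicate set would be replaced by a scalable structure (for example a Bloom filter or an on-disk index), and the keyword heuristic by the URL and content analysis the thesis proposes; the sketch only shows where such a filter sits in the crawl loop.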