Url Filtering for Data Crawler

Master's === National Chung Cheng University === Graduate Institute of Computer Science and Information Engineering === 97 === A URL (Uniform Resource Locator) is an address used to locate resources and web pages on the Internet. When a search engine crawls data at large scale, it starts with a seed list. After the web pages are crawled, it extracts new URLs from them and adds them to the...


Bibliographic Details
Main Authors: Hsien-Chang Lin, 林献章
Other Authors: Sun Wu
Format: Others
Language: zh-TW
Published: 2009
Online Access: http://ndltd.ncl.edu.tw/handle/54822156580202331575
id ndltd-TW-097CCU05392044
record_format oai_dc
spelling ndltd-TW-097CCU05392044 2016-05-04T04:26:08Z http://ndltd.ncl.edu.tw/handle/54822156580202331575 Url Filtering for Data Crawler 資料蒐集器的連結過濾方法 Hsien-Chang Lin 林献章 Master's National Chung Cheng University Graduate Institute of Computer Science and Information Engineering 97 A URL (Uniform Resource Locator) is an address used to locate resources and web pages on the Internet. When a search engine crawls data at large scale, it starts with a seed list. After the web pages are crawled, it extracts new URLs from them and adds them to the crawling queue. This process repeats until almost no new URLs remain to crawl. Although we can crawl a very large amount of data this way, if URL filtering is not applied during crawling, our server's disk space could be exhausted and the quality of the search engine will degrade due to the large number of garbage web pages. In this thesis we study URL filtering through URL analysis and content analysis. We classify the URLs, analyze them, and use statistical methods to filter out unwanted URLs. We describe the architecture of our process and discuss issues that need attention in URL filtering. With our system we expect to effectively filter out URLs that a search engine does not need, including invalid URLs, URLs with duplicated content, redirection URLs, pornographic content, spam, and so on. With URL filtering we can save time, space, and bandwidth, and improve the quality of search engines. Sun Wu 吳昇 2009 Academic thesis ; thesis 39 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Chung Cheng University === Graduate Institute of Computer Science and Information Engineering === 97 === A URL (Uniform Resource Locator) is an address used to locate resources and web pages on the Internet. When a search engine crawls data at large scale, it starts with a seed list. After the web pages are crawled, it extracts new URLs from them and adds them to the crawling queue. This process repeats until almost no new URLs remain to crawl. Although we can crawl a very large amount of data this way, if URL filtering is not applied during crawling, our server's disk space could be exhausted and the quality of the search engine will degrade due to the large number of garbage web pages. In this thesis we study URL filtering through URL analysis and content analysis. We classify the URLs, analyze them, and use statistical methods to filter out unwanted URLs. We describe the architecture of our process and discuss issues that need attention in URL filtering. With our system we expect to effectively filter out URLs that a search engine does not need, including invalid URLs, URLs with duplicated content, redirection URLs, pornographic content, spam, and so on. With URL filtering we can save time, space, and bandwidth, and improve the quality of search engines.
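The abstract describes a standard crawl loop: start from a seed list, extract new URLs from fetched pages, and admit a URL to the queue only if a filter accepts it. A minimal sketch of that loop, assuming a hypothetical in-memory link graph in place of real HTTP fetches and a simple rule-based `is_wanted` filter; the thesis itself applies statistical URL and content analysis, not these hard-coded rules:

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real page fetches.
LINK_GRAPH = {
    "http://example.com/": [
        "http://example.com/a",
        "http://example.com/a",              # duplicate of an already-seen URL
        "http://example.com/login?next=/a",  # redirection/session-style URL
        "not-a-url",                         # invalid URL
    ],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}

def is_wanted(url, seen):
    """Reject invalid, already-seen, and redirection-style URLs.

    Illustrative rules only; a real filter would use the statistical
    URL/content analysis described in the abstract.
    """
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False  # invalid URL
    if url in seen:
        return False  # repeated URL (would yield duplicated content)
    if "login" in parts.path or "next=" in parts.query:
        return False  # redirection/session URL
    return True

def crawl(seed):
    """BFS crawl from a single seed, filtering URLs before queuing."""
    queue, seen, crawled = deque([seed]), {seed}, []
    while queue:
        url = queue.popleft()
        crawled.append(url)
        for link in LINK_GRAPH.get(url, []):
            if is_wanted(link, seen):
                seen.add(link)
                queue.append(link)
    return crawled
```

Filtering at enqueue time, rather than after fetching, is what yields the time, space, and bandwidth savings the abstract claims: rejected URLs are never downloaded or stored at all.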
author2 Sun Wu
author_facet Sun Wu
Hsien-Chang Lin
林献章
author Hsien-Chang Lin
林献章
spellingShingle Hsien-Chang Lin
林献章
Url Filtering for Data Crawler
author_sort Hsien-Chang Lin
title Url Filtering for Data Crawler
title_short Url Filtering for Data Crawler
title_full Url Filtering for Data Crawler
title_fullStr Url Filtering for Data Crawler
title_full_unstemmed Url Filtering for Data Crawler
title_sort url filtering for data crawler
publishDate 2009
url http://ndltd.ncl.edu.tw/handle/54822156580202331575
work_keys_str_mv AT hsienchanglin urlfilteringfordatacrawler
AT línxiànzhāng urlfilteringfordatacrawler
AT hsienchanglin zīliàosōujíqìdeliánjiéguòlǜfāngfǎ
AT línxiànzhāng zīliàosōujíqìdeliánjiéguòlǜfāngfǎ
_version_ 1718258374467911680