Url Filtering for Data Crawler
Master's === National Chung Cheng University === Graduate Institute of Computer Science and Information Engineering === 97 === A URL (Uniform Resource Locator) is an address used to locate resources and web pages on the Internet. When a search engine crawls data at large scale, it starts with a seed list. After the web pages are crawled, it extracts new URLs from them to add to the...
Main Authors: | Hsien-Chang Lin, 林献章 |
---|---|
Other Authors: | Sun Wu |
Format: | Others |
Language: | zh-TW |
Published: | 2009 |
Online Access: | http://ndltd.ncl.edu.tw/handle/54822156580202331575 |
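The crawl process summarized above (start from a seed list, fetch pages, extract new URLs, add them to the queue, repeat until nothing new remains) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the `crawl` function, the `LinkExtractor` class, the `max_pages` limit, and the in-memory `PAGES` dictionary are all assumptions introduced here so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from seeds.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the page's HTML as a string, or None if it cannot be fetched.
    """
    queue = deque(seeds)
    seen = set(seeds)          # every URL ever enqueued, to avoid re-crawling
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return crawled


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": "no links here",
}

order = crawl(["http://example.com/"], PAGES.get)
print(order)
```

Without the `seen` set the loop would revisit the same pages indefinitely, and without any further filtering every discovered link is enqueued, which is the growth problem the abstract describes.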
id |
ndltd-TW-097CCU05392044 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-097CCU05392044 2016-05-04T04:26:08Z http://ndltd.ncl.edu.tw/handle/54822156580202331575 Url Filtering for Data Crawler 資料蒐集器的連結過濾方法 Hsien-Chang Lin 林献章 Master's National Chung Cheng University Graduate Institute of Computer Science and Information Engineering 97 Sun Wu 吳昇 2009 thesis 39 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === National Chung Cheng University === Graduate Institute of Computer Science and Information Engineering === 97 === A URL (Uniform Resource Locator) is an address used to locate resources and web pages on the Internet. When a search engine crawls data at large scale, it starts with a seed list. After the web pages are crawled, it extracts new URLs from them and adds these to the crawling queue. This process is repeated until almost no new URLs remain to be crawled.
Although a very large amount of data can be crawled this way, without URL filtering in the crawling process the server's disk space can be exhausted, and the quality of the search engine degrades because of the large number of garbage web pages.
In this thesis we study URL filtering through URL analysis and content analysis. We classify the URLs, analyze them, and use statistical methods to filter out unwanted URLs. We describe the architecture of our process and discuss issues that need attention in URL filtering.
With our system we expect to effectively filter out URLs that a search engine does not need. These include invalid URLs, URLs with repeated content, redirection URLs, pornographic content, spam, and so on. URL filtering saves time, space, and bandwidth, and improves the quality of search engines.
|
author2 |
Sun Wu |
author_facet |
Sun Wu Hsien-Chang Lin 林献章 |
author |
Hsien-Chang Lin 林献章 |
spellingShingle |
Hsien-Chang Lin 林献章 Url Filtering for Data Crawler |
author_sort |
Hsien-Chang Lin |
title |
Url Filtering for Data Crawler |
publishDate |
2009 |
url |
http://ndltd.ncl.edu.tw/handle/54822156580202331575 |
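The abstract lists invalid URLs and URLs with repeated content among the things to filter out. As a rough illustration of the URL-analysis side only, the sketch below rejects URLs that are not usable HTTP(S) addresses and collapses trivially different spellings of the same address before they reach the crawl queue. It is a simple syntactic heuristic, not the statistical method or content analysis of the thesis; the function names `normalize` and `filter_urls` and the `SKIPPED_SUFFIXES` list are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of path suffixes to skip; not taken from the thesis.
SKIPPED_SUFFIXES = (".jpg", ".png", ".gif", ".zip", ".exe")


def normalize(url):
    """Return a canonical form of url, or None if it is not a usable HTTP(S) URL."""
    try:
        parts = urlsplit(url.strip())
        port = parts.port              # raises ValueError for a malformed port
    except ValueError:
        return None
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if scheme not in ("http", "https") or not host:
        return None
    default_port = {"http": 80, "https": 443}[scheme]
    netloc = f"[{host}]" if ":" in host else host   # re-bracket IPv6 literals
    if port is not None and port != default_port:
        netloc = f"{netloc}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def filter_urls(urls):
    """Keep one canonical copy of each valid URL, in first-seen order."""
    kept, seen = [], set()
    for url in urls:
        canonical = normalize(url)
        if canonical is None:
            continue                   # invalid or non-HTTP URL
        if urlsplit(canonical).path.lower().endswith(SKIPPED_SUFFIXES):
            continue                   # unwanted file type
        if canonical in seen:
            continue                   # same address seen before
        seen.add(canonical)
        kept.append(canonical)
    return kept


candidates = [
    "HTTP://Example.com:80/index.html#top",
    "http://example.com/index.html",
    "mailto:someone@example.com",
    "http://example.com/logo.PNG",
    "javascript:void(0)",
    "https://example.com",
]
result = filter_urls(candidates)
print(result)
```

Normalization only catches duplicates that differ in spelling. Detecting pages with repeated content under genuinely different URLs, redirections, or spam requires fetching and analyzing the content, which this sketch does not attempt.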