A Distributed Approach to Crawl Domain Specific Hidden Web

A large amount of online information resides on the invisible web: web pages generated dynamically from databases and other data sources, hidden from current crawlers, which retrieve content only from the publicly indexable Web. Specifically, they ignore the tremendous amount of high-quality content ...


Bibliographic Details
Main Author: Desai, Lovekeshkumar
Format: Others
Published: Digital Archive @ GSU 2007
Subjects:
Online Access: http://digitalarchive.gsu.edu/cs_theses/47
http://digitalarchive.gsu.edu/cgi/viewcontent.cgi?article=1046&context=cs_theses
id ndltd-GEORGIA-oai-digitalarchive.gsu.edu-cs_theses-1046
record_format oai_dc
spelling ndltd-GEORGIA-oai-digitalarchive.gsu.edu-cs_theses-1046 2013-04-23T03:19:20Z 2007-08-03 text application/pdf Computer Science Theses
collection NDLTD
format Others
sources NDLTD
topic Deep Web
Breadth-first crawler
Search spider
Distributed Web crawler
task-specific and Domain Specific
Hidden Web
Content Extraction
Computer Sciences
description A large amount of online information resides on the invisible web: web pages generated dynamically from databases and other data sources, hidden from current crawlers, which retrieve content only from the publicly indexable Web. Specifically, they ignore the tremendous amount of high-quality content "hidden" behind search forms, and pages that require authorization or prior registration, in large searchable electronic databases. To extract data from the hidden web, it is necessary to find the search forms and fill them with appropriate information to retrieve the maximum relevant information. To meet the complex challenges that arise when attempting to search the hidden web, i.e. extensive analysis of search forms as well as of the retrieved information, it becomes imperative to design and implement a distributed web crawler that runs on a network of workstations to extract data from the hidden web. We describe the software architecture of this distributed and scalable system and present a number of novel techniques that went into its design and implementation to extract the maximum relevant data from the hidden web while achieving high performance.
author Desai, Lovekeshkumar
title A Distributed Approach to Crawl Domain Specific Hidden Web
publisher Digital Archive @ GSU
publishDate 2007
url http://digitalarchive.gsu.edu/cs_theses/47
http://digitalarchive.gsu.edu/cgi/viewcontent.cgi?article=1046&context=cs_theses