Link Extraction for Crawling Flash on the Web

The set of web pages not reachable using conventional web search engines is usually called the hidden or deep web. One client-side hurdle for crawling the hidden web is Flash files. This thesis presents a tool for extracting links from Flash files up to version 8 to enable web crawling. The files ar...


Bibliographic Details
Main Author: Antelius, Daniel
Format: Others
Language: English
Published: Linköpings universitet, Institutionen för datavetenskap 2015
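Subjects: Flash; crawling; spidering; deep web; hidden web; virtual machine; interpretation; Computer Sciences; Datavetenskap (datalogi)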
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-117604
id ndltd-UPSALLA1-oai-DiVA.org-liu-117604
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-liu-117604 | 2018-01-12T05:09:37Z | Link Extraction for Crawling Flash on the Web | eng | Antelius, Daniel | Linköpings universitet, Institutionen för datavetenskap | Linköpings universitet, Tekniska högskolan | 2015 | Flash; crawling; spidering; deep web; hidden web; virtual machine; interpretation; Computer Sciences; Datavetenskap (datalogi) | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-117604 | application/pdf | info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Flash
crawling
spidering
deep web
hidden web
virtual machine
interpretation
Computer Sciences
Datavetenskap (datalogi)
description The set of web pages not reachable using conventional web search engines is usually called the hidden or deep web. One client-side hurdle for crawling the hidden web is Flash files. This thesis presents a tool for extracting links from Flash files up to version 8 to enable web crawling. The files are both parsed and selectively interpreted to extract links. The purpose of the interpretation is to simulate the normal execution of Flash in the Flash runtime of a web browser. The interpretation is a low level approach that allows the extraction to occur offline and without involving automation of web browsers. A virtual machine is implemented and a set of limitations is chosen to reduce development time and maximize the coverage of interpreted byte code. Out of a test set of about 3500 randomly sampled Flash files the link extractor found links in 34% of the files. The resulting estimated web search engine coverage improvement is almost 10%.
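The parse-then-selectively-interpret idea in the description can be illustrated with a small sketch. This is not the thesis's implementation. It is a minimal stack machine that walks a raw action byte stream, models only string values, and skips every opcode it does not recognise. The opcode numbers and push type sizes are written from memory of the public SWF action model and should be treated as assumptions. The names extract_links, parse_push and action are hypothetical helpers introduced for this example, and extracting the action stream from the enclosing file is left out.

```python
import struct

# Opcode values assumed from the public SWF action model (not taken from the thesis).
ACTION_END = 0x00
ACTION_POP = 0x17
ACTION_STRING_ADD = 0x21
ACTION_GET_URL = 0x83
ACTION_PUSH = 0x96
ACTION_GET_URL2 = 0x9A

UNKNOWN = None  # stands in for any value this sketch does not model


def read_cstring(buf, pos):
    """Read a NUL-terminated string; latin-1 is a simplification that never fails."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("latin-1"), end + 1


def parse_push(payload):
    """Decode a push payload. Only type 0 (string) is kept; other types become UNKNOWN."""
    # Assumed byte widths of the non-string push types.
    fixed_sizes = {1: 4, 2: 0, 3: 0, 4: 1, 5: 1, 6: 8, 7: 4, 8: 1, 9: 2}
    values, pos = [], 0
    while pos < len(payload):
        kind = payload[pos]
        pos += 1
        if kind == 0:
            text, pos = read_cstring(payload, pos)
            values.append(text)
        elif kind in fixed_sizes:
            pos += fixed_sizes[kind]
            values.append(UNKNOWN)
        else:
            break  # unrecognised type: give up on the rest of this push
    return values


def extract_links(actions):
    """Interpret a small opcode subset and collect URL strings passed to GetURL/GetURL2."""
    stack, links, pos = [], [], 0

    def pop():
        return stack.pop() if stack else UNKNOWN

    while pos < len(actions):
        op = actions[pos]
        pos += 1
        payload = b""
        if op >= 0x80:  # long actions carry a 16-bit little-endian payload length
            (length,) = struct.unpack_from("<H", actions, pos)
            pos += 2
            payload = actions[pos:pos + length]
            pos += length
        if op == ACTION_END:
            break
        elif op == ACTION_PUSH:
            stack.extend(parse_push(payload))
        elif op == ACTION_POP:
            pop()
        elif op == ACTION_STRING_ADD:
            top, below = pop(), pop()
            both_known = isinstance(top, str) and isinstance(below, str)
            stack.append(below + top if both_known else UNKNOWN)
        elif op == ACTION_GET_URL:  # URL is stored directly in the payload
            url, _ = read_cstring(payload, 0)
            links.append(url)
        elif op == ACTION_GET_URL2:  # URL is computed at run time on the stack
            pop()  # target window, not needed here
            url = pop()
            if isinstance(url, str):
                links.append(url)
        # every other opcode is skipped; its payload was already stepped over
    return links


def action(op, payload=b""):
    """Hypothetical helper that encodes one action record for the demo below."""
    if op >= 0x80:
        return bytes([op]) + struct.pack("<H", len(payload)) + payload
    return bytes([op])


program = (
    action(ACTION_PUSH, b"\x00http://example.com/\x00" + b"\x00page.html\x00")
    + action(ACTION_STRING_ADD)
    + action(ACTION_PUSH, b"\x00_self\x00")
    + action(ACTION_GET_URL2, b"\x00")
    + action(ACTION_GET_URL, b"http://example.org/a\x00_blank\x00")
    + action(ACTION_END)
)
print(extract_links(program))
```

In the demo, the first link exists only after the string concatenation is executed, which is the case a static scan for URL-like strings would miss and an interpreter can recover. The second link is stored literally and needs parsing alone.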
author Antelius, Daniel
title Link Extraction for Crawling Flash on the Web
publisher Linköpings universitet, Institutionen för datavetenskap
publishDate 2015
url http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-117604