Efficient Storage and Domain-Specific Information Discovery on Semistructured Documents

The increasing amount of available semistructured data demands efficient mechanisms to store, process, and search an enormous corpus of data to encourage its global adoption. Current techniques to store semistructured documents either map them to relational databases, or use a combination of flat fi...

Bibliographic Details
Main Author:	Farfan, Fernando R
Format:	Others
Published:	FIU Digital Commons 2009
Subjects:	Semistructured documents; XML; storage; parsing; information retrieval; semisequential access; lazy parsing; ontologies; Data Storage Systems; Other Computer Engineering
Online Access:	http://digitalcommons.fiu.edu/etd/126 http://digitalcommons.fiu.edu/cgi/viewcontent.cgi?article=1153&context=etd

id	ndltd-fiu.edu-oai-digitalcommons.fiu.edu-etd-1153
record_format	oai_dc
spelling	ndltd-fiu.edu-oai-digitalcommons.fiu.edu-etd-1153 2018-07-19T03:31:31Z Efficient Storage and Domain-Specific Information Discovery on Semistructured Documents Farfan, Fernando R 2009-11-12T08:00:00Z text application/pdf http://digitalcommons.fiu.edu/etd/126 http://digitalcommons.fiu.edu/cgi/viewcontent.cgi?article=1153&context=etd FIU Electronic Theses and Dissertations FIU Digital Commons
collection	NDLTD
format	Others
sources	NDLTD
topic	Semistructured documents; XML; storage; parsing; information retrieval; semisequential access; lazy parsing; ontologies; Data Storage Systems; Other Computer Engineering
description	The increasing amount of available semistructured data demands efficient mechanisms to store, process, and search an enormous corpus of data to encourage its global adoption. Current techniques to store semistructured documents either map them to relational databases or use a combination of flat files and indexes. Both approaches result in a mismatch between the tree structure of semistructured data and the access characteristics of the underlying storage devices. Furthermore, the inefficiency of XML parsing methods has slowed the large-scale adoption of XML in actual system implementations. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have significant drawbacks that hinder the widespread adoption of XML. Once the processing (storage and parsing) issues for semistructured data have been addressed, another key challenge in leveraging semistructured data is to perform effective information discovery on such data. Previous work has addressed this problem in a generic (i.e., domain-independent) way, but the process can be improved if knowledge about the specific domain is taken into consideration. This dissertation had two general goals. The first goal was to devise novel techniques to efficiently store and process semistructured documents. This goal had two specific aims: (1) we proposed a method for storing semistructured documents that maps the physical characteristics of the documents to the geometrical layout of hard drives, and (2) we developed a Double-Lazy Parser for semistructured documents, which introduces lazy behavior in both the pre-parsing and progressive parsing phases of the standard Document Object Model's parsing mechanism. The second goal was to construct a user-friendly and efficient engine for performing information discovery over domain-specific semistructured documents. This goal also had two aims: (1) we presented a framework that exploits domain-specific knowledge to improve the quality of the information discovery process by incorporating domain ontologies, and (2) we proposed meaningful evaluation metrics to compare the results of search systems over semistructured documents.
author	Farfan, Fernando R
title	Efficient Storage and Domain-Specific Information Discovery on Semistructured Documents
title_sort	efficient storage and domain-specific information discovery on semistructured documents
publisher	FIU Digital Commons
publishDate	2009
url	http://digitalcommons.fiu.edu/etd/126 http://digitalcommons.fiu.edu/cgi/viewcontent.cgi?article=1153&context=etd
