Efficient Storage and Domain-Specific Information Discovery on Semistructured Documents

The increasing amount of available semistructured data demands efficient mechanisms to store, process, and search an enormous corpus of data to encourage its global adoption. Current techniques to store semistructured documents either map them to relational databases, or use a combination of flat fi...


Bibliographic Details
Main Author: Farfan, Fernando R
Format: Others
Published: FIU Digital Commons 2009
Subjects:
XML
Online Access: http://digitalcommons.fiu.edu/etd/126
http://digitalcommons.fiu.edu/cgi/viewcontent.cgi?article=1153&context=etd
id ndltd-fiu.edu-oai-digitalcommons.fiu.edu-etd-1153
record_format oai_dc
spelling ndltd-fiu.edu-oai-digitalcommons.fiu.edu-etd-1153 2018-07-19T03:31:31Z Efficient Storage and Domain-Specific Information Discovery on Semistructured Documents Farfan, Fernando R 2009-11-12T08:00:00Z text application/pdf http://digitalcommons.fiu.edu/etd/126 http://digitalcommons.fiu.edu/cgi/viewcontent.cgi?article=1153&context=etd FIU Electronic Theses and Dissertations FIU Digital Commons Semistructured documents XML storage parsing information retrieval semisequential access lazy parsing ontologies Data Storage Systems Other Computer Engineering
collection NDLTD
format Others
sources NDLTD
topic Semistructured documents
XML
storage
parsing
information retrieval
semisequential access
lazy parsing
ontologies
Data Storage Systems
Other Computer Engineering
description The increasing amount of available semistructured data demands efficient mechanisms to store, process, and search an enormous corpus of data to encourage its global adoption. Current techniques to store semistructured documents either map them to relational databases or use a combination of flat files and indexes. Both approaches result in a mismatch between the tree structure of semistructured data and the access characteristics of the underlying storage devices. Furthermore, the inefficiency of XML parsing methods has slowed the large-scale adoption of XML in actual system implementations. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have significant drawbacks that hinder the widespread adoption of XML. Once the processing (storage and parsing) issues for semistructured data have been addressed, another key challenge in leveraging semistructured data is performing effective information discovery on such data. Previous work has addressed this problem in a generic (i.e., domain-independent) way, but the process can be improved if knowledge about the specific domain is taken into consideration.
This dissertation had two general goals. The first goal was to devise novel techniques to efficiently store and process semistructured documents. This goal had two specific aims: we proposed a method for storing semistructured documents that maps the physical characteristics of the documents to the geometrical layout of hard drives, and we developed a Double-Lazy Parser for semistructured documents, which introduces lazy behavior in both the pre-parsing and progressive parsing phases of the standard Document Object Model's parsing mechanism. The second goal was to construct a user-friendly and efficient engine for performing information discovery over domain-specific semistructured documents. This goal also had two aims: we presented a framework that exploits domain-specific knowledge to improve the quality of the information discovery process by incorporating domain ontologies, and we proposed meaningful evaluation metrics to compare the results of search systems over semistructured documents.
author Farfan, Fernando R
title Efficient Storage and Domain-Specific Information Discovery on Semistructured Documents
publisher FIU Digital Commons
publishDate 2009
url http://digitalcommons.fiu.edu/etd/126
http://digitalcommons.fiu.edu/cgi/viewcontent.cgi?article=1153&context=etd