Efficient Disk-Based Techniques for Manipulating Very Large String Databases

Indexing and processing strings are very important topics in database management. Strings can be database records, DNA sequences, protein sequences, or plain text. Various string operations are required for several application categories, such as bioinformatics and entity resolution. When the string...

Bibliographic Details
Main Author:	Allam, Amin
Other Authors:	Kalnis, Panos
Language:	en
Published:	2017
Subjects:	large databases; string processing; disk-based; Suffix tree; record linkage; error correction
Online Access:	http://hdl.handle.net/10754/623691 http://repository.kaust.edu.sa/kaust/handle/10754/623691

id	ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-623691
record_format	oai_dc
spelling	ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-623691 2017-05-25T04:03:37Z Efficient Disk-Based Techniques for Manipulating Very Large String Databases Allam, Amin Kalnis, Panos Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division Gao, Xin Moshkov, Mikhail Mokbel, Mohamed large databases string processing disk-based Suffix tree record linkage error correction Indexing and processing strings are very important topics in database management. Strings can be database records, DNA sequences, protein sequences, or plain text. Various string operations are required for several application categories, such as bioinformatics and entity resolution. When the string count or sizes become very large, several state-of-the-art techniques for indexing and processing such strings may fail or behave very inefficiently. Modifying an existing technique to overcome these issues is not usually straightforward or even possible. A category of string operations can be facilitated by the suffix tree data structure, which basically indexes a long string to enable efficient finding of any substring of the indexed string, and can be used in other operations as well, such as approximate string matching. In this document, we introduce a novel efficient method to construct the suffix tree index for very long strings using parallel architectures, which is a major challenge in this category. Another category of string operations requires clustering similar strings in order to perform application-specific processing on the resulting possibly-overlapping clusters. In this document, based on clustering similar strings, we introduce a novel efficient technique for record linkage and entity resolution, and a novel method for correcting errors in a large number of small strings (read sequences) generated by DNA sequencing machines. 2017-05-18 Dissertation http://hdl.handle.net/10754/623691 http://repository.kaust.edu.sa/kaust/handle/10754/623691 en
collection	NDLTD
language	en
sources	NDLTD
topic	large databases string processing disk-based Suffix tree record linkage error correction
spellingShingle	large databases string processing disk-based Suffix tree record linkage error correction Allam, Amin Efficient Disk-Based Techniques for Manipulating Very Large String Databases
description	Indexing and processing strings are very important topics in database management. Strings can be database records, DNA sequences, protein sequences, or plain text. Various string operations are required for several application categories, such as bioinformatics and entity resolution. When the string count or sizes become very large, several state-of-the-art techniques for indexing and processing such strings may fail or behave very inefficiently. Modifying an existing technique to overcome these issues is usually not straightforward, and sometimes not possible. A category of string operations can be facilitated by the suffix tree data structure, which indexes a long string to enable efficient finding of any substring of the indexed string, and can be used in other operations as well, such as approximate string matching. In this document, we introduce a novel efficient method to construct the suffix tree index for very long strings using parallel architectures, which is a major challenge in this category. Another category of string operations requires clustering similar strings in order to perform application-specific processing on the resulting possibly-overlapping clusters. In this document, based on clustering similar strings, we introduce a novel efficient technique for record linkage and entity resolution, and a novel method for correcting errors in a large number of small strings (read sequences) generated by DNA sequencing machines.
author2	Kalnis, Panos
author_facet	Kalnis, Panos Allam, Amin
author	Allam, Amin
author_sort	Allam, Amin
title	Efficient Disk-Based Techniques for Manipulating Very Large String Databases
title_short	Efficient Disk-Based Techniques for Manipulating Very Large String Databases
title_full	Efficient Disk-Based Techniques for Manipulating Very Large String Databases
title_fullStr	Efficient Disk-Based Techniques for Manipulating Very Large String Databases
title_full_unstemmed	Efficient Disk-Based Techniques for Manipulating Very Large String Databases
title_sort	efficient disk-based techniques for manipulating very large string databases
publishDate	2017
url	http://hdl.handle.net/10754/623691 http://repository.kaust.edu.sa/kaust/handle/10754/623691
work_keys_str_mv	AT allamamin efficientdiskbasedtechniquesformanipulatingverylargestringdatabases
_version_	1718453571649798144
