Approximate String Matching with Compressed Indexes

A compressed full-text self-index for a text T is a data structure requiring reduced space and able to search for patterns P in T. It can also reproduce any substring of T, thus actually replacing T. Despite the recent explosion of interest in compressed indexes, there has not been much progress on...
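
The full abstract (see the description field of this record) names "the classical method of partitioning into exact search" as the starting point of the paper. As a rough illustration of that classical idea only, and not of the authors' implementation, the Python sketch below splits a pattern into k + 1 pieces, looks each piece up exactly, and verifies a text window around every hit with a standard edit-distance column computation. All function names are hypothetical, and a plain `str.find` scan stands in for the exact search that a compressed self-index would provide.

```python
def split_pattern(pattern, k):
    """Split pattern into k + 1 contiguous, non-empty pieces as (offset, piece)."""
    m, pieces = len(pattern), k + 1
    bounds = [i * m // pieces for i in range(pieces + 1)]
    return [(bounds[i], pattern[bounds[i]:bounds[i + 1]]) for i in range(pieces)]


def find_all(text, piece):
    """Yield every start of piece in text (assumption: stands in for an index lookup)."""
    pos = text.find(piece)
    while pos != -1:
        yield pos
        pos = text.find(piece, pos + 1)


def match_ends(pattern, window, k):
    """End offsets (exclusive) in window where pattern matches with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))  # col[0] stays 0, so a match may start anywhere
    ends = []
    for j, c in enumerate(window, start=1):
        prev_diag = col[0]
        for i in range(1, m + 1):
            cur = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match or substitution
                         col[i] + 1,                         # extra text character
                         col[i - 1] + 1)                     # missing pattern character
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends


def approximate_search(pattern, text, k):
    """Sorted end positions (exclusive) of occurrences of pattern with <= k errors."""
    m = len(pattern)
    if k < 0 or m <= k:
        raise ValueError("need 0 <= k < len(pattern) so every piece is non-empty")
    ends = set()
    for offset, piece in split_pattern(pattern, k):
        for pos in find_all(text, piece):
            # Any occurrence containing this piece intact lies inside this window.
            lo = max(0, pos - offset - k)
            hi = min(len(text), pos - offset + m + k)
            ends.update(lo + e for e in match_ends(pattern, text[lo:hi], k))
    return sorted(ends)


print(approximate_search("acad", "abracadabra", 0))  # [7]
```

The filter is lossless because k errors cannot touch all k + 1 pieces, so at least one piece of every occurrence appears unaltered in the text. In this sketch the verification step reads text windows directly; the abstract notes that such text accesses are the expensive part on a self-index.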

Bibliographic Details
Main Authors:	Pedro Morales, Arlindo L. Oliveira, Luís M. S. Russo, Gonzalo Navarro
Format:	Article
Language:	English
Published:	MDPI AG 2009-09-01
Series:	Algorithms
Subjects:	compressed index; approximate string matching; Lempel-Ziv; compressed suffix tree; compressed suffix array
Online Access:	http://www.mdpi.com/1999-4893/2/3/1105/

id	doaj-97d84b220c1846c0bdefb74fd5c73b7b
record_format	Article
spelling	doaj-97d84b220c1846c0bdefb74fd5c73b7b; 2020-11-25T00:18:34Z; eng; MDPI AG; Algorithms; ISSN 1999-4893; 2009-09-01; vol. 2, no. 3, pp. 1105-1136; DOI 10.3390/a2031105; Approximate String Matching with Compressed Indexes; Pedro Morales; Arlindo L. Oliveira; Luís M. S. Russo; Gonzalo Navarro; http://www.mdpi.com/1999-4893/2/3/1105/; compressed index; approximate string matching; Lempel-Ziv; compressed suffix tree; compressed suffix array
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Pedro Morales; Arlindo L. Oliveira; Luís M. S. Russo; Gonzalo Navarro
title	Approximate String Matching with Compressed Indexes
publisher	MDPI AG
series	Algorithms
issn	1999-4893
publishDate	2009-09-01
description	A compressed full-text self-index for a text T is a data structure requiring reduced space and able to search for patterns P in T. It can also reproduce any substring of T, thus actually replacing T. Despite the recent explosion of interest in compressed indexes, there has not been much progress on functionalities beyond the basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in bioinformatics. We study ASM algorithms for Lempel-Ziv compressed indexes and for compressed suffix trees/arrays. Most compressed self-indexes belong to one of these classes. We start by adapting the classical method of partitioning into exact search to self-indexes, and optimize it over a representative of either class of self-index. Then, we show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to a Lempel-Ziv index. Finally, we improve hierarchical verification, a successful technique for sequential searching, so as to extend the matches of pattern pieces to the left or right. Most compressed suffix trees/arrays support the required bidirectionality, thus enabling the implementation of the improved technique. In turn, the improved verification largely reduces the accesses to the text, which are expensive in self-indexes. We show experimentally that our algorithms are competitive and provide useful space-time tradeoffs compared to classical indexes.
topic	compressed index; approximate string matching; Lempel-Ziv; compressed suffix tree; compressed suffix array
url	http://www.mdpi.com/1999-4893/2/3/1105/

Similar Items