Efficient Data Structures for Range Shortest Unique Substring Queries

Let $\mathsf{T}[1,n]$...

Bibliographic Details
Main Authors:	Paniz Abedin, Arnab Ganguly, Solon P. Pissis, Sharma V. Thankachan
Format:	Article
Language:	English
Published:	MDPI AG 2020-10-01
Series:	Algorithms
Subjects:	shortest unique substring, suffix tree, heavy-light decomposition, range queries, geometric data structures
Online Access:	https://www.mdpi.com/1999-4893/13/11/276

id	doaj-b14b1c6800df4645a05cf5809f528568
record_format	Article
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Paniz Abedin, Arnab Ganguly, Solon P. Pissis, Sharma V. Thankachan
title	Efficient Data Structures for Range Shortest Unique Substring Queries
publisher	MDPI AG
series	Algorithms
issn	1999-4893
publishDate	2020-10-01
description	Let $\mathsf{T}[1,n]$ be a string of length $n$ and $\mathsf{T}[i,j]$ be the substring of $\mathsf{T}$ starting at position $i$ and ending at position $j$. A substring $\mathsf{T}[i,j]$ of $\mathsf{T}$ is a repeat if it occurs more than once in $\mathsf{T}$; otherwise, it is a unique substring of $\mathsf{T}$. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string $\mathsf{T}$ as input, the Shortest Unique Substring problem is to find a shortest substring of $\mathsf{T}$ that does not occur elsewhere in $\mathsf{T}$. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem.
The task is to construct a data structure over $\mathsf{T}$ answering the following type of online queries efficiently. Given a range $[\alpha,\beta]$, return a shortest substring $\mathsf{T}[i,j]$ of $\mathsf{T}$ with exactly one occurrence in $[\alpha,\beta]$. We present an $\mathcal{O}(n\log n)$-word data structure with $\mathcal{O}(\log_w n)$ query time, where $w=\Omega(\log n)$ is the word size. Our construction is based on a non-trivial reduction that allows us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018].
Additionally, we present an $\mathcal{O}(n)$-word data structure with $\mathcal{O}(\sqrt{n}\log^{\epsilon} n)$ query time, where $\epsilon>0$ is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012].
topic	shortest unique substring, suffix tree, heavy-light decomposition, range queries, geometric data structures
url	https://www.mdpi.com/1999-4893/13/11/276
spelling	doaj-b14b1c6800df4645a05cf5809f528568 | 2020-11-25T04:00:46Z | eng | MDPI AG | Algorithms | 1999-4893 | 2020-10-01 | 13 | 276 | 276 | 10.3390/a13110276 | Efficient Data Structures for Range Shortest Unique Substring Queries | Paniz Abedin (0) | Arnab Ganguly (1) | Solon P. Pissis (2) | Sharma V. Thankachan (3) | (0) Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA | (1) Department of Computer Science, University of Wisconsin - Whitewater, Whitewater, WI 53190, USA | (2) Life Sciences and Health, CWI, 1098 XG Amsterdam, The Netherlands | (3) Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA | https://www.mdpi.com/1999-4893/13/11/276 | shortest unique substring | suffix tree | heavy-light decomposition | range queries | geometric data structures
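
To make the query described in the abstract concrete, here is a minimal brute-force sketch in Python. It is not one of the data structures from the paper; it only illustrates what a Range Shortest Unique Substring query asks for. It assumes that "exactly one occurrence in $[\alpha,\beta]$" counts occurrences by their starting position, so a reported substring may extend past $\beta$; the abstract does not spell this out, so treat it as an assumption. The function name `range_sus_naive` and the leftmost tie-breaking rule are illustrative choices, not taken from the source.

```python
def range_sus_naive(text, alpha, beta):
    """Brute-force Range Shortest Unique Substring query (illustrative only).

    Positions are 1-indexed and inclusive, as in the abstract.
    Assumption: an occurrence is counted when its STARTING position lies in
    [alpha, beta]; the occurrence itself may end after beta.
    Returns (i, j) such that text[i..j] is a shortest substring with exactly
    one such occurrence; ties are broken by the leftmost start (our choice).
    """
    n = len(text)
    if not (1 <= alpha <= beta <= n):
        raise ValueError("need 1 <= alpha <= beta <= len(text)")

    # Try lengths in increasing order; the suffix starting at alpha is always
    # unique at length n - alpha + 1, so the loop is guaranteed to return.
    for length in range(1, n - alpha + 2):
        counts = {}       # substring -> number of starts in [alpha, beta]
        first_start = {}  # substring -> leftmost start in [alpha, beta]
        for start in range(alpha, beta + 1):
            end = start + length - 1
            if end > n:
                break  # later starts cannot fit a substring of this length
            sub = text[start - 1:end]
            counts[sub] = counts.get(sub, 0) + 1
            first_start.setdefault(sub, start)
        unique_starts = [first_start[s] for s, c in counts.items() if c == 1]
        if unique_starts:
            i = min(unique_starts)
            return (i, i + length - 1)
    return None  # unreachable for valid input


if __name__ == "__main__":
    print(range_sus_naive("abab", 1, 4))  # (2, 3): "ba" starts once in [1, 4]
    print(range_sus_naive("aaaa", 2, 3))  # (2, 4): "aaa" starts once in [2, 3]
```

Each query in this sketch rescans the range for every candidate length, so it is far slower than the $\mathcal{O}(\log_w n)$ and $\mathcal{O}(\sqrt{n}\log^{\epsilon} n)$ query times stated in the abstract; it is useful mainly as a reference for checking small examples.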

Similar Items