Efficient Data Structures for Range Shortest Unique Substring Queries
Let $\mathsf{T}[1,n]$...
Main Authors: | Paniz Abedin, Arnab Ganguly, Solon P. Pissis, Sharma V. Thankachan |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2020-10-01 |
Series: | Algorithms |
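Subjects: | shortest unique substring; suffix tree; heavy-light decomposition; range queries; geometric data structures |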
Online Access: | https://www.mdpi.com/1999-4893/13/11/276 |
id |
doaj-b14b1c6800df4645a05cf5809f528568 |
---|---|
record_format |
Article |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Paniz Abedin; Arnab Ganguly; Solon P. Pissis; Sharma V. Thankachan |
title |
Efficient Data Structures for Range Shortest Unique Substring Queries |
publisher |
MDPI AG |
series |
Algorithms |
issn |
1999-4893 |
publishDate |
2020-10-01 |
description |
Let $\mathsf{T}[1,n]$ be a string of length $n$ and $\mathsf{T}[i,j]$ be the substring of $\mathsf{T}$ starting at position $i$ and ending at position $j$. A substring $\mathsf{T}[i,j]$ of $\mathsf{T}$ is a repeat if it occurs more than once in $\mathsf{T}$; otherwise, it is a unique substring of $\mathsf{T}$. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string $\mathsf{T}$ as input, the Shortest Unique Substring problem is to find a shortest substring of $\mathsf{T}$ that does not occur elsewhere in $\mathsf{T}$. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem.
The task is to construct a data structure over $\mathsf{T}$ answering the following type of online queries efficiently. Given a range $[\alpha,\beta]$, return a shortest substring $\mathsf{T}[i,j]$ of $\mathsf{T}$ with exactly one occurrence in $[\alpha,\beta]$. We present an $\mathcal{O}(n \log n)$-word data structure with $\mathcal{O}(\log_w n)$ query time, where $w = \Omega(\log n)$ is the word size. Our construction is based on a non-trivial reduction that allows us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018].
Additionally, we present an $\mathcal{O}(n)$-word data structure with $\mathcal{O}(\sqrt{n} \log^{\epsilon} n)$ query time, where $\epsilon > 0$ is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012]. |
topic |
shortest unique substring; suffix tree; heavy-light decomposition; range queries; geometric data structures |
url |
https://www.mdpi.com/1999-4893/13/11/276 |
spelling |
doaj-b14b1c6800df4645a05cf5809f528568; 2020-11-25T04:00:46Z; eng; MDPI AG; Algorithms; 1999-4893; 2020-10-01; 13; 276; 276; 10.3390/a13110276
Efficient Data Structures for Range Shortest Unique Substring Queries
Paniz Abedin [0]; Arnab Ganguly [1]; Solon P. Pissis [2]; Sharma V. Thankachan [3]
[0] Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
[1] Department of Computer Science, University of Wisconsin - Whitewater, Whitewater, WI 53190, USA
[2] Life Sciences and Health, CWI, 1098 XG Amsterdam, The Netherlands
[3] Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
https://www.mdpi.com/1999-4893/13/11/276
shortest unique substring; suffix tree; heavy-light decomposition; range queries; geometric data structures |
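
The range query defined in the abstract can be illustrated with a brute-force sketch. This is not the data structure from the article: it is a naive reference routine that rescans the text for every query, which may be useful for checking a faster implementation on small inputs. The function name `range_shortest_unique_substring` is hypothetical, and the code assumes that an occurrence "in" the range is counted by its starting position; the record itself does not spell this out.

```python
from collections import defaultdict


def range_shortest_unique_substring(text, alpha, beta):
    """Naive range SUS query on a 1-indexed, inclusive range [alpha, beta].

    ASSUMPTION (not stated in the record): an occurrence belongs to the
    range when its starting position lies in [alpha, beta].
    Returns a 1-indexed, inclusive pair (i, j) with text[i..j] as the answer.
    """
    n = len(text)
    if not (1 <= alpha <= beta <= n):
        raise ValueError("expected 1 <= alpha <= beta <= len(text)")

    # Try lengths in increasing order so the first hit is a shortest answer.
    for length in range(1, n - alpha + 2):
        starts = defaultdict(list)
        # Only starting positions that leave room for `length` characters.
        last_start = min(beta, n - length + 1)
        for i in range(alpha, last_start + 1):
            starts[text[i - 1:i - 1 + length]].append(i)
        # A substring qualifies if exactly one occurrence starts in range.
        for positions in starts.values():
            if len(positions) == 1:
                i = positions[0]
                return i, i + length - 1
    # Unreachable for valid input: at length n - alpha + 1 only the start
    # position alpha fits, so that substring is always unique in the range.
    return None


if __name__ == "__main__":
    print(range_shortest_unique_substring("abab", 1, 4))
    print(range_shortest_unique_substring("abab", 1, 2))
```

Under the same assumption, querying the full range `[1, n]` reduces to the classical Shortest Unique Substring problem. Each query here takes time polynomial in `n`, in contrast to the $\mathcal{O}(\log_w n)$ and $\mathcal{O}(\sqrt{n} \log^{\epsilon} n)$ query times stated in the abstract.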