Efficient Data Structures for Range Shortest Unique Substring Queries

Bibliographic Details
Main Authors: Paniz Abedin, Arnab Ganguly, Solon P. Pissis, Sharma V. Thankachan
Format: Article
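Language: English
Published: MDPI AG 2020-10-01
Series: Algorithms
Subjects: shortest unique substring, suffix tree, heavy-light decomposition, range queries, geometric data structures
Online Access: https://www.mdpi.com/1999-4893/13/11/276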
id doaj-b14b1c6800df4645a05cf5809f528568
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Paniz Abedin
Arnab Ganguly
Solon P. Pissis
Sharma V. Thankachan
title Efficient Data Structures for Range Shortest Unique Substring Queries
publisher MDPI AG
series Algorithms
issn 1999-4893
publishDate 2020-10-01
description Let $\mathsf{T}[1,n]$ be a string of length $n$ and $\mathsf{T}[i,j]$ be the substring of $\mathsf{T}$ starting at position $i$ and ending at position $j$. A substring $\mathsf{T}[i,j]$ of $\mathsf{T}$ is a repeat if it occurs more than once in $\mathsf{T}$; otherwise, it is a unique substring of $\mathsf{T}$. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string $\mathsf{T}$ as input, the Shortest Unique Substring problem is to find a shortest substring of $\mathsf{T}$ that does not occur elsewhere in $\mathsf{T}$. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem.
The task is to construct a data structure over $\mathsf{T}$ that efficiently answers online queries of the following type: given a range $[\alpha,\beta]$, return a shortest substring $\mathsf{T}[i,j]$ of $\mathsf{T}$ with exactly one occurrence in $[\alpha,\beta]$. We present an $\mathcal{O}(n \log n)$-word data structure with $\mathcal{O}(\log_w n)$ query time, where $w = \Omega(\log n)$ is the word size. Our construction is based on a non-trivial reduction allowing us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018].
Additionally, we present an $\mathcal{O}(n)$-word data structure with $\mathcal{O}(\sqrt{n} \log^{\epsilon} n)$ query time, where $\epsilon > 0$ is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012].
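To make the query in the abstract concrete, the following is a minimal brute-force sketch, not the data structures the article presents. It assumes that an occurrence is counted by its starting position, so a candidate may extend past the right end of the range; the abstract does not spell this out, so treat it as one possible reading. The function name `range_sus` and its 1-indexed interface are hypothetical and chosen only for illustration.

```python
def range_sus(text, alpha, beta):
    """Brute-force range shortest unique substring (illustrative only).

    Positions are 1-indexed and inclusive, as in the abstract.
    ASSUMPTION: an occurrence "in [alpha, beta]" means its starting
    position lies in that range; the substring may end after beta.
    Returns (i, j) such that text[i..j] is a shortest such substring.
    """
    n = len(text)
    if not (1 <= alpha <= beta <= n):
        raise ValueError("expected 1 <= alpha <= beta <= len(text)")

    # Try lengths in increasing order; the first hit is a shortest answer.
    for length in range(1, n - alpha + 2):
        counts = {}       # substring -> number of starts in [alpha, beta]
        first_start = {}  # substring -> leftmost start in [alpha, beta]
        for k in range(alpha, beta + 1):
            if k + length - 1 > n:
                break  # later starts would run past the end of the text
            sub = text[k - 1:k - 1 + length]
            counts[sub] = counts.get(sub, 0) + 1
            first_start.setdefault(sub, k)
        # Dicts keep insertion order, so this picks the leftmost candidate.
        for sub, count in counts.items():
            if count == 1:
                k = first_start[sub]
                return k, k + length - 1

    # Unreachable for valid input: at length n - alpha + 1 only the start
    # alpha fits, so that substring always has exactly one occurrence.
    return None
```

For example, `range_sus("abab", 1, 4)` examines length 1 first (both letters start twice in the range) and then length 2, where "ba" starts only at position 2. This sketch takes time polynomial in $n$ per query, which is the cost the $\mathcal{O}(\log_w n)$ and $\mathcal{O}(\sqrt{n} \log^{\epsilon} n)$ query times stated above are meant to avoid.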
topic shortest unique substring
suffix tree
heavy-light decomposition
range queries
geometric data structures
url https://www.mdpi.com/1999-4893/13/11/276
_version_ 1724449391932080128
spelling doaj-b14b1c6800df4645a05cf5809f528568
2020-11-25T04:00:46Z
eng
MDPI AG
Algorithms
1999-4893
2020-10-01
13
276
276
10.3390/a13110276
Efficient Data Structures for Range Shortest Unique Substring Queries
Paniz Abedin (Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA)
Arnab Ganguly (Department of Computer Science, University of Wisconsin - Whitewater, Whitewater, WI 53190, USA)
Solon P. Pissis (Life Sciences and Health, CWI, 1098 XG Amsterdam, The Netherlands)
Sharma V. Thankachan (Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA)