Text Indexing for Regular Expression Matching


Bibliographic Details
Main Authors: Daniel Gibney, Sharma V. Thankachan
Format: Article
Language: English
Published: MDPI AG 2021-04-01
Series: Algorithms
Online Access: https://www.mdpi.com/1999-4893/14/5/133
Description
Summary: Finding substrings of a text $T$ that match a regular expression $p$ is a fundamental problem. Despite being the subject of extensive research, no solution with a time complexity significantly better than $O(|T||p|)$ has been found. Backurs and Indyk in FOCS 2016 established conditional lower bounds for the algorithmic problem based on the Strong Exponential Time Hypothesis that help explain this difficulty. A natural question is whether we can improve the time complexity for matching the regular expression by preprocessing the text $T$. We show that, conditioned on the Online Matrix–Vector Multiplication (OMv) conjecture, even with arbitrary polynomial preprocessing time, a regular expression query on a text cannot be answered in strongly sublinear time, i.e., $O(|T|^{1-\varepsilon})$ for any $\varepsilon > 0$.
Furthermore, if we extend the OMv conjecture to a plausible conjecture regarding Boolean matrix multiplication with polynomial preprocessing time, which we call Online Matrix–Matrix Multiplication (OMM), we can strengthen this hardness result to show that no solution has a query time of $O(|T|^{3/2-\varepsilon})$. These results hold for alphabets of size three or greater. We then provide data structures that answer queries in $O\left(\frac{|T||p|}{\tau}\right)$ time, where $\tau \in [1, |T|]$ is fixed at construction. These include a solution that works for all regular expressions with $\mathrm{Exp}(\tau \cdot |T|)$ preprocessing time and space.
For patterns containing only ‘concatenation’ and ‘or’ operators (the same type used in the hardness result), we provide (1) a deterministic solution which requires $\mathrm{Exp}\left(\tau \cdot \frac{|T|}{\log^2 |T|}\right)$ preprocessing time and space, and (2) when $|p| \le |T|^z$ for $z = 2^{o(\sqrt{\log |T|})}$, a randomized solution with amortized query time which answers queries correctly with high probability, requiring $\mathrm{Exp}\left(\tau \cdot \frac{|T|}{2^{\Omega(\sqrt{\log |T|})}}\right)$ preprocessing time and space.
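Illustrative note: to make the restricted pattern class concrete, the following is a minimal sketch of matching a pattern built only from ‘concatenation’ and ‘or’ against a text without any preprocessing. It is not the article's data structure; it only illustrates a simple baseline that does work proportional to the text length for each pattern node. The tuple encoding of patterns and the names `end_positions` and `has_match` are assumptions made for this example.

```python
# Hypothetical pattern encoding (an assumption for this sketch):
#   ("lit", ch)            a single character
#   ("cat", left, right)   concatenation
#   ("or",  left, right)   alternation
Pattern = tuple


def end_positions(p: Pattern, text: str, starts: set) -> set:
    """Return every j such that text[i:j] matches p for some i in starts."""
    kind = p[0]
    if kind == "lit":
        ch = p[1]
        return {i + 1 for i in starts if i < len(text) and text[i] == ch}
    if kind == "cat":
        # Matches of the left part become start positions for the right part.
        return end_positions(p[2], text, end_positions(p[1], text, starts))
    if kind == "or":
        return end_positions(p[1], text, starts) | end_positions(p[2], text, starts)
    raise ValueError(f"unknown pattern node: {kind!r}")


def has_match(p: Pattern, text: str) -> bool:
    """True if some substring of text matches p (every position may start a match)."""
    return bool(end_positions(p, text, set(range(len(text) + 1))))


# Usage: the pattern (a|b)c
pattern = ("cat", ("or", ("lit", "a"), ("lit", "b")), ("lit", "c"))
print(has_match(pattern, "xxbcx"))
print(has_match(pattern, "abab"))
```

Each pattern node is visited once and handles a set of at most $|T| + 1$ positions, so the sketch stays within roughly $|T| \cdot |p|$ set operations. The indexing question in the summary is whether preprocessing $T$ can reduce this per-query cost.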
ISSN: 1999-4893