A Fast Algorithm for the Largest Area First Parsing of Real Strings

The largest area first parsing of a string often leads to the best results in grammar compression for a variety of input data. However, the fastest existing algorithm has Θ(N² log N) time complexity, which makes it impractical for real-life applications. We present...

Bibliographic Details
Main Authors: Ivan Katanic, Strahil Ristov, Martin Rosenzweig
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
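Subjects: Greedy grammar compression; largest area first parsing; dynamic text indexing; enhanced suffix array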
Online Access: https://ieeexplore.ieee.org/document/9154361/
id doaj-444e467cea4949a5bb05cfa6fb8e856c
record_format Article
spelling doaj-444e467cea4949a5bb05cfa6fb8e856c
timestamp 2021-03-30T03:45:44Z
volume 8
pages 141990-142002
doi 10.1109/ACCESS.2020.3013676
document_number 9154361
author_affiliation Ivan Katanic (https://orcid.org/0000-0001-5293-3396), Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
Strahil Ristov (https://orcid.org/0000-0001-6039-0838), Ruđer Bošković Institute, Zagreb, Croatia
Martin Rosenzweig, Department of Mathematics, Technische Universität München, Munich, Germany
collection DOAJ
language English
format Article
sources DOAJ
author Ivan Katanic
Strahil Ristov
Martin Rosenzweig
title A Fast Algorithm for the Largest Area First Parsing of Real Strings
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2020-01-01
description The largest area first parsing of a string often leads to the best results in grammar compression for a variety of input data. However, the fastest existing algorithm has Θ(N² log N) time complexity, which makes it impractical for real-life applications. We present a new largest area first parsing method that has O(N³) complexity in the improbable worst case but runs in quasilinear time for most practical purposes. This result is based on the fact that, in real data, the sum of all depths of an LCP-interval tree, taken over all positions in the suffix array of an input string, exceeds the size of the input only by a small factor α. We present the analysis of the algorithm in terms of α, and the experimental results confirm that our method is practical even for genome-sized inputs. We provide the C++11 code for the implementation of our method. Additionally, we show that, by combining the previous and new algorithms, the worst-case complexity of the largest area first parsing is improved by a factor of ∛N.
topic Greedy grammar compression
largest area first parsing
dynamic text indexing
enhanced suffix array
url https://ieeexplore.ieee.org/document/9154361/
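
The description above characterises real inputs by a factor α: the sum of LCP-interval tree depths over all suffix array positions, relative to the input size N. The following Python sketch illustrates one way such a quantity could be measured on a small string. It is not the authors' C++11 implementation. The function names (`suffix_array`, `lcp_array`, `lcp_intervals`, `depth_sum`, `alpha`) are hypothetical, the suffix array is built naively, and the depth of a position is assumed here to be the number of LCP-intervals with LCP value at least 1 that contain it; the paper may count depth differently (for example, by including the root interval).

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; adequate only for small inputs.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # lcp[k] is the longest common prefix length of suffixes sa[k-1] and sa[k];
    # lcp[0] is 0 by convention.
    def lcp(a, b):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n

    return [0] + [lcp(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def lcp_intervals(lcp):
    # Stack-based enumeration of LCP-intervals as (value, left, right),
    # reporting only intervals with value >= 1 (the root is never popped).
    n = len(lcp)
    found = []
    stack = [(0, 0)]  # (lcp value, left boundary)
    for k in range(1, n + 1):
        cur = lcp[k] if k < n else 0  # sentinel 0 closes all open intervals
        left = k - 1
        while stack[-1][0] > cur:
            value, left = stack.pop()
            found.append((value, left, k - 1))
        if stack[-1][0] < cur:
            stack.append((cur, left))
    return found


def depth_sum(s):
    # Assumption: depth of a position = number of intervals (value >= 1)
    # containing it, so the sum of depths equals the sum of interval widths.
    sa = suffix_array(s)
    return sum(r - l + 1 for _, l, r in lcp_intervals(lcp_array(s, sa)))


def alpha(s):
    # Ratio of the depth sum to the input size, under the assumption above.
    return depth_sum(s) / len(s) if s else 0.0
```

Under these assumptions, `depth_sum("banana")` is 7, so `alpha("banana")` is 7/6. A string of one repeated character, such as `"a" * n`, produces n - 1 nested intervals and a depth sum that grows quadratically in n, which shows how strongly this measure depends on the input.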