A Fast Algorithm for the Largest Area First Parsing of Real Strings
The largest area first parsing of a string often leads to the best results in grammar compression for a variety of input data. However, the fastest existing algorithm has Θ(N<sup>2</sup> log N) time complexity, which makes it impractical for real-life applications. We present...
Main Authors: | Ivan Katanic, Strahil Ristov, Martin Rosenzweig |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2020-01-01 |
Series: | IEEE Access |
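Subjects: | Greedy grammar compression; largest area first parsing; dynamic text indexing; enhanced suffix array |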
Online Access: | https://ieeexplore.ieee.org/document/9154361/ |
id |
doaj-444e467cea4949a5bb05cfa6fb8e856c |
---|---|
record_format |
Article |
spelling |
doaj-444e467cea4949a5bb05cfa6fb8e856c; 2021-03-30T03:45:44Z; eng; IEEE; IEEE Access; 2169-3536; 2020-01-01; 8; 141990; 142002; 10.1109/ACCESS.2020.3013676; 9154361; A Fast Algorithm for the Largest Area First Parsing of Real Strings; Ivan Katanic (0, https://orcid.org/0000-0001-5293-3396); Strahil Ristov (1, https://orcid.org/0000-0001-6039-0838); Martin Rosenzweig (2); Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia; Ruđer Bošković Institute, Zagreb, Croatia; Department of Mathematics, Technische Universität München, Munich, Germany; The largest area first parsing of a string often leads to the best results in grammar compression for a variety of input data. However, the fastest existing algorithm has Θ(N<sup>2</sup> log N) time complexity, which makes it impractical for real-life applications. We present a new largest area first parsing method that has O(N<sup>3</sup>) complexity in the improbable worst case but runs in quasilinear time for most practical purposes. This result is based on the fact that in real data, the sum of all depths of an LCP-interval tree, over all of the positions in a suffix array of an input string, is only larger than the size of the input by a small factor α. We present the analysis of the algorithm in terms of α, and the experimental results confirm that our method is practical even for genome-sized inputs. We provide the C++11 code for the implementation of our method. Additionally, we show that by combining the previous and new algorithms, the worst-case complexity of the largest area first parsing is improved by a factor of <sup>3</sup>√N.; https://ieeexplore.ieee.org/document/9154361/; Greedy grammar compression; largest area first parsing; dynamic text indexing; enhanced suffix array |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Ivan Katanic; Strahil Ristov; Martin Rosenzweig |
title |
A Fast Algorithm for the Largest Area First Parsing of Real Strings |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2020-01-01 |
description |
The largest area first parsing of a string often leads to the best results in grammar compression for a variety of input data. However, the fastest existing algorithm has Θ(N<sup>2</sup> log N) time complexity, which makes it impractical for real-life applications. We present a new largest area first parsing method that has O(N<sup>3</sup>) complexity in the improbable worst case but runs in quasilinear time for most practical purposes. This result is based on the fact that in real data, the sum of all depths of an LCP-interval tree, over all of the positions in a suffix array of an input string, is only larger than the size of the input by a small factor α. We present the analysis of the algorithm in terms of α, and the experimental results confirm that our method is practical even for genome-sized inputs. We provide the C++11 code for the implementation of our method. Additionally, we show that by combining the previous and new algorithms, the worst-case complexity of the largest area first parsing is improved by a factor of <sup>3</sup>√N. |
topic |
Greedy grammar compression; largest area first parsing; dynamic text indexing; enhanced suffix array |
url |
https://ieeexplore.ieee.org/document/9154361/ |
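
The abstract ties the running time to a factor α: the sum of LCP-interval tree depths over all suffix array positions, divided by the input size. The Python sketch below is an illustration only and is not the authors' C++11 implementation. It assumes that the depth of a position is the number of LCP intervals with LCP value at least 1 that contain it (the root interval is not counted); the paper's exact definition may differ. The function names `suffix_array`, `lcp_array`, `lcp_intervals`, and `alpha` are hypothetical, and the suffix array is built naively, which is only suitable for short strings.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; fine for a small illustration.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # lcp[i] is the longest common prefix length of suffixes sa[i-1] and sa[i];
    # lcp[0] is 0 by convention.
    def common(a, b):
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        return k

    return [0] + [common(sa[i - 1], sa[i]) for i in range(1, len(sa))]


def lcp_intervals(lcp):
    # Stack-based bottom-up enumeration of LCP intervals.
    # Returns (ell, lb, rb) triples for every interval with ell >= 1.
    n = len(lcp)
    stack = [(0, 0)]  # (lcp value, left boundary); the root stays on the stack
    found = []
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else 0  # a final 0 flushes all open intervals
        lb = i - 1
        while stack[-1][0] > cur:
            ell, left = stack.pop()
            found.append((ell, left, i - 1))
            lb = left
        if stack[-1][0] < cur:
            stack.append((cur, lb))
    return found


def alpha(s):
    # Each interval adds 1 to the depth of every position it covers, so the
    # sum of depths over all positions equals the sum of interval widths.
    if not s:
        return 0.0
    sa = suffix_array(s)
    intervals = lcp_intervals(lcp_array(s, sa))
    depth_sum = sum(rb - lb + 1 for _, lb, rb in intervals)
    return depth_sum / len(s)


if __name__ == "__main__":
    for text in ("abab", "aaaa", "abcd"):
        print(text, alpha(text))
```

Under these assumptions, a highly repetitive string such as a run of one character yields deeply nested intervals and a depth sum that grows quadratically with its length, which is consistent with the abstract's distinction between the improbable worst case and real data where α stays small.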