External memory BWT and LCP computation for sequence collections with applications

Abstract. Background: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Becaus...


Bibliographic Details
Main Authors: Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, Guilherme P. Telles
Format: Article
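Language: English
Published: BMC 2019-03-01
Series: Algorithms for Molecular Biology
Subjects: Burrows–Wheeler Transform; Longest common prefix array; Maximal repeats; All pairs suffix–prefix overlaps; Succinct de Bruijn graph; External memory algorithms
Online Access: http://link.springer.com/article/10.1186/s13015-019-0140-0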
id doaj-2ec00b4a05644bf9af467fff8f8b094a
record_format Article
spelling doaj-2ec00b4a05644bf9af467fff8f8b094a
2020-11-25T02:17:13Z
eng
BMC
Algorithms for Molecular Biology
1748-7188
2019-03-01
Volume 14, Issue 1, pages 1-15
10.1186/s13015-019-0140-0
External memory BWT and LCP computation for sequence collections with applications
Lavinia Egidi (DiSIT, University of Eastern Piedmont)
Felipe A. Louza (Department of Computing and Mathematics, University of São Paulo)
Giovanni Manzini (DiSIT, University of Eastern Piedmont)
Guilherme P. Telles (Institute of Computing, University of Campinas)
collection DOAJ
language English
format Article
sources DOAJ
author Lavinia Egidi
Felipe A. Louza
Giovanni Manzini
Guilherme P. Telles
author_sort Lavinia Egidi
title External memory BWT and LCP computation for sequence collections with applications
publisher BMC
series Algorithms for Molecular Biology
issn 1748-7188
publishDate 2019-03-01
description Abstract. Background: Sequencing technologies produce ever larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input, it is important to build these data structures in external memory while making the best possible use of the available RAM. Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections small enough that their BWTs can be computed in RAM using an optimal linear-time algorithm. Next, it merges the partial BWTs in external or semi-external memory, computing the LCP values in the process. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well-known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs. Conclusions: We prove that our algorithm performs $\mathcal{O}(n\,\mathsf{maxlcp})$ sequential I/Os, where $n$ is the total length of the collection and $\mathsf{maxlcp}$ is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences, but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
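To make the two arrays named in the abstract concrete, the following is a minimal in-memory sketch of what the BWT and LCP array of a small sequence collection look like. It is not the authors' external memory merge algorithm: it sorts all suffixes naively, which is only feasible for toy inputs. The function name `bwt_lcp` and the end-marker convention (each sequence gets its own marker, ordered by sequence index and smaller than every real symbol, all printed as `$`) are assumptions made for illustration.

```python
def bwt_lcp(seqs):
    """Naive BWT and LCP array of a small collection (illustration only)."""
    suffixes = []
    for i, s in enumerate(seqs):
        # Encode real symbols as (1, char) and the end marker of sequence i
        # as (0, i), so markers sort before symbols and by sequence index.
        text = [(1, c) for c in s] + [(0, i)]
        for j in range(len(text)):
            # text[j - 1] is the symbol preceding the suffix; for j == 0
            # it wraps around to the sequence's own end marker.
            suffixes.append((text[j:], text[j - 1]))

    # Suffixes are pairwise distinct (unique markers), so the order is total.
    suffixes.sort(key=lambda pair: pair[0])

    # BWT: the preceding symbol of each suffix, in sorted suffix order.
    bwt = "".join("$" if kind == 0 else val for _, (kind, val) in suffixes)

    # LCP: length of the common prefix of each suffix with its predecessor.
    lcp = [0]
    for (a, _), (b, _) in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp.append(k)
    return bwt, lcp


if __name__ == "__main__":
    bwt, lcp = bwt_lcp(["banana"])
    print(bwt)  # annb$aa
    print(lcp)  # [0, 0, 1, 3, 0, 0, 2]
```

The largest entry of the returned LCP list plays the role of maxlcp in the I/O bound quoted in the abstract; in the single-sequence example above it is 3.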
topic Burrows–Wheeler Transform
Longest common prefix array
Maximal repeats
All pairs suffix–prefix overlaps
Succinct de Bruijn graph
External memory algorithms
url http://link.springer.com/article/10.1186/s13015-019-0140-0