Prefix-free parsing for building big BWTs
Abstract: High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these...
Main Authors: | Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, Taher Mun |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2019-05-01 |
Series: | Algorithms for Molecular Biology |
Subjects: | Burrows-Wheeler Transform, Prefix-free parsing, Compression-aware algorithms, Genomic databases |
Online Access: | http://link.springer.com/article/10.1186/s13015-019-0148-5 |
id |
doaj-0740dbfd57f5443c975843203a7b99e9 |
---|---|
record_format |
Article |
spelling |
doaj-0740dbfd57f5443c975843203a7b99e9; 2020-11-25T03:21:55Z; eng; BMC; Algorithms for Molecular Biology; 1748-7188; 2019-05-01; vol. 14, no. 1, pp. 1-15; 10.1186/s13015-019-0148-5; Prefix-free parsing for building big BWTs; Christina Boucher (CISE, University of Florida), Travis Gagie (EIT, Diego Portales University), Alan Kuhnle (CISE, University of Florida), Ben Langmead (Johns Hopkins University), Giovanni Manzini (University of Eastern Piedmont), Taher Mun (Johns Hopkins University); Abstract: High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73-GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.; http://link.springer.com/article/10.1186/s13015-019-0148-5; Burrows-Wheeler Transform, Prefix-free parsing, Compression-aware algorithms, Genomic databases |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Christina Boucher Travis Gagie Alan Kuhnle Ben Langmead Giovanni Manzini Taher Mun |
spellingShingle |
Christina Boucher Travis Gagie Alan Kuhnle Ben Langmead Giovanni Manzini Taher Mun Prefix-free parsing for building big BWTs Algorithms for Molecular Biology Burrows-Wheeler Transform Prefix-free parsing Compression-aware algorithms Genomic databases |
author_facet |
Christina Boucher Travis Gagie Alan Kuhnle Ben Langmead Giovanni Manzini Taher Mun |
author_sort |
Christina Boucher |
title |
Prefix-free parsing for building big BWTs |
title_short |
Prefix-free parsing for building big BWTs |
title_full |
Prefix-free parsing for building big BWTs |
title_fullStr |
Prefix-free parsing for building big BWTs |
title_full_unstemmed |
Prefix-free parsing for building big BWTs |
title_sort |
prefix-free parsing for building big bwts |
publisher |
BMC |
series |
Algorithms for Molecular Biology |
issn |
1748-7188 |
publishDate |
2019-05-01 |
description |
Abstract: High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73-GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory. |
topic |
Burrows-Wheeler Transform Prefix-free parsing Compression-aware algorithms Genomic databases |
url |
http://link.springer.com/article/10.1186/s13015-019-0148-5 |
work_keys_str_mv |
AT christinaboucher prefixfreeparsingforbuildingbigbwts AT travisgagie prefixfreeparsingforbuildingbigbwts AT alankuhnle prefixfreeparsingforbuildingbigbwts AT benlangmead prefixfreeparsingforbuildingbigbwts AT giovannimanzini prefixfreeparsingforbuildingbigbwts AT tahermun prefixfreeparsingforbuildingbigbwts |
_version_ |
1724612423629930496 |
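
Illustrative note on the abstract: the record describes prefix-free parsing only at a high level (one pass over a text T, producing a dictionary D and a parse P). The Python sketch below is one possible reading of that description, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names `prefix_free_parse` and `reconstruct`, the parameters `w` and `modulus`, the use of `zlib.crc32` as the window hash (recomputed per window rather than rolled, so this sketch runs in O(|T|·w) time, not O(|T|)), and the two control characters used as start and end sentinels. Phrases are cut wherever a length-`w` window hashes to 0 modulo `modulus`, and consecutive phrases overlap by `w` characters.

```python
import zlib


def prefix_free_parse(text, w=4, modulus=10):
    """Return (dictionary, parse) for `text`. Illustrative sketch only."""
    # Hypothetical sentinels: both sort below printable characters.
    start_mark, end_mark = "\x01", "\x00"
    if start_mark in text or end_mark in text:
        raise ValueError("text must not contain the sentinel characters")
    padded = start_mark + text + end_mark * w

    def is_trigger(window):
        # Assumption: CRC32 stands in for the window hash; windows that
        # touch the end padding never count as triggers.
        if end_mark in window:
            return False
        return zlib.crc32(window.encode("utf-8")) % modulus == 0

    phrases = []
    phrase_start = 0
    last = len(padded) - w  # start position of the final (all-padding) window
    for i in range(1, last + 1):
        # A phrase ends at every trigger window and at the final window.
        if i == last or is_trigger(padded[i:i + w]):
            phrases.append(padded[phrase_start:i + w])
            phrase_start = i  # the next phrase begins with this same window

    dictionary = sorted(set(phrases))  # D: distinct phrases in sorted order
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]  # P: phrase ranks in text order
    return dictionary, parse


def reconstruct(dictionary, parse, w):
    """Invert the parse: drop the w-character overlaps, then the sentinels."""
    phrases = [dictionary[r] for r in parse]
    padded = phrases[0] + "".join(phrase[w:] for phrase in phrases[1:])
    return padded[1:len(padded) - w]
```

As a usage example, `dictionary, parse = prefix_free_parse(text, w=2, modulus=3)` followed by `reconstruct(dictionary, parse, 2)` gives back `text`. On repetitive input, repeated phrases are stored once in `dictionary` and referenced many times in `parse`, which is the effect the abstract relies on when it reports that D and P are much smaller than T.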