Prefix-free parsing for building big BWTs

Abstract: High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these...

Bibliographic Details
Main Authors:	Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, Taher Mun
Format:	Article
Language:	English
Published:	BMC 2019-05-01
Series:	Algorithms for Molecular Biology
Subjects:	Burrows-Wheeler Transform; Prefix-free parsing; Compression-aware algorithms; Genomic databases
Online Access:	http://link.springer.com/article/10.1186/s13015-019-0148-5

id	doaj-0740dbfd57f5443c975843203a7b99e9
record_format	Article
spelling	doaj-0740dbfd57f5443c975843203a7b99e9 | 2020-11-25T03:21:55Z | eng | BMC | Algorithms for Molecular Biology | 1748-7188 | 2019-05-01 | 141115 | 10.1186/s13015-019-0148-5 | Prefix-free parsing for building big BWTs | Christina Boucher (CISE, University of Florida); Travis Gagie (EIT, Diego Portales University); Alan Kuhnle (CISE, University of Florida); Ben Langmead (Johns Hopkins University); Giovanni Manzini (University of Eastern Piedmont); Taher Mun (Johns Hopkins University) | Abstract: High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T, with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory. | http://link.springer.com/article/10.1186/s13015-019-0148-5 | Burrows-Wheeler Transform; Prefix-free parsing; Compression-aware algorithms; Genomic databases
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, Taher Mun
spellingShingle	Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, Taher Mun | Prefix-free parsing for building big BWTs | Algorithms for Molecular Biology | Burrows-Wheeler Transform; Prefix-free parsing; Compression-aware algorithms; Genomic databases
author_facet	Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, Taher Mun
author_sort	Christina Boucher
title	Prefix-free parsing for building big BWTs
title_short	Prefix-free parsing for building big BWTs
title_full	Prefix-free parsing for building big BWTs
title_fullStr	Prefix-free parsing for building big BWTs
title_full_unstemmed	Prefix-free parsing for building big BWTs
title_sort	prefix-free parsing for building big bwts
publisher	BMC
series	Algorithms for Molecular Biology
issn	1748-7188
publishDate	2019-05-01
description	Abstract: High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T, with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
topic	Burrows-Wheeler Transform; Prefix-free parsing; Compression-aware algorithms; Genomic databases
url	http://link.springer.com/article/10.1186/s13015-019-0148-5
work_keys_str_mv	AT christinaboucher prefixfreeparsingforbuildingbigbwts AT travisgagie prefixfreeparsingforbuildingbigbwts AT alankuhnle prefixfreeparsingforbuildingbigbwts AT benlangmead prefixfreeparsingforbuildingbigbwts AT giovannimanzini prefixfreeparsingforbuildingbigbwts AT tahermun prefixfreeparsingforbuildingbigbwts
_version_	1724612423629930496

Similar Items