Linear time minimum segmentation enables scalable founder reconstruction


Bibliographic Details
Main Authors: Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen
Format: Article
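Language: English
Published: BMC 2019-05-01
Series: Algorithms for Molecular Biology
Subjects: Pan-genome indexing; Founder reconstruction; Dynamic programming; Positional Burrows–Wheeler transform; Range minimum query
Online Access: http://link.springer.com/article/10.1186/s13015-019-0147-6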
id doaj-f184131ea832445884b76629a936afa6
record_format Article
spelling doaj-f184131ea832445884b76629a936afa6; 2020-11-25T02:25:47Z; eng; BMC; Algorithms for Molecular Biology; ISSN 1748-7188; 2019-05-01; vol. 14, no. 1, pp. 1-15; doi:10.1186/s13015-019-0147-6
Linear time minimum segmentation enables scalable founder reconstruction
Tuukka Norri (Department of Computer Science, University of Helsinki)
Bastien Cazaux (Department of Computer Science, University of Helsinki)
Dmitry Kosolobov (Ural Federal University)
Veli Mäkinen (Department of Computer Science, University of Helsinki)
http://link.springer.com/article/10.1186/s13015-019-0147-6
Pan-genome indexing; Founder reconstruction; Dynamic programming; Positional Burrows–Wheeler transform; Range minimum query
collection DOAJ
language English
format Article
sources DOAJ
author Tuukka Norri
Bastien Cazaux
Dmitry Kosolobov
Veli Mäkinen
title Linear time minimum segmentation enables scalable founder reconstruction
publisher BMC
series Algorithms for Molecular Biology
issn 1748-7188
publishDate 2019-05-01
description Abstract. Background: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold L and a set $$\mathcal{R} = \{R_1, \ldots, R_m\}$$ of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1, n] into a set P of disjoint segments such that each segment $$[a,b] \in P$$ has length at least L and the maximum, over $$[a,b] \in P$$, of the number $$d(a,b) = |\{R_i[a,b] : 1 \le i \le m\}|$$ of distinct substrings at segment [a, b] is minimized. The distinct substrings in the segments represent founder blocks that can be concatenated to form $$\max\{d(a,b) : [a,b] \in P\}$$ founder sequences representing the original $$\mathcal{R}$$ such that crossovers happen only at segment boundaries. Results: We give an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier $$O(mn^2)$$ solution. Conclusions: Our improvement makes it possible to apply the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence of its practicality. The implementation is available at https://github.com/tsnorri/founder-sequences.
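
The problem statement in the abstract can be illustrated with a small sketch. The code below is a naive dynamic program over segment end positions, written only to make the definitions of d(a,b), the threshold L, and the founder blocks concrete. It is not the O(mn) algorithm of the paper and it is not taken from the founder-sequences repository. The function names (`distinct_count`, `min_segmentation`, `build_founders`), the 0-based inclusive column indexing, and the rule for padding segments that have fewer blocks than the maximum are all assumptions made for this example.

```python
from typing import List, Tuple


def distinct_count(rows: List[str], a: int, b: int) -> int:
    """d(a, b): number of distinct substrings in columns a..b (0-based, inclusive)."""
    return len({r[a:b + 1] for r in rows})


def min_segmentation(rows: List[str], L: int) -> Tuple[int, List[Tuple[int, int]]]:
    """Naive DP (illustrative only, far slower than linear time).

    Returns the minimised value of max d(a, b) and one optimal list of
    segments, each of length at least L, that together cover all columns.
    """
    n = len(rows[0])
    if any(len(r) != n for r in rows):
        raise ValueError("all sequences must have the same length")
    if not 1 <= L <= n:
        raise ValueError("L must satisfy 1 <= L <= n")

    INF = float("inf")
    best = [INF] * (n + 1)   # best[j]: optimum for the first j columns
    prev = [-1] * (n + 1)    # prev[j]: start column of the last segment
    best[0] = 0
    for j in range(L, n + 1):
        # The last segment covers columns i..j-1, so its length is j - i >= L.
        for i in range(0, j - L + 1):
            if best[i] == INF:
                continue
            cost = max(best[i], distinct_count(rows, i, j - 1))
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Walk the back-pointers to recover the segments.
    segments = []
    j = n
    while j > 0:
        i = prev[j]
        segments.append((i, j - 1))
        j = i
    segments.reverse()
    return best[n], segments


def build_founders(rows: List[str], segments: List[Tuple[int, int]]) -> List[str]:
    """Concatenate founder blocks segment by segment.

    Segments with fewer distinct blocks than the maximum reuse their blocks
    cyclically; this padding rule is an assumption of this sketch.
    """
    blocks = [list(dict.fromkeys(r[a:b + 1] for r in rows)) for a, b in segments]
    width = max(len(bl) for bl in blocks)
    return ["".join(bl[k % len(bl)] for bl in blocks) for k in range(width)]


example_rows = ["AAAA", "AABB", "BBBB", "BBAA"]
example_value, example_segments = min_segmentation(example_rows, 2)
example_founders = build_founders(example_rows, example_segments)
print(example_value, example_segments, example_founders)
```

On this toy input with L = 2, a single segment would need four founders, while splitting into two segments of two columns each needs only two, because each half contains just the blocks "AA" and "BB". Every input row can then be spelled by switching between the two founders at the segment boundary.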
topic Pan-genome indexing
Founder reconstruction
Dynamic programming
Positional Burrows–Wheeler transform
Range minimum query
url http://link.springer.com/article/10.1186/s13015-019-0147-6
_version_ 1724850213601935360