Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT

The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and...

Bibliographic Details
Main Authors:	Bafna, V. (Author), Balaban, M. (Author), Mirarab, S. (Author), Rachtman, E. (Author), Sarmashghi, S. (Author), Touri, B. (Author)
Format:	Article
Language:	English
Published:	Public Library of Science 2021
Subjects:	algorithm Algorithms animal Animals article biological model biology classification Computational Biology computer simulation Computer Simulation Databases, Genetic genetic database genetics genome Genome genomics Genomics human Humans invertebrate Invertebrates least square analysis Least-Squares Analysis Linear Models mammal Mammals Models, Genetic nucleotide repeat phylogeny Phylogeny plant Plants Repetitive Sequences, Nucleic Acid sequence analysis simulation software Software statistical model system analysis theoretical study vertebrate Vertebrates
Online Access:	View Fulltext in Publisher


LEADER	03926nam a2200721Ia 4500
001	10.1371-journal.pcbi.1009449
008	220427s2021 CNT 000 0 und d
020			\|a 1553734X (ISSN)
245	1	0	\|a Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT
260		0	\|b Public Library of Science \|c 2021
856			\|z View Fulltext in Publisher \|u https://doi.org/10.1371/journal.pcbi.1009449
520	3		\|a The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had < 1.5% error in length estimation compared to 34% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git. © 2021 Sarmashghi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
650	0	4	\|a algorithm
650	0	4	\|a Algorithms
650	0	4	\|a animal
650	0	4	\|a Animals
650	0	4	\|a article
650	0	4	\|a biological model
650	0	4	\|a biology
650	0	4	\|a classification
650	0	4	\|a Computational Biology
650	0	4	\|a computer simulation
650	0	4	\|a Computer Simulation
650	0	4	\|a Databases, Genetic
650	0	4	\|a genetic database
650	0	4	\|a genetics
650	0	4	\|a genome
650	0	4	\|a Genome
650	0	4	\|a genomics
650	0	4	\|a Genomics
650	0	4	\|a human
650	0	4	\|a Humans
650	0	4	\|a invertebrate
650	0	4	\|a Invertebrates
650	0	4	\|a least square analysis
650	0	4	\|a Least-Squares Analysis
650	0	4	\|a Linear Models
650	0	4	\|a mammal
650	0	4	\|a Mammals
650	0	4	\|a Models, Genetic
650	0	4	\|a nucleotide repeat
650	0	4	\|a phylogeny
650	0	4	\|a Phylogeny
650	0	4	\|a plant
650	0	4	\|a Plants
650	0	4	\|a Repetitive Sequences, Nucleic Acid
650	0	4	\|a sequence analysis
650	0	4	\|a simulation
650	0	4	\|a software
650	0	4	\|a Software
650	0	4	\|a statistical model
650	0	4	\|a system analysis
650	0	4	\|a theoretical study
650	0	4	\|a vertebrate
650	0	4	\|a Vertebrates
700	1		\|a Bafna, V. \|e author
700	1		\|a Balaban, M. \|e author
700	1		\|a Mirarab, S. \|e author
700	1		\|a Rachtman, E. \|e author
700	1		\|a Sarmashghi, S. \|e author
700	1		\|a Touri, B. \|e author
773			\|t PLoS Computational Biology
