Mash Screen: high-throughput sequence containment estimation for genome discovery

Abstract The MinHash algorithm has proven effective for rapidly estimating the resemblance of two genomes or metagenomes. However, this method cannot reliably estimate the containment of a genome within a metagenome. Here, we describe an online algorithm capable of measuring the containment of genom...


Bibliographic Details
Main Authors: Brian D. Ondov, Gabriel J. Starrett, Anna Sappington, Aleksandra Kostic, Sergey Koren, Christopher B. Buck, Adam M. Phillippy
Format: Article
Language: English
Published: BMC 2019-11-01
Series: Genome Biology
Subjects: MinHash, Metagenomics, Sequencing, SRA, Viral Discovery, Polyomavirus
Online Access: http://link.springer.com/article/10.1186/s13059-019-1841-x
id doaj-71114a92c58f48dcb8444e3159132ef3
record_format Article
spelling doaj-71114a92c58f48dcb8444e3159132ef3
2020-11-25T04:07:37Z
eng
BMC
Genome Biology
1474-760X
2019-11-01
Volume 20, issue 1, pages 1-13
10.1186/s13059-019-1841-x
Brian D. Ondov (Genome Informatics section, National Human Genome Research Institute)
Gabriel J. Starrett (Tumor Virus Molecular Biology section, National Cancer Institute)
Anna Sappington (Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology)
Aleksandra Kostic (Department of Computer Science, Princeton University)
Sergey Koren (Genome Informatics section, National Human Genome Research Institute)
Christopher B. Buck (Tumor Virus Molecular Biology section, National Cancer Institute)
Adam M. Phillippy (Genome Informatics section, National Human Genome Research Institute)
collection DOAJ
language English
format Article
sources DOAJ
author Brian D. Ondov
Gabriel J. Starrett
Anna Sappington
Aleksandra Kostic
Sergey Koren
Christopher B. Buck
Adam M. Phillippy
title Mash Screen: high-throughput sequence containment estimation for genome discovery
publisher BMC
series Genome Biology
issn 1474-760X
publishDate 2019-11-01
description Abstract The MinHash algorithm has proven effective for rapidly estimating the resemblance of two genomes or metagenomes. However, this method cannot reliably estimate the containment of a genome within a metagenome. Here, we describe an online algorithm capable of measuring the containment of genomes and proteomes within either assembled or unassembled sequencing read sets. We describe several use cases, including contamination screening and retrospective analysis of metagenomes for novel genome discovery. Using this tool, we provide containment estimates for every NCBI RefSeq genome within every SRA metagenome and demonstrate the identification of a novel polyomavirus species from a public metagenome.
topic MinHash
Metagenomics
Sequencing
SRA
Viral Discovery
Polyomavirus
url http://link.springer.com/article/10.1186/s13059-019-1841-x