Comparing de novo genome assembly: the long and short of it.

Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby inundating the field with a plethora of new formalizations, algorithms, heuristics and implemen...

Bibliographic Details
Main Authors: Giuseppe Narzisi, Bud Mishra
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2011-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC3084767?pdf=render
citation PLoS ONE 6(4): e19175 (2011)
doi 10.1371/journal.pone.0019175
collection DOAJ
issn 1932-6203
description Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted, standardized method for comparison exists yet. Even worse, widely used metrics for comparing assembled sequences emphasize only size and capture contig quality and accuracy poorly. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) and a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-off between contigs' quality and their sizes. For this purpose, most of the major publicly available sequence assemblers, for both low-coverage long-read (Sanger) and high-coverage short-read (Illumina) technologies, are compared. These assemblers are applied to microbial genome sequences (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various lengths, coverages, and accuracies, with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing "next-generation" assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.