Coverage-Versus-Length Plots, a Simple Quality Control Step for de Novo Yeast Genome Sequence Assemblies

Illumina sequencing has revolutionized yeast genomics, with prices for commercial draft genome sequencing now below $200. The popular SPAdes assembler makes it simple to generate a de novo genome assembly for any yeast species. However, whereas making genome assemblies has become routine, understand...


Bibliographic Details
Main Authors: Alexander P. Douglass, Caoimhe E. O’Brien, Benjamin Offei, Aisling Y. Coughlan, Raúl A. Ortiz-Merino, Geraldine Butler, Kevin P. Byrne, Kenneth H. Wolfe
Format: Article
Language: English
Published: Oxford University Press 2019-03-01
Series: G3: Genes, Genomes, Genetics
Subjects: genomics, genome assembly, bioinformatics, yeast
Online Access: http://g3journal.org/lookup/doi/10.1534/g3.118.200745
id doaj-3b5f2e3a66aa45c482843f72c84a61d2
record_format Article
spelling doaj-3b5f2e3a66aa45c482843f72c84a61d2; 2021-07-02T12:24:16Z; eng; Oxford University Press; G3: Genes, Genomes, Genetics; ISSN 2160-1836; 2019-03-01; volume 9, issue 3, pages 879-887; doi:10.1534/g3.118.200745
collection DOAJ
language English
format Article
sources DOAJ
author Alexander P. Douglass
Caoimhe E. O’Brien
Benjamin Offei
Aisling Y. Coughlan
Raúl A. Ortiz-Merino
Geraldine Butler
Kevin P. Byrne
Kenneth H. Wolfe
title Coverage-Versus-Length Plots, a Simple Quality Control Step for de Novo Yeast Genome Sequence Assemblies
publisher Oxford University Press
series G3: Genes, Genomes, Genetics
issn 2160-1836
publishDate 2019-03-01
description Illumina sequencing has revolutionized yeast genomics, with prices for commercial draft genome sequencing now below $200. The popular SPAdes assembler makes it simple to generate a de novo genome assembly for any yeast species. However, whereas making genome assemblies has become routine, understanding what they contain is still challenging. Here, we show how graphing the information that SPAdes provides about the length and coverage of each scaffold can be used to investigate the nature of an assembly, and to diagnose possible problems. Scaffolds derived from mitochondrial DNA, ribosomal DNA, and yeast plasmids can be identified by their high coverage. Contaminating data, such as cross-contamination from other samples in a multiplex sequencing run, can be identified by its low coverage. Scaffolds derived from the bacteriophage PhiX174 and Lambda DNAs that are frequently used as molecular standards in Illumina protocols can also be detected. Assemblies of yeast genomes with high heterozygosity, such as interspecies hybrids, often contain two types of scaffold: regions of the genome where the two alleles assembled into two separate scaffolds and each has a coverage level C, and regions where the two alleles co-assembled (collapsed) into a single scaffold that has a coverage level 2C. Visualizing the data with Coverage-vs.-Length (CVL) plots, which can be done using Microsoft Excel or Google Sheets, provides a simple method to understand the structure of a genome assembly and detect aberrant scaffolds or contigs. We provide a Python script that allows assemblies to be filtered to remove contaminants identified in CVL plots.
topic genomics
genome assembly
bioinformatics
yeast
url http://g3journal.org/lookup/doi/10.1534/g3.118.200745
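
The description above says that the length and coverage SPAdes reports for each scaffold can be graphed to spot high-coverage scaffolds (mitochondrial DNA, rDNA, plasmids, PhiX174 or Lambda standards) and low-coverage contaminants. The following is a minimal sketch of that idea, not the authors' published script. It assumes the usual SPAdes header pattern `NODE_<n>_length_<bp>_cov_<x>`; the function names, the example headers, and the 0.25x and 4x cutoffs are hypothetical choices made for illustration only.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple

# Assumed SPAdes FASTA header pattern, e.g. ">NODE_1_length_900000_cov_40.0"
HEADER_RE = re.compile(r"^>(NODE_\d+)_length_(\d+)_cov_(\d+(?:\.\d+)?)")


class Scaffold(NamedTuple):
    name: str
    length: int
    coverage: float


def parse_headers(lines: Iterable[str]) -> Iterator[Scaffold]:
    """Yield one Scaffold per FASTA header line that matches the pattern."""
    for line in lines:
        match = HEADER_RE.match(line.strip())
        if match:
            yield Scaffold(match.group(1), int(match.group(2)), float(match.group(3)))


def weighted_median_coverage(scaffolds: List[Scaffold]) -> float:
    """Length-weighted median coverage, used here as the baseline level C."""
    if not scaffolds:
        raise ValueError("no scaffolds to summarize")
    ordered = sorted(scaffolds, key=lambda s: s.coverage)
    half = sum(s.length for s in ordered) / 2
    running = 0
    for s in ordered:
        running += s.length
        if running >= half:
            return s.coverage
    return ordered[-1].coverage


def classify(s: Scaffold, baseline: float,
             low_factor: float = 0.25, high_factor: float = 4.0) -> str:
    """Label a scaffold relative to the baseline (cutoffs are illustrative)."""
    if s.coverage < low_factor * baseline:
        return "low"      # candidate contaminant
    if s.coverage > high_factor * baseline:
        return "high"     # candidate mtDNA, rDNA, plasmid, or phage standard
    return "typical"


def to_tsv(scaffolds: Iterable[Scaffold]) -> str:
    """Tab-separated table that can be pasted into a spreadsheet for plotting."""
    rows = ["name\tlength\tcoverage"]
    rows += [f"{s.name}\t{s.length}\t{s.coverage}" for s in scaffolds]
    return "\n".join(rows)


# Made-up headers for demonstration; real input would be a scaffolds FASTA file.
EXAMPLE_HEADERS = """\
>NODE_1_length_900000_cov_40.0
>NODE_2_length_800000_cov_42.5
>NODE_3_length_80000_cov_900.0
>NODE_4_length_5386_cov_3000.0
>NODE_5_length_1200_cov_1.5
"""

if __name__ == "__main__":
    scaffolds = list(parse_headers(EXAMPLE_HEADERS.splitlines()))
    baseline = weighted_median_coverage(scaffolds)
    for s in scaffolds:
        print(s.name, s.length, s.coverage, classify(s, baseline))
```

The table from `to_tsv` can be plotted as a scatter chart of coverage against length, ideally with logarithmic axes. A fixed multiple of the baseline is a crude rule: in a highly heterozygous assembly, scaffolds at C and 2C are both legitimate nuclear sequence, so cutoffs should be chosen after inspecting the plot.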