Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case

Abstract Background Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information contained in the chloroplast genome is widely used in agriculture and in studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast g...


Bibliographic Details
Main Authors: Weiwen Wang, Miriam Schalamun, Alejandro Morales-Suarez, David Kainer, Benjamin Schwessinger, Robert Lanfear
Format: Article
Language: English
Published: BMC 2018-12-01
Series: BMC Genomics
Online Access: http://link.springer.com/article/10.1186/s12864-018-5348-8
id doaj-85af2218d9344018b0173234de179862
record_format Article
spelling doaj-85af2218d9344018b0173234de179862
2020-11-25T00:14:40Z
eng
BMC
BMC Genomics
1471-2164
2018-12-01
19
1
1
15
10.1186/s12864-018-5348-8
Weiwen Wang (Research School of Biology, Australian National University)
Miriam Schalamun (Research School of Biology, Australian National University)
Alejandro Morales-Suarez (Department of Biological Sciences, Macquarie University)
David Kainer (Research School of Biology, Australian National University)
Benjamin Schwessinger (Research School of Biology, Australian National University)
Robert Lanfear (Research School of Biology, Australian National University)
collection DOAJ
language English
format Article
sources DOAJ
author Weiwen Wang
Miriam Schalamun
Alejandro Morales-Suarez
David Kainer
Benjamin Schwessinger
Robert Lanfear
title Assembly of chloroplast genomes with long- and short-read data: a comparison of approaches using Eucalyptus pauciflora as a test case
publisher BMC
series BMC Genomics
issn 1471-2164
publishDate 2018-12-01
description Abstract
Background: Chloroplasts are organelles that conduct photosynthesis in plant and algal cells. The information contained in the chloroplast genome is widely used in agriculture and in studies of evolution and ecology. Correctly assembling chloroplast genomes can be challenging because the chloroplast genome contains a pair of long inverted repeats (10–30 kb). Typically, it is simply assumed that the gross structure of the chloroplast genome matches the most commonly observed structure: two single-copy regions separated by a pair of inverted repeats. The advent of long-read sequencing technologies should remove the need for this assumption by providing sufficient information to span the inverted repeat regions completely. Yet long reads tend to have higher error rates than short reads, and relatively little is known about the best way to combine long and short reads to obtain the most accurate chloroplast genome assemblies. Using Eucalyptus pauciflora, the snow gum, as a test case, we evaluated the effect of multiple parameters, including the coverage of long (Oxford Nanopore) and short (Illumina) reads, long-read length, and assembly pipeline, with a view to determining the most accurate and efficient approach to chloroplast genome assembly.
Results: Hybrid assemblies combining at least 20x coverage of both long and short reads generated a single contig spanning the entire chloroplast genome with few or no detectable errors. Short-read-only assemblies generated three contigs (the long single-copy, short single-copy and inverted repeat regions) of the chloroplast genome. These contigs contained few single-base errors but tended to exclude several bases at the beginning or end of each contig. Long-read-only assemblies tended to create multiple contigs with a much higher single-base error rate. The chloroplast genome of Eucalyptus pauciflora is 159,942 bp long and contains 131 genes of known function.
Conclusions: Our results suggest that very accurate assemblies of chloroplast genomes can be achieved using a combination of at least 20x coverage each of long and short reads, provided that the long reads contain at least ~5x coverage of reads longer than the inverted repeat region. We show that further increases in coverage give little or no improvement in accuracy, and that hybrid assemblies are more accurate than long-read-only or short-read-only assemblies.
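The coverage thresholds stated in the conclusions can be checked with simple arithmetic. The sketch below is illustrative only and is not the authors' pipeline: it assumes coverage is estimated as total read bases divided by genome length, the function names are hypothetical, and the inverted repeat length of 26,000 bp and the read-length lists are made-up example values (the abstract gives only the 10–30 kb range).

```python
# Genome length of the E. pauciflora chloroplast, taken from the abstract.
GENOME_BP = 159_942


def coverage(read_lengths, genome_bp=GENOME_BP):
    """Estimate mean depth as total sequenced bases / genome length (assumed definition)."""
    return sum(read_lengths) / genome_bp


def check_thresholds(long_reads, short_reads, ir_bp,
                     min_cov=20.0, min_spanning_cov=5.0):
    """Compare a read set against the 20x / ~5x guidance from the abstract.

    ir_bp is the inverted repeat length; it is NOT given in the abstract,
    so the caller must supply an estimate.
    """
    # Long reads strictly longer than the inverted repeat region.
    spanning = [n for n in long_reads if n > ir_bp]
    report = {
        "long_cov": coverage(long_reads),
        "short_cov": coverage(short_reads),
        "spanning_cov": coverage(spanning),
    }
    report["ok"] = (
        report["long_cov"] >= min_cov
        and report["short_cov"] >= min_cov
        and report["spanning_cov"] >= min_spanning_cov
    )
    return report


# Made-up example data: 40 long reads of 30 kb, 300 of 8 kb, and 25,000 short reads of 150 bp.
example_long = [30_000] * 40 + [8_000] * 300
example_short = [150] * 25_000
example_report = check_thresholds(example_long, example_short, ir_bp=26_000)
print(example_report)
```

A read set can pass the overall 20x check and still fail the spanning check, for example when every long read is shorter than the inverted repeat, which is why the two conditions are tested separately.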
topic Chloroplast genome
Genome assembly
Polishing
Illumina
Long-reads
Nanopore
url http://link.springer.com/article/10.1186/s12864-018-5348-8