Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies

Abstract Background Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembly of highly heterozygous genomes is still problematic when regional heterogeneity is...


Bibliographic Details
Main Authors: Michael J. Roach, Simon A. Schmidt, Anthony R. Borneman
Format: Article
Language: English
Published: BMC 2018-11-01
Series: BMC Bioinformatics
Subjects: Synteny reduction; Redundant contigs; Polymorphic genome
Online Access: http://link.springer.com/article/10.1186/s12859-018-2485-7
id doaj-b9b8fed1bba94d49a913af4db16e2449
record_format Article
spelling doaj-b9b8fed1bba94d49a913af4db16e2449
2020-11-25T01:12:24Z
eng
BMC
BMC Bioinformatics
1471-2105
2018-11-01
19 1 1 10
10.1186/s12859-018-2485-7
Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
Michael J. Roach (0)
Simon A. Schmidt (1)
Anthony R. Borneman (2)
The Australian Wine Research Institute
The Australian Wine Research Institute
The Australian Wine Research Institute
Abstract Background Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembly of highly heterozygous genomes is still problematic when regional heterogeneity is so high that haplotype homology is not recognised during assembly. This results in regional duplication rather than consolidation into allelic variants and can cause issues with downstream analysis, for example variant discovery, or haplotype reconstruction using the diploid assembly with unpaired allelic contigs. Results A new pipeline—Purge Haplotigs—was developed specifically for third-gen sequencing-based assemblies to automate the reassignment of allelic contigs, and to assist in the manual curation of genome assemblies. The pipeline uses a draft haplotype-fused assembly or a diploid assembly, read alignments, and repeat annotations to identify allelic variants in the primary assembly. The pipeline was tested on a simulated dataset and on four recent diploid (phased) de novo assemblies from third-generation long-read sequencing, and compared with a similar tool. After processing with Purge Haplotigs, haploid assemblies were less duplicated with minimal impact on genome completeness, and diploid assemblies had more pairings of allelic contigs. Conclusions Purge Haplotigs improves the haploid and diploid representations of third-gen sequencing based genome assemblies by identifying and reassigning allelic contigs. The implementation is fast and scales well with large genomes, and it is less likely to over-purge repetitive or paralogous elements compared to alignment-only based methods. The software is available at https://bitbucket.org/mroachawri/purge_haplotigs under a permissive MIT licence.
http://link.springer.com/article/10.1186/s12859-018-2485-7
Synteny reduction
Redundant contigs
Polymorphic genome
collection DOAJ
language English
format Article
sources DOAJ
author Michael J. Roach
Simon A. Schmidt
Anthony R. Borneman
spellingShingle Michael J. Roach
Simon A. Schmidt
Anthony R. Borneman
Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
BMC Bioinformatics
Synteny reduction
Redundant contigs
Polymorphic genome
author_facet Michael J. Roach
Simon A. Schmidt
Anthony R. Borneman
author_sort Michael J. Roach
title Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
title_short Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
title_full Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
title_fullStr Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
title_full_unstemmed Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
title_sort purge haplotigs: allelic contig reassignment for third-gen diploid genome assemblies
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2018-11-01
description Abstract Background Recent developments in third-gen long read sequencing and diploid-aware assemblers have resulted in the rapid release of numerous reference-quality assemblies for diploid genomes. However, assembly of highly heterozygous genomes is still problematic when regional heterogeneity is so high that haplotype homology is not recognised during assembly. This results in regional duplication rather than consolidation into allelic variants and can cause issues with downstream analysis, for example variant discovery, or haplotype reconstruction using the diploid assembly with unpaired allelic contigs. Results A new pipeline—Purge Haplotigs—was developed specifically for third-gen sequencing-based assemblies to automate the reassignment of allelic contigs, and to assist in the manual curation of genome assemblies. The pipeline uses a draft haplotype-fused assembly or a diploid assembly, read alignments, and repeat annotations to identify allelic variants in the primary assembly. The pipeline was tested on a simulated dataset and on four recent diploid (phased) de novo assemblies from third-generation long-read sequencing, and compared with a similar tool. After processing with Purge Haplotigs, haploid assemblies were less duplicated with minimal impact on genome completeness, and diploid assemblies had more pairings of allelic contigs. Conclusions Purge Haplotigs improves the haploid and diploid representations of third-gen sequencing based genome assemblies by identifying and reassigning allelic contigs. The implementation is fast and scales well with large genomes, and it is less likely to over-purge repetitive or paralogous elements compared to alignment-only based methods. The software is available at https://bitbucket.org/mroachawri/purge_haplotigs under a permissive MIT licence.
topic Synteny reduction
Redundant contigs
Polymorphic genome
url http://link.springer.com/article/10.1186/s12859-018-2485-7
work_keys_str_mv AT michaeljroach purgehaplotigsalleliccontigreassignmentforthirdgendiploidgenomeassemblies
AT simonaschmidt purgehaplotigsalleliccontigreassignmentforthirdgendiploidgenomeassemblies
AT anthonyrborneman purgehaplotigsalleliccontigreassignmentforthirdgendiploidgenomeassemblies
_version_ 1725166567079018496