A hybrid and scalable error correction algorithm for indel and substitution errors of long reads

Abstract Background Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, processing long reads is complicated by their higher error rates (e.g., 13% vs. 1%) and higher...

Bibliographic Details
Main Authors:	Arghya Kusum Das, Sayan Goswami, Kisung Lee, Seung-Jong Park
Format:	Article
Language:	English
Published:	BMC 2019-12-01
Series:	BMC Genomics
Subjects:	Hybrid error correction; PacBio; Illumina; Hadoop; NoSQL
Online Access:	https://doi.org/10.1186/s12864-019-6286-9

id	doaj-2bdd415b4042490b8ed1ff99a04c8e48
record_format	Article
spelling	doaj-2bdd415b4042490b8ed1ff99a04c8e48; 2020-12-20T12:16:02Z; eng; BMC; BMC Genomics; 1471-2164; 2019-12-01; 20S11115; 10.1186/s12864-019-6286-9; A hybrid and scalable error correction algorithm for indel and substitution errors of long reads; Arghya Kusum Das (0), Sayan Goswami (1), Kisung Lee (2), Seung-Jong Park (3); (0) Department of Computer Science and Software Engineering, University of Wisconsin at Platteville; (1) School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge; (2) School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge; (3) School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge; https://doi.org/10.1186/s12864-019-6286-9; Hybrid error correction; PacBio; Illumina; Hadoop; NoSQL
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Arghya Kusum Das Sayan Goswami Kisung Lee Seung-Jong Park
title	A hybrid and scalable error correction algorithm for indel and substitution errors of long reads
publisher	BMC
series	BMC Genomics
issn	1471-2164
publishDate	2019-12-01
description	Abstract Background Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, processing long reads is complicated by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads. Methods In this paper, we present a new hybrid error correction tool called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and uses the k-mer coverage information of high-throughput Illumina short-read sequences to correct PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short-read-based de Bruijn graph. ParLECH then uses the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by majority voting to correct each substitution error. Results ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation demonstrates that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. After correction, more than 92% of the bases of an E. coli PacBio dataset align with the reference genome, which supports its accuracy. Conclusion ParLECH can scale to terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and corrects both indel and substitution errors, whether present in the original long reads or newly introduced by the short reads.
topic	Hybrid error correction; PacBio; Illumina; Hadoop; NoSQL
url	https://doi.org/10.1186/s12864-019-6286-9
_version_	1724376797503553536
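
The Methods part of the abstract describes replacing indel error regions of a long read with the widest path (maximum min-coverage path) in a de Bruijn graph built from short reads. The following is a minimal single-machine sketch of that idea in Python, not ParLECH's distributed implementation. The function names (kmer_counts, build_graph, widest_path, spell) are hypothetical, reverse complements are ignored, and the widest path is found with a standard Dijkstra-style bottleneck search, which is an assumption about how such a path could be computed rather than a description of the authors' code.

```python
from collections import Counter, defaultdict
import heapq


def kmer_counts(reads, k):
    """Count every k-mer in the short reads (its coverage)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_graph(counts):
    """De Bruijn graph: nodes are (k-1)-mers, each k-mer is an edge
    from its prefix to its suffix, weighted by the k-mer coverage."""
    graph = defaultdict(list)
    for kmer, cov in counts.items():
        graph[kmer[:-1]].append((kmer[1:], cov))
    return graph


def widest_path(graph, source, target):
    """Return (bottleneck, path): the path from source to target whose
    minimum edge coverage is as large as possible. Returns (0, []) when
    the target is unreachable."""
    best = {source: float("inf")}   # best bottleneck found per node
    parent = {}
    heap = [(-float("inf"), source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(node, 0):
            continue  # stale heap entry
        if node == target:
            break
        for nxt, cov in graph.get(node, ()):
            w = min(width, cov)
            if w > best.get(nxt, 0):
                best[nxt] = w
                parent[nxt] = node
                heapq.heappush(heap, (-w, nxt))
    if target not in best:
        return 0, []
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return best[target], path


def spell(path):
    """Turn a node path back into a nucleotide sequence."""
    return path[0] + "".join(node[-1] for node in path[1:]) if path else ""


if __name__ == "__main__":
    # Toy data: three clean short reads and one read with an inserted base.
    reads = ["ATGGCGTGCA"] * 3 + ["ATGGACGTGCA"]
    graph = build_graph(kmer_counts(reads, 4))
    width, path = widest_path(graph, "ATG", "GCA")
    print(width, spell(path))
```

In this toy input the branch created by the inserted base has coverage 1, while the clean branch has coverage 3, so the bottleneck search prefers the clean branch. In a hybrid corrector the source and target nodes would be taken from the trusted flanks of a long-read error region; how ParLECH selects them is not stated in this record.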

Similar Items