A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection

Abstract Background The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is...


Bibliographic Details
Main Authors:	WeiBo Wang, Wei Sun, Wei Wang, Jin Szatkiewicz
Format:	Article
Language:	English
Published:	BMC 2018-03-01
Series:	BMC Bioinformatics
Subjects:	Bioinformatic; Computational biology; Next-generation sequencing
Online Access:	http://link.springer.com/article/10.1186/s12859-018-2077-6

id	doaj-4d291dc90efb4698ab2c8426ee0a49c1
record_format	Article
spelling	doaj-4d291dc90efb4698ab2c8426ee0a49c1; 2020-11-24T20:40:18Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2018-03-01; 191111; 10.1186/s12859-018-2077-6; A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection; WeiBo Wang (0), Wei Sun (1), Wei Wang (2), Jin Szatkiewicz (3); (0) Department of Computer Science, University of North Carolina at Chapel Hill; (1) Biostatistics Program, Fred Hutchinson Cancer Research Center; (2) Department of Computer Science, University of California, Los Angeles; (3) Department of Genetics, University of North Carolina at Chapel Hill
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	WeiBo Wang, Wei Sun, Wei Wang, Jin Szatkiewicz
spellingShingle	WeiBo Wang Wei Sun Wei Wang Jin Szatkiewicz A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection BMC Bioinformatics Bioinformatic Computational biology Next-generation sequencing
author_facet	WeiBo Wang, Wei Sun, Wei Wang, Jin Szatkiewicz
author_sort	WeiBo Wang
title	A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection
title_sort	randomized approach to speed up the analysis of large-scale read-count data in the application of cnv detection
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2018-03-01
description	Abstract. Background: The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows, and windowed read-count data are generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability to perform bias correction and signal detection simultaneously. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore suffers from slow running time when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach in detecting copy number variants (CNVs) using a real example. Results: We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs. We named the resulting method “R-GENSENG”. Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG’s accuracy in CNV detection. Conclusions: Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving the analytic power.
topic	Bioinformatic; Computational biology; Next-generation sequencing
url	http://link.springer.com/article/10.1186/s12859-018-2077-6
_version_	1716827530026024960


Similar Items