MAFsnp: A Multi-Sample Accurate and Flexible SNP Caller Using Next-Generation Sequencing Data.

Most existing statistical methods developed for calling single nucleotide polymorphisms (SNPs) using next-generation sequencing (NGS) data are based on Bayesian frameworks, and there does not exist any SNP caller that produces p-values for calling SNPs in a frequentist framework. To fill in this gap...


Bibliographic Details
Main Authors: Jiyuan Hu, Tengfei Li, Zidi Xiu, Hong Zhang
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2015-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC4550471?pdf=render
id doaj-f2957944abee431f9099eb86b035d5b1
record_format Article
spelling doaj-f2957944abee431f9099eb86b035d5b1; 2020-11-24T21:26:33Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2015-01-01; 10(8): e0135332; 10.1371/journal.pone.0135332; MAFsnp: A Multi-Sample Accurate and Flexible SNP Caller Using Next-Generation Sequencing Data.; Jiyuan Hu, Tengfei Li, Zidi Xiu, Hong Zhang
collection DOAJ
language English
format Article
sources DOAJ
author Jiyuan Hu
Tengfei Li
Zidi Xiu
Hong Zhang
title MAFsnp: A Multi-Sample Accurate and Flexible SNP Caller Using Next-Generation Sequencing Data.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2015-01-01
description Most existing statistical methods for calling single nucleotide polymorphisms (SNPs) from next-generation sequencing (NGS) data are based on Bayesian frameworks, and no existing SNP caller produces p-values in a frequentist framework. To fill this gap, we develop a new method, MAFsnp, a Multiple-sample based Accurate and Flexible algorithm for calling SNPs with NGS data. MAFsnp is based on an estimated likelihood ratio test (eLRT) statistic. In practice, the parameter involved is very close to the boundary of the parameter space, so standard large-sample theory is not suitable for approximating the finite-sample distribution of the eLRT statistic. Observing that the distribution of the test statistic is a mixture of a point mass at zero and a continuous part, we propose to model the test statistic with a novel two-parameter mixture distribution. Once the parameters of the mixture distribution are estimated, p-values can be easily calculated for detecting SNPs, and the multiple-testing corrected p-values can be used to control the false discovery rate (FDR) at any pre-specified level. With simulated data, MAFsnp is shown to control the FDR much better than existing SNP callers. In applications to two real datasets, MAFsnp is also shown to outperform existing SNP callers in terms of calling accuracy. An R package "MAFsnp" implementing the new SNP caller is freely available at http://homepage.fudan.edu.cn/zhangh/softwares/.
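The p-value and FDR steps described in the abstract can be illustrated with a minimal Python sketch. The record does not give the form of the authors' two-parameter mixture, so the sketch assumes a stand-in: the statistic is zero with probability `zero_weight` and otherwise follows a chi-square distribution with one degree of freedom multiplied by `scale`. Both parameter names, the chi-square choice, the example statistics, and the use of the Benjamini-Hochberg adjustment are assumptions made for illustration, not the MAFsnp implementation.

```python
import math


def mixture_p_value(t, zero_weight, scale):
    """P(T >= t) when T is 0 with probability zero_weight,
    and otherwise scale * chi-square(1). Hypothetical null model."""
    if t <= 0.0:
        # Every value of the mixture is >= 0, so the p-value is 1.
        return 1.0
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    return (1.0 - zero_weight) * math.erfc(math.sqrt(t / scale / 2.0))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up test statistics for six candidate sites (illustrative only).
stats = [0.0, 0.3, 12.5, 0.0, 25.0, 1.1]
p = [mixture_p_value(t, zero_weight=0.5, scale=1.0) for t in stats]
q = benjamini_hochberg(p)

# Sites whose adjusted p-value is at or below the target FDR level.
calls = [i for i, qi in enumerate(q) if qi <= 0.05]
print(calls)
```

In this sketch the two mixture parameters are fixed by hand; in the method described above they would first be estimated from the data before p-values are computed.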
url http://europepmc.org/articles/PMC4550471?pdf=render