Statistical Models for Next Generation Sequencing Data

Three statistical models are developed to address problems in Next-Generation Sequencing data. The first two models are designed for RNA-Seq data and the third is designed for ChIP-Seq data. The first of the RNA-Seq models uses a Bayesian nonparametric model to detect genes that are differentially...

Bibliographic Details
Main Author: Wang, Yiyi
Other Authors: Dahl, David B.
Format: Others
Language: en
Published: 2013
Subjects:
Online Access: http://hdl.handle.net/1969.1/149412
id ndltd-tamu.edu-oai-repository.tamu.edu-1969.1-149412
record_format oai_dc
spelling ndltd-tamu.edu-oai-repository.tamu.edu-1969.1-149412; 2013-10-05T04:02:12Z; Statistical Models for Next Generation Sequencing Data; Wang, Yiyi; next generation sequencing; Bayesian nonparametrics; Gene Ontology; MCMC; Dahl, David B.; Liang, Faming; Spiegelman, Clifford H.; Hart, Jeffrey D.; Klein, Patricia E.; 2013-10-03T14:44:20Z; 2013-05; 2013-04-01; May 2013; 2013-10-03T14:44:20Z; Thesis; text; application/pdf; http://hdl.handle.net/1969.1/149412; en
collection NDLTD
language en
format Others
sources NDLTD
topic next generation sequencing
Bayesian nonparametrics
Gene Ontology
MCMC
description Three statistical models are developed to address problems in Next-Generation Sequencing data. The first two models are designed for RNA-Seq data and the third is designed for ChIP-Seq data. The first of the RNA-Seq models uses a Bayesian nonparametric model to detect genes that are differentially expressed across treatments. A negative binomial sampling distribution is used for each gene's read count, so that each gene may have its own parameters. Despite the consequent large number of parameters, parsimony is imposed by the clustering inherent in the Bayesian nonparametric framework. A Bayesian discovery procedure is adopted to calculate the probability that each gene is differentially expressed. A simulation study and real data analysis show that this method performs at least as well as leading existing methods in some cases. The second RNA-Seq model shares the framework of the first model, but replaces the usual random partition prior from the Dirichlet process with a random partition prior indexed by distances from Gene Ontology (GO). The use of this external biological information yields improvements in statistical power over the original Bayesian discovery procedure. The third model addresses the problem of identifying protein binding sites in ChIP-Seq data. An exact test via a stochastic approximation is used to test the hypothesis that the treatment effect is independent of the sequence count intensity effect. The sliding window procedure for ChIP-Seq data is followed, and the p-value and the adjusted false discovery rate are calculated for each window. For the sites identified as peak regions, three candidate models are proposed for characterizing the bimodality of the ChIP-Seq data, and the stochastic approximation Monte Carlo (SAMC) method is used to select the best of the three. Real data analysis shows that this method produces results comparable to those of other existing methods and is advantageous in identifying bimodality of the data.
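As a rough illustration of the sliding-window step mentioned in the description, the Python sketch below counts reads in overlapping windows and adjusts a list of per-window p-values for the false discovery rate. It is not the thesis's method: the exact test via stochastic approximation is not reproduced, the Benjamini-Hochberg adjustment is an assumed stand-in for whichever FDR adjustment the author used, and the function names, window width, and step size are hypothetical.

```python
from typing import List, Sequence


def sliding_window_counts(positions: Sequence[int], length: int,
                          width: int, step: int) -> List[int]:
    """Count read positions in each window [start, start + width).

    `positions`, `length`, `width`, and `step` are illustrative names,
    not taken from the thesis.
    """
    counts = []
    for start in range(0, length - width + 1, step):
        end = start + width
        counts.append(sum(1 for p in positions if start <= p < end))
    return counts


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumed stand-in for the FDR adjustment; the thesis may use another.
    """
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In use, one p-value per window would come from whatever test is applied to the window counts, and windows whose adjusted value falls below a chosen threshold would be flagged as candidate peak regions.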
author2 Dahl, David B.
author_facet Dahl, David B.
Wang, Yiyi
author Wang, Yiyi
author_sort Wang, Yiyi
title Statistical Models for Next Generation Sequencing Data
publishDate 2013
url http://hdl.handle.net/1969.1/149412