Statistical analysis of pyrosequence data

Doctor of Philosophy === Department of Statistics === Gary L. Gadbury === Since their commercial introduction in 2005, DNA sequencing technologies have become widely available and are now cost-effective tools for determining the genetic characteristics of organisms. While the biomedical applications...

Bibliographic Details
Main Author: Keating, Karen
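Language: en_US
Published: Kansas State University 2012
Subjects: Statistics; Ecology; Standardization; Gini Index; Pareto distribution; Ecology (0329); Statistics (0463)
Online Access: http://hdl.handle.net/2097/14026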
id ndltd-KSU-oai-krex.k-state.edu-2097-14026
record_format oai_dc
spelling ndltd-KSU-oai-krex.k-state.edu-2097-14026 2017-03-04T03:51:13Z 2012-07-13T14:46:36Z 2012-07-13T14:46:36Z 2012-07-13 2012 August Dissertation
collection NDLTD
language en_US
sources NDLTD
topic Statistics
Ecology
Standardization
Gini Index
Pareto distribution
Ecology (0329)
Statistics (0463)
description Doctor of Philosophy === Department of Statistics === Gary L. Gadbury === Since their commercial introduction in 2005, DNA sequencing technologies have become widely available and are now cost-effective tools for determining the genetic characteristics of organisms. While the biomedical applications of DNA sequencing are apparent, these technologies have been applied to many other research areas. One such area is community ecology, in which DNA sequence data are used to identify the presence and abundance of microscopic organisms that inhabit an environment. This is currently an active area of research, since it is generally believed that a change in the composition of microscopic species in a geographic area may signal a change in the overall health of the environment. An overview of DNA pyrosequencing, as implemented by the Roche/Life Science 454 platform, is presented, and aspects of the process that can introduce variability in data are identified. Four ecological data sets that were generated by the 454 platform are used for illustration. Characteristics of these data include high dimensionality, a large proportion of zeros (usually in excess of 90%), and nonzero values that are strongly right-skewed. A nonparametric method to standardize these data is presented, and the effects of standardization on outliers and skewness are examined. Traditional statistical methods for analyzing macroscopic species abundance data are discussed, and the applicability of these methods to microscopic species data is examined. One objective that receives focus is the classification of microscopic species as either rare or common species. This is an important distinction since there is much evidence to suggest that the biological and environmental mechanisms that govern common species are distinctly different from the mechanisms that govern rare species. This indicates that the abundance patterns for common and rare species may follow different probability models, and the suitability of the Pareto distribution for rare species is examined. Techniques for classifying macroscopic species are shown to be ill-suited for microscopic species, and an alternative technique is presented. Recognizing that the structure of the data is similar to that of financial applications (such as insurance claims and the distribution of wealth), the Gini index and other statistics based on the Lorenz curve are explored as potential test statistics for distinguishing rare from common species.
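For readers unfamiliar with the statistics named in the abstract, the following is a minimal sketch of how a Lorenz curve and Gini index can be computed from a vector of species abundance counts. It is not taken from the dissertation: the function names `lorenz_curve` and `gini_index`, the trapezoid-rule definition, and the sample counts are all assumptions chosen for illustration, and the dissertation's own test statistics may be defined differently.

```python
# Illustrative sketch only; names and definitions are assumptions,
# not the dissertation's implementation.

def lorenz_curve(counts):
    """Return (population shares, cumulative abundance shares), both starting at 0."""
    x = sorted(float(c) for c in counts)
    total = sum(x)
    if not x or total <= 0 or x[0] < 0:
        raise ValueError("need non-negative counts with a positive total")
    n = len(x)
    # Share of species considered, from 0 to 1 in steps of 1/n.
    pop_share = [i / n for i in range(n + 1)]
    # Share of total abundance held by the i least abundant species.
    cum_share = [0.0]
    running = 0.0
    for value in x:
        running += value
        cum_share.append(running / total)
    return pop_share, cum_share


def gini_index(counts):
    """Gini index as 1 minus twice the trapezoid-rule area under the Lorenz curve."""
    p, l = lorenz_curve(counts)
    area = sum((l[i] + l[i + 1]) * (p[i + 1] - p[i]) / 2.0
               for i in range(len(p) - 1))
    return 1.0 - 2.0 * area


if __name__ == "__main__":
    # Hypothetical abundance counts: mostly zeros with a right-skewed tail.
    sample = [0, 0, 0, 0, 0, 0, 1, 2, 5, 40]
    print(gini_index(sample))
```

Under this definition, perfectly even counts give a value of 0, and a sample in which one species holds all of the abundance gives (n - 1)/n for n species, so sparse, strongly right-skewed data of the kind described in the abstract would yield values close to 1.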
author Keating, Karen
title Statistical analysis of pyrosequence data
publisher Kansas State University
publishDate 2012
url http://hdl.handle.net/2097/14026