Statistical issues in the analysis of Illumina data

Abstract. Background: Illumina bead-based arrays are becoming increasingly popular due to their high degree of replication and reported high data quality. However, little attention has been paid to the pre-processing of Illumina data. In this paper, we pr...


Bibliographic Details
Main Authors: Tavaré, Simon; Lynch, Andy G; Barbosa-Morais, Nuno L; Dunning, Mark J; Ritchie, Matthew E
Format: Article
Language: English
Published: BMC 2008-02-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/9/85
ISSN: 1471-2105
DOI: 10.1186/1471-2105-9-85
Source: DOAJ

Abstract

Background: Illumina bead-based arrays are becoming increasingly popular due to their high degree of replication and reported high data quality. However, little attention has been paid to the pre-processing of Illumina data. In this paper, we present our experience of analysing the raw data from an Illumina spike-in experiment and offer guidelines for those wishing to analyse expression data or develop new methodologies for this technology.

Results: We find that the local background estimated by Illumina is consistently low, and subtracting this background is beneficial for detecting differential expression (DE). Illumina's summary method performs well at removing outliers, producing estimates which are less biased and are less variable than other robust summary methods. However, quality assessment on summarised data may miss spatial artefacts present in the raw data. Also, we find that the background normalisation method used in Illumina's proprietary software (BeadStudio) can cause problems with a standard DE analysis. We demonstrate that variances calculated from the raw data can be used as inverse weights in the DE analysis to improve power. Finally, variability in both expression levels and DE statistics can be attributed to differences in probe composition. These differences are not accounted for by current analysis methods and require further investigation.

Conclusion: Analysing Illumina expression data using BeadStudio is reasonable because of the conservative estimates of summary values produced by the software. Improvements can however be made by not using background normalisation. Access to the raw data allows for a more detailed quality assessment and flexible analyses. In the case of a gene expression study, data can be analysed on an appropriate scale using established tools. Similar improvements can be expected for other Illumina assays.
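
The abstract describes two steps that a short sketch may make more concrete: summarising replicate beads after removing outliers, and using variances from the raw data as inverse weights in a DE comparison. The Python below is an illustration only and is not the authors' implementation. The function names, the cutoff of 3 scaled MADs, the 1.4826 scaling constant, and the simple z-like statistic are all assumptions made for this example; intensities are assumed to have been background-corrected and transformed to a suitable scale beforehand.

```python
import math
import statistics


def summarise_beads(intensities, mad_cutoff=3.0):
    """Summarise replicate bead intensities for one probe on one array.

    Beads further than `mad_cutoff` scaled MADs from the median are
    dropped (cutoff and scaling constant are assumptions of this sketch).
    Returns (mean, variance of the mean, number of beads kept).
    """
    values = list(intensities)
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    scaled_mad = 1.4826 * mad  # makes the MAD comparable to a normal SD
    kept = [x for x in values if abs(x - med) <= mad_cutoff * scaled_mad]
    if len(kept) < 2:
        raise ValueError("need at least two beads after outlier removal")
    mean = statistics.fmean(kept)
    var_of_mean = statistics.variance(kept) / len(kept)
    return mean, var_of_mean, len(kept)


def weighted_mean(values, variances):
    """Inverse-variance weighted mean and the variance of that mean."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return mean, 1.0 / total


def weighted_difference(group_a, group_b):
    """Compare two groups of arrays for one probe.

    Each group is a list of (summary value, variance) pairs, one per
    array. Returns the weighted difference and a z-like statistic.
    """
    mean_a, var_a = weighted_mean(*zip(*group_a))
    mean_b, var_b = weighted_mean(*zip(*group_b))
    diff = mean_a - mean_b
    return diff, diff / math.sqrt(var_a + var_b)
```

In this sketch, an array whose bead-level variance is large contributes less to the group mean, which is the sense in which raw-data variances act as inverse weights. A real analysis would typically use an established package for linear modelling rather than this hand-rolled statistic.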