Super-sparse principal component analyses for high-throughput genomic data
<p>Abstract</p> <p>Background</p> <p>Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values also reflect poor estimation of the true loading vectors; for gene expression data, for example, we biologically expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. Sparse PCA methods have recently been introduced to reduce the number of nonzero coefficients, but the existing methods are not satisfactory for high-dimensional applications because they still yield too many nonzero coefficients.</p> <p>Results</p> <p>Here we propose a new PCA method that uses two innovations to produce an extremely sparse loading vector: (i) a random-effect model on the loadings that leads to an unbounded penalty at the origin, and (ii) shrinkage of the singular values obtained from the singular value decomposition of the data matrix. We develop a stable computing algorithm by modifying the nonlinear iterative partial least squares (NIPALS) algorithm, and illustrate the method with an analysis of the NCI cancer dataset, which contains 21,225 genes.</p> <p>Conclusions</p> <p>The new method outperforms several existing methods, particularly in the estimation of the loading vectors.</p>
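The Results paragraph describes a NIPALS-style alternating algorithm with a sparsity penalty on the loadings. The sketch below illustrates that general idea only, under stated assumptions: plain soft-thresholding stands in for the paper's random-effect penalty (which is unbounded at the origin and shrinks harder), the singular-value shrinkage step is omitted, and the function name and `threshold` parameter are illustrative, not from the paper.

```python
import numpy as np

def sparse_pca_nipals(X, n_components=1, threshold=0.1, max_iter=200, tol=1e-8):
    """Rank-one-at-a-time sparse PCA via a NIPALS-style alternating iteration.

    Soft-thresholding of the loading vector is a generic stand-in for the
    paper's random-effect penalty; deflation extracts successive components.
    """
    X = np.asarray(X, dtype=float).copy()
    n, p = X.shape
    scores, loadings = [], []
    for _ in range(n_components):
        # Initialize the loading vector at the column with the largest variance.
        v = np.zeros(p)
        v[np.argmax(X.var(axis=0))] = 1.0
        for _ in range(max_iter):
            t = X @ v                       # update scores given loadings
            v_new = X.T @ t                 # update loadings given scores
            # Soft-threshold so that many loadings become exactly zero.
            cut = threshold * np.abs(v_new).max()
            v_new = np.sign(v_new) * np.maximum(np.abs(v_new) - cut, 0.0)
            nrm = np.linalg.norm(v_new)
            if nrm == 0:
                break
            v_new /= nrm                    # keep the loading vector unit-norm
            if np.linalg.norm(v_new - v) < tol:
                v = v_new
                break
            v = v_new
        t = X @ v
        X -= np.outer(t, v)                 # rank-one deflation
        scores.append(t)
        loadings.append(v)
    return np.array(scores).T, np.array(loadings).T
```

On data with a few informative variables, the thresholding drives the loadings of pure-noise variables to exactly zero, which is the interpretability gain the abstract argues for.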
Main Authors: | Lee Youngjo, Lee Woojoo, Lee Donghwan, Pawitan Yudi |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2010-06-01 |
Series: | BMC Bioinformatics |
Online Access: | http://www.biomedcentral.com/1471-2105/11/296 |
id | doaj-4d6bc81b0024436caa632f802afefccb |
doi | 10.1186/1471-2105-11-296 |
collection | DOAJ |
author | Lee Youngjo; Lee Woojoo; Lee Donghwan; Pawitan Yudi |
issn | 1471-2105 |