Fast R Functions for Robust Correlations and Hierarchical Clustering

Many high-throughput biological data analyses require the calculation of large correlation matrices and/or clustering of a large number of objects. The standard R function for calculating Pearson correlation can handle calculations without missing values efficiently, but is inefficient when applied...

Bibliographic Details
Main Authors: Peter Langfelder, Steve Horvath
Format: Article
Language: English
Published: Foundation for Open Access Statistics 2012-01-01
Series: Journal of Statistical Software
Subjects:
R
Online Access: http://www.jstatsoft.org/v46/i11/paper
id doaj-ffbd5b8da09a4bf9afb7a8dc48f1f52e
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Peter Langfelder
Steve Horvath
publisher Foundation for Open Access Statistics
series Journal of Statistical Software
issn 1548-7660
publishDate 2012-01-01
description Many high-throughput biological data analyses require the calculation of large correlation matrices and/or clustering of a large number of objects. The standard R function for calculating Pearson correlation can handle calculations without missing values efficiently, but is inefficient when applied to data sets with a relatively small number of missing values. We present an implementation of Pearson correlation calculation that can lead to substantial speedup on data with a relatively small number of missing entries. Further, we parallelize all calculations and thus achieve further speedup on systems where parallel processing is available. A robust correlation measure, the biweight midcorrelation, is implemented in a similar manner and provides comparable speed. The functions cor and bicor for fast Pearson and biweight midcorrelation, respectively, are part of the updated, freely available R package WGCNA. The hierarchical clustering algorithm implemented in the R function hclust is an order n^3 (n is the number of clustered objects) version of a publicly available clustering algorithm (Murtagh 2012). We present the package flashClust that implements the original algorithm, which in practice achieves order approximately n^2, leading to substantial time savings when clustering large data sets.
topic Pearson correlation
robust correlation
hierarchical clustering
R
url http://www.jstatsoft.org/v46/i11/paper
_version_ 1725256270732066816
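
The description above names the biweight midcorrelation as a robust alternative to Pearson correlation. As an illustration only, here is a minimal Python sketch of that measure, assuming the commonly used definition: deviations from the median are weighted by (1 - u^2)^2, where u is the deviation divided by 9 times the median absolute deviation (MAD), and points with |u| >= 1 get weight zero. The function names biweight_terms and bicor_sketch are hypothetical. This is not the WGCNA bicor function, which is written for R and additionally handles missing values and parallel execution.

```python
from math import sqrt
from statistics import median


def biweight_terms(values):
    """Return the weighted deviations (x_i - median) * w_i for one variable."""
    med = median(values)
    # Raw (unscaled) median absolute deviation.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Assumption of this sketch: refuse to proceed rather than pick a fallback.
        raise ValueError("MAD is zero; biweight weights are undefined")
    terms = []
    for v in values:
        u = (v - med) / (9.0 * mad)
        # Points far from the median (|u| >= 1) are given zero weight.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        terms.append((v - med) * w)
    return terms


def bicor_sketch(x, y):
    """Biweight midcorrelation of two equal-length sequences without missing values."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    a = biweight_terms(x)
    b = biweight_terms(y)
    numerator = sum(p * q for p, q in zip(a, b))
    denominator = sqrt(sum(p * p for p in a)) * sqrt(sum(q * q for q in b))
    return numerator / denominator


if __name__ == "__main__":
    x = [float(i) for i in range(1, 11)]
    y = x[:-1] + [1000.0]  # same as x except for one extreme outlier
    print(bicor_sketch(x, x))
    print(bicor_sketch(x, y))
```

In the usage example, the single extreme value in y falls outside the 9 x MAD window and receives zero weight, so the result stays close to the correlation of the remaining points rather than being dominated by the outlier.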