Fast R Functions for Robust Correlations and Hierarchical Clustering
Many high-throughput biological data analyses require the calculation of large correlation matrices and/or clustering of a large number of objects. The standard R function for calculating Pearson correlation can handle calculations without missing values efficiently, but is inefficient when applied...
Main Authors: | Peter Langfelder, Steve Horvath |
---|---|
Format: | Article |
Language: | English |
Published: | Foundation for Open Access Statistics, 2012-01-01 |
Series: | Journal of Statistical Software |
Subjects: | Pearson correlation; robust correlation; hierarchical clustering; R |
Online Access: | http://www.jstatsoft.org/v46/i11/paper |
id |
doaj-ffbd5b8da09a4bf9afb7a8dc48f1f52e |
---|---|
record_format |
Article |
spelling |
doaj-ffbd5b8da09a4bf9afb7a8dc48f1f52e 2020-11-25T00:48:24Z eng Foundation for Open Access Statistics Journal of Statistical Software 1548-7660 2012-01-01 46 11 Fast R Functions for Robust Correlations and Hierarchical Clustering Peter Langfelder Steve Horvath Many high-throughput biological data analyses require the calculation of large correlation matrices and/or clustering of a large number of objects. The standard R function for calculating Pearson correlation can handle calculations without missing values efficiently, but is inefficient when applied to data sets with a relatively small number of missing values. We present an implementation of Pearson correlation calculation that can lead to substantial speedup on data with a relatively small number of missing entries. Further, we parallelize all calculations and thus achieve further speedup on systems where parallel processing is available. A robust correlation measure, the biweight midcorrelation, is implemented in a similar manner and provides comparable speed. The functions cor and bicor for fast Pearson and biweight midcorrelation, respectively, are part of the updated, freely available R package WGCNA. The hierarchical clustering algorithm implemented in the R function hclust is an order n^3 (n is the number of clustered objects) version of a publicly available clustering algorithm (Murtagh 2012). We present the package flashClust that implements the original algorithm, which in practice achieves order approximately n^2, leading to substantial time savings when clustering large data sets. http://www.jstatsoft.org/v46/i11/paper Pearson correlation robust correlation hierarchical clustering R |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Peter Langfelder Steve Horvath |
spellingShingle |
Peter Langfelder Steve Horvath Fast R Functions for Robust Correlations and Hierarchical Clustering Journal of Statistical Software Pearson correlation robust correlation hierarchical clustering R |
author_facet |
Peter Langfelder Steve Horvath |
author_sort |
Peter Langfelder |
title |
Fast R Functions for Robust Correlations and Hierarchical Clustering |
title_short |
Fast R Functions for Robust Correlations and Hierarchical Clustering |
title_full |
Fast R Functions for Robust Correlations and Hierarchical Clustering |
title_fullStr |
Fast R Functions for Robust Correlations and Hierarchical Clustering |
title_full_unstemmed |
Fast R Functions for Robust Correlations and Hierarchical Clustering |
title_sort |
fast r functions for robust correlations and hierarchical clustering |
publisher |
Foundation for Open Access Statistics |
series |
Journal of Statistical Software |
issn |
1548-7660 |
publishDate |
2012-01-01 |
description |
Many high-throughput biological data analyses require the calculation of large correlation matrices and/or clustering of a large number of objects. The standard R function for calculating Pearson correlation can handle calculations without missing values efficiently, but is inefficient when applied to data sets with a relatively small number of missing values. We present an implementation of Pearson correlation calculation that can lead to substantial speedup on data with a relatively small number of missing entries. Further, we parallelize all calculations and thus achieve further speedup on systems where parallel processing is available. A robust correlation measure, the biweight midcorrelation, is implemented in a similar manner and provides comparable speed. The functions cor and bicor for fast Pearson and biweight midcorrelation, respectively, are part of the updated, freely available R package WGCNA. The hierarchical clustering algorithm implemented in the R function hclust is an order n^3 (n is the number of clustered objects) version of a publicly available clustering algorithm (Murtagh 2012). We present the package flashClust that implements the original algorithm, which in practice achieves order approximately n^2, leading to substantial time savings when clustering large data sets. |
topic |
Pearson correlation robust correlation hierarchical clustering R |
url |
http://www.jstatsoft.org/v46/i11/paper |
work_keys_str_mv |
AT peterlangfelder fastrfunctionsforrobustcorrelationsandhierarchicalclustering AT stevehorvath fastrfunctionsforrobustcorrelationsandhierarchicalclustering |
_version_ |
1725256270732066816 |
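
The description above contrasts Pearson correlation with a robust alternative, the biweight midcorrelation, and mentions data with missing entries. As a rough illustration of what those two quantities compute, here is a minimal sketch in plain Python. It is not the WGCNA code and makes no attempt at the speedups the article describes. The function names (`pearson`, `pairwise_complete`, `bicor_sketch`) are hypothetical, `None` is assumed to mark a missing value, and the weighting uses the commonly stated biweight definition (deviations from the median scaled by 9 times the median absolute deviation).

```python
import math
from statistics import median


def pearson(x, y):
    """Plain Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    return sxy / math.sqrt(sxx * syy)


def pairwise_complete(x, y):
    """Keep only positions where both values are present.

    Assumption for this sketch: None marks a missing entry.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def _biweight_scaled(x):
    """Median-centred, biweight-weighted, unit-norm version of x."""
    med = median(x)
    mad = median([abs(a - med) for a in x])
    if mad == 0:
        # The sketch does not handle this degenerate case.
        raise ValueError("median absolute deviation is zero")
    weighted = []
    for a in x:
        u = (a - med) / (9.0 * mad)
        # Observations far from the median (|u| >= 1) get weight zero.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        weighted.append((a - med) * w)
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]


def bicor_sketch(x, y):
    """Biweight midcorrelation of two complete, equal-length sequences."""
    xs = _biweight_scaled(x)
    ys = _biweight_scaled(y)
    return sum(a * b for a, b in zip(xs, ys))


# Usage: one gross outlier flips the sign of the Pearson correlation,
# while the biweight midcorrelation down-weights it to zero.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, -100]
r_pearson = pearson(x, y)
r_bicor = bicor_sketch(x, y)

# Usage: drop positions with a missing value before correlating.
cx, cy = pairwise_complete([1, None, 3, 4], [2, 5, None, 8])
```

In R itself, the functions named in the description (`cor` and `bicor` in WGCNA, and the flashClust package) would be used in place of a hand-written loop like this.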