A systematic performance evaluation of clustering methods for single-cell RNA-seq data [version 2; referees: 2 approved]

Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we pro...

Bibliographic Details
Main Authors: Angelo Duò, Mark D. Robinson, Charlotte Soneson
Format: Article
Language: English
Published: F1000 Research Ltd 2018-09-01
Series: F1000Research
Online Access: https://f1000research.com/articles/7-1141/v2
id doaj-776c748a49e746c1978ee1b7c6266b6d
record_format Article
spelling doaj-776c748a49e746c1978ee1b7c6266b6d
2020-11-25T03:30:20Z
eng
F1000 Research Ltd
F1000Research
2046-1402
2018-09-01
7
10.12688/f1000research.15666.2
17687
A systematic performance evaluation of clustering methods for single-cell RNA-seq data [version 2; referees: 2 approved]
Angelo Duò (0)
Mark D. Robinson (1)
Charlotte Soneson (2)
Institute of Molecular Life Sciences, University of Zurich, Zurich, 8057, Switzerland (listed for each of the three authors)
collection DOAJ
language English
format Article
sources DOAJ
author Angelo Duò
Mark D. Robinson
Charlotte Soneson
title A systematic performance evaluation of clustering methods for single-cell RNA-seq data [version 2; referees: 2 approved]
publisher F1000 Research Ltd
series F1000Research
issn 2046-1402
publishDate 2018-09-01
description Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 14 clustering algorithms implemented in R, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using nine publicly available scRNA-seq data sets as well as three simulations with varying degrees of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the performance of the clustering algorithms themselves. We evaluated the methods' ability to recover known subpopulations, their stability, and their run time and scalability. Additionally, we investigated whether the performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in performance, run time and stability between the methods, with SC3 and Seurat showing the most favorable results. Additionally, we found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering. All the code used for the evaluation is available on GitHub (https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison). In addition, an R package providing access to data and clustering results, thereby facilitating inclusion of new methods and data sets, is available from Bioconductor (https://bioconductor.org/packages/DuoClustering2018).
url https://f1000research.com/articles/7-1141/v2