ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.

The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computa...


Bibliographic Details
Main Authors:	Yunpeng Cai, Wei Zheng, Jin Yao, Yujie Yang, Volker Mai, Qi Mao, Yijun Sun
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2017-04-01
Series:	PLoS Computational Biology
Online Access:	http://europepmc.org/articles/PMC5421816?pdf=render

id	doaj-e71b7e187f314d239e27493243bfdb71
record_format	Article
spelling	doaj-e71b7e187f314d239e27493243bfdb71; 2020-11-25T02:19:34Z; eng; Public Library of Science (PLoS); PLoS Computational Biology; ISSN 1553-734X, 1553-7358; 2017-04-01; vol. 13, no. 4, e1005518; doi:10.1371/journal.pcbi.1005518
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Yunpeng Cai, Wei Zheng, Jin Yao, Yujie Yang, Volker Mai, Qi Mao, Yijun Sun
author_sort	Yunpeng Cai
title	ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.
publisher	Public Library of Science (PLoS)
series	PLoS Computational Biology
issn	1553-734X 1553-7358
publishDate	2017-04-01
description	The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, hierarchical clustering of extremely large sequence datasets is currently computationally expensive because of its quadratic time and space complexity. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity while maintaining a high clustering accuracy comparable to that of the standard method. The basic idea is to organize sequences into a pseudo-metric-based partitioning tree for sub-linear time searching of nearest neighbors, and then to use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the Human Microbiome Project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiment demonstrated that, with the power of parallel computing, it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html.
url	http://europepmc.org/articles/PMC5421816?pdf=render
_version_	1724875890285871104

Similar Items