A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters

Master's === 國立中央大學 === 資訊工程研究所 === 100 === The efficiency of clustering algorithms for large-scale data analysis has recently become a major issue in many application domains. First, computation time and storage space increase dramatically as the amount of data grows. Second, it is hard to determ...

Bibliographic Details
Main Authors:	Shi-Shan Chen, 陳詩姍
Other Authors:	Wei-Jen Wang
Format:	Others
Language:	zh-TW
Published:	2012
Online Access:	http://ndltd.ncl.edu.tw/handle/36517890026743833619

id	ndltd-TW-100NCU05392053
record_format	oai_dc
spelling	ndltd-TW-100NCU05392053 2015-10-13T21:22:38Z http://ndltd.ncl.edu.tw/handle/36517890026743833619 A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters 基於圖形的平行化最小生成樹分群演算法 Shi-Shan Chen 陳詩姍 Master's 國立中央大學 資訊工程研究所 100 Wei-Jen Wang 王尉任 2012 學位論文 ; thesis 73 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === 國立中央大學 === 資訊工程研究所 === 100 === The efficiency of clustering algorithms for large-scale data analysis has recently become a major issue in many application domains. First, computation time and storage space increase dramatically as the amount of data grows. Second, it is hard to determine the number of clusters automatically. To cluster datasets within reasonable computation time and storage space, we propose a parallel computing strategy based on the concepts of graph-based clustering and granular computing. The proposed strategy automatically determines the best number of clusters for a large dataset and effectively reduces the computation time and storage space required. Based on this strategy, we devise two clustering algorithms and implement them with the Message Passing Interface (MPI). Both algorithms use the minimum spanning tree structure to determine the best number of clusters automatically. The first algorithm is called Para-CPLM (Parallel Clustering based on Partitions of Local Minimal-spanning-trees), and the second is called Para-CGM (Parallel Clustering based on a Global Minimal-spanning-tree). Para-CPLM partitions the data domain into several blocks (hyper-rectangles) according to the dimensions of the dataset, uses a parallel method to distribute the data uniformly across the blocks, and then builds a local minimal spanning tree in each block. Once every local minimal spanning tree has been built, it combines the local trees according to the closest Euclidean distance and then applies the GBC method to cluster them. After this first clustering, it checks the distances between clusters and finds the closest Euclidean distance among them that conforms to the rules of the minimal spanning tree.
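The block-partitioning step described above can be illustrated with a minimal sketch. It assumes an equal-width grid over the bounding box of the data; the abstract does not say how the blocks are sized or how the distribution is kept uniform, so the function name `assign_to_blocks` and the parameter `splits_per_dim` are hypothetical. In an MPI setting each block would presumably be handled by one process, but this sketch runs serially.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Cell = Tuple[int, ...]


def assign_to_blocks(points: Sequence[Point], splits_per_dim: int) -> Dict[Cell, List[Point]]:
    """Group points by the hyper-rectangular grid cell that contains them.

    Assumption: every dimension is cut into `splits_per_dim` equal-width
    intervals over the bounding box of the data. This is an illustrative
    choice, not the partitioning rule of the thesis.
    """
    if not points:
        return {}
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    blocks: Dict[Cell, List[Point]] = defaultdict(list)
    for p in points:
        cell = []
        for d in range(dims):
            width = (highs[d] - lows[d]) / splits_per_dim
            if width == 0:
                index = 0  # all points share this coordinate
            else:
                # Clamp so the maximum coordinate falls in the last interval.
                index = min(int((p[d] - lows[d]) / width), splits_per_dim - 1)
            cell.append(index)
        blocks[tuple(cell)].append(p)
    return dict(blocks)
```

An equal-width grid does not guarantee balanced blocks when the data are skewed, so a real implementation would likely need a different splitting rule to achieve a uniform distribution.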
Para-CGM constructs a global minimal spanning tree in parallel and applies the GBC method to find the best number of clusters. It requires a given threshold value to process the data efficiently. In our experiments, Para-CPLM has a significantly shorter execution time than the GBC method and produces almost the same clustering results. In contrast, Para-CGM improves execution time less than Para-CPLM does, provided that enough computing resources are available to Para-CPLM. However, it still outperforms the GBC method and produces the same clustering results.
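As a rough, serial illustration of clustering with a minimum spanning tree, the sketch below builds the tree with Prim's algorithm over Euclidean distances and then removes long edges, so that the remaining connected components form the clusters. The abstract does not describe how the GBC method chooses the best number of clusters, so a fixed `cut_threshold` is used here as a stand-in; it is a hypothetical parameter and not necessarily the threshold mentioned in the abstract. The function names are likewise assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Edge = Tuple[float, int, int]  # (length, index of one end, index of other end)


def prim_mst(points: Sequence[Point]) -> List[Edge]:
    """Return the edges of a Euclidean minimum spanning tree (O(n^2) Prim)."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known edge from each point into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges: List[Edge] = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def clusters_from_mst(n: int, edges: Sequence[Edge], cut_threshold: float) -> List[List[int]]:
    """Drop tree edges longer than `cut_threshold`; return the components."""
    root = list(range(n))

    def find(x: int) -> int:
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for length, i, j in edges:
        if length <= cut_threshold:
            root[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

The number of clusters then falls out of the cut rather than being supplied up front: removing k edges from a spanning tree leaves k + 1 components.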
author2	Wei-Jen Wang
author_facet	Wei-Jen Wang Shi-Shan Chen 陳詩姍
author	Shi-Shan Chen 陳詩姍
author_sort	Shi-Shan Chen
title	A Parallel Minimal-Spanning-Tree-Based Clustering Algorithm with Self-Detectability of the Best Number of Clusters
publishDate	2012
url	http://ndltd.ncl.edu.tw/handle/36517890026743833619

Similar Items