Distributed Clustering for Homogeneous and Heterogeneous Databases

Master's === National Chiao Tung University === Institute of Computer Science and Engineering === 105 === Clustering is an important technique widely used in many fields. Traditional centralized clustering approaches require transferring all distributed data to a single site and then clustering it there. However, due to bandwidth limitations, communication cost, and pri...

Bibliographic Details
Main Authors:	Huang, Chih-Yuan, 黃致遠
Other Authors:	Lin, Ja-Chen
Format:	Others
Language:	en_US
Published:	2017
Online Access:	http://ndltd.ncl.edu.tw/handle/84u65s

id	ndltd-TW-105NCTU5394110
record_format	oai_dc
spelling	ndltd-TW-105NCTU5394110 2019-05-16T00:08:09Z http://ndltd.ncl.edu.tw/handle/84u65s Distributed Clustering for Homogeneous and Heterogeneous Databases 同質性與異質性資料庫之分散式分群 Huang, Chih-Yuan 黃致遠 Master's, National Chiao Tung University, Institute of Computer Science and Engineering, 105 Lin, Ja-Chen 林志青 2017 thesis 50 en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Master's === National Chiao Tung University === Institute of Computer Science and Engineering === 105 === Clustering is an important technique widely used in many fields. Traditional centralized clustering approaches require transferring all distributed data to a single site and then clustering it there. However, because of bandwidth limitations, communication cost, and privacy concerns, it is often infeasible to transfer all data to a central site. Distributed clustering algorithms are a good alternative that overcomes these shortcomings. In this study, our distributed clustering approaches address both homogeneous and heterogeneous database structures. We propose two methods to solve these distributed clustering problems. Our first method, the vector-quantization (VQ) method, finds representatives at each local site and then uses all of the local representatives to determine the final clustering result for the whole data set. Our second method, the affinity-aggregation (AA) method, achieves a greater saving in transmission cost. Instead of transferring the complete local representatives, it transfers to the central site only representative information or some important characteristics of the representatives. This reduced information is then used to determine the pairwise similarity of the whole data set, from which the clustering result is obtained. Three datasets, Iris, Wine, and Breast Cancer, are tested in our experiments. To simulate the homogeneous and heterogeneous database structures, we split each entire dataset into several subsets. Our two distributed clustering approaches efficiently reduce the transmission cost. Furthermore, their clustering results are very close to those of centralized clustering approaches, even though the latter use the complete information of the whole data set.
author2	Lin, Ja-Chen
author_facet	Lin, Ja-Chen Huang, Chih-Yuan 黃致遠
author	Huang, Chih-Yuan 黃致遠
spellingShingle	Huang, Chih-Yuan 黃致遠 Distributed Clustering for Homogeneous and Heterogeneous Databases
author_sort	Huang, Chih-Yuan
title	Distributed Clustering for Homogeneous and Heterogeneous Databases
title_short	Distributed Clustering for Homogeneous and Heterogeneous Databases
title_full	Distributed Clustering for Homogeneous and Heterogeneous Databases
title_fullStr	Distributed Clustering for Homogeneous and Heterogeneous Databases
title_full_unstemmed	Distributed Clustering for Homogeneous and Heterogeneous Databases
title_sort	distributed clustering for homogeneous and heterogeneous databases
publishDate	2017
url	http://ndltd.ncl.edu.tw/handle/84u65s
work_keys_str_mv	AT huangchihyuan distributedclusteringforhomogeneousandheterogeneousdatabases AT huángzhìyuǎn distributedclusteringforhomogeneousandheterogeneousdatabases AT huangchihyuan tóngzhìxìngyǔyìzhìxìngzīliàokùzhīfēnsànshìfēnqún AT huángzhìyuǎn tóngzhìxìngyǔyìzhìxìngzīliàokùzhīfēnsànshìfēnqún
_version_	1719160636446867456
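
The abstract above does not spell out the details of the VQ method, so the following is only a rough sketch of the general idea it describes: each local site compresses its records into a few representatives, and the central site clusters those representatives instead of the raw data. It assumes that the representatives are k-means codewords (one common way to build a VQ codebook), that each site also sends the number of records behind each codeword, and that a homogeneous structure means every site holds different records with the same attributes. All function names, parameters, and the synthetic two-blob data are hypothetical and are not taken from the thesis.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, weights=None, iters=50, seed=0):
    """Plain Lloyd k-means with optional per-point weights; returns k centroids."""
    rng = random.Random(seed)
    weights = weights or [1.0] * len(points)
    dim = len(points[0])
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == j]
            if not members:
                continue  # keep the old centroid if its cluster is empty
            total = sum(w for _, w in members)
            centroids[j] = [sum(p[d] * w for p, w in members) / total
                            for d in range(dim)]
    return centroids


def local_representatives(site_points, n_reps, seed=0):
    """Runs at one local site: summarize its records as (codewords, counts)."""
    codebook = kmeans(site_points, n_reps, seed=seed)
    labels = assign(site_points, codebook)
    counts = [labels.count(j) for j in range(n_reps)]
    # Only non-empty codewords are worth transmitting.
    kept = [(c, n) for c, n in zip(codebook, counts) if n > 0]
    return [c for c, _ in kept], [n for _, n in kept]


def central_clustering(site_summaries, k, seed=0):
    """Runs at the central site: cluster all received representatives,
    weighting each one by the number of records it stands for."""
    reps = [c for codebook, _ in site_summaries for c in codebook]
    weights = [float(n) for _, counts in site_summaries for n in counts]
    return kmeans(reps, k, weights=weights, seed=seed)


# Synthetic data (hypothetical): two well-separated 2-D blobs of 40 points each.
rng = random.Random(1)
blob_a = [[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(40)]
blob_b = [[rng.gauss(5, 0.3), rng.gauss(5, 0.3)] for _ in range(40)]
data = blob_a + blob_b

# Assumed homogeneous split: same attributes, different records per site.
sites = [data[0::2], data[1::2]]

summaries = [local_representatives(s, n_reps=4, seed=i) for i, s in enumerate(sites)]
centers = central_clustering(summaries, k=2)
labels = assign(data, centers)

sent = sum(len(codebook) for codebook, _ in summaries)
print(f"records held locally: {len(data)}, representatives transmitted: {sent}")
```

In this sketch only the codewords and their counts leave each site, which is where the reduction in transmission cost comes from. Labeling the full data set at the end is done here only to inspect the result; in a real deployment each site could label its own records from the returned centers.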

Similar Items