A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines
Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destinatio...
Main Authors: | Salah Taamneh, Mo’taz Al-Hami, Hani Bani-Salameh, Alaa E. Abdallah |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-07-01 |
Series: | Data |
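Subjects: | k-means clustering, distributed k-means algorithm, actor model, active replication, passive replication, peer-to-peer network |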
Online Access: | https://www.mdpi.com/2306-5729/6/7/73 |
id |
doaj-65f8762e51e84cb498cecd3b4ba37c53 |
---|---|
record_format |
Article |
spelling |
doaj-65f8762e51e84cb498cecd3b4ba37c53; 2021-07-23T13:36:50Z; eng; MDPI AG; Data; 2306-5729; 2021-07-01; 6; 7; 73; 73; 10.3390/data6070073; A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines; Salah Taamneh 0; Mo’taz Al-Hami 1; Hani Bani-Salameh 2; Alaa E. Abdallah 3; Department of Computer Science, The Hashemite University, Zarqa 13133, Jordan; Department of Computer Information Systems, The Hashemite University, Zarqa 13133, Jordan; Department of Software Engineering, The Hashemite University, Zarqa 13133, Jordan; Department of Computer Science, The Hashemite University, Zarqa 13133, Jordan; Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destination. As a result, fault tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed on a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algorithm with no fault-tolerant mechanisms achieved up to a 34% improvement over the Hadoop-based k-means algorithm, while the robust one achieved up to a 12% improvement. The experiments also showed that the overhead, using such techniques, was negligible. Moreover, the results indicated that losing up to 10% of the messages had no real impact on the overall performance.; https://www.mdpi.com/2306-5729/6/7/73; k-means clustering; distributed k-means algorithm; actor model; active replication; passive replication; peer-to-peer network |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Salah Taamneh Mo’taz Al-Hami Hani Bani-Salameh Alaa E. Abdallah |
spellingShingle |
Salah Taamneh Mo’taz Al-Hami Hani Bani-Salameh Alaa E. Abdallah A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines Data k-means clustering distributed k-means algorithm actor model active replication passive replication peer-to-peer network |
author_facet |
Salah Taamneh Mo’taz Al-Hami Hani Bani-Salameh Alaa E. Abdallah |
author_sort |
Salah Taamneh |
title |
A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines |
title_short |
A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines |
title_full |
A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines |
title_fullStr |
A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines |
title_full_unstemmed |
A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines |
title_sort |
robust distributed clustering of large data sets on a grid of commodity machines |
publisher |
MDPI AG |
series |
Data |
issn |
2306-5729 |
publishDate |
2021-07-01 |
description |
Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destination. As a result, fault tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed on a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algorithm with no fault-tolerant mechanisms achieved up to a 34% improvement over the Hadoop-based k-means algorithm, while the robust one achieved up to a 12% improvement. The experiments also showed that the overhead, using such techniques, was negligible. Moreover, the results indicated that losing up to 10% of the messages had no real impact on the overall performance. |
topic |
k-means clustering distributed k-means algorithm actor model active replication passive replication peer-to-peer network |
url |
https://www.mdpi.com/2306-5729/6/7/73 |
work_keys_str_mv |
AT salahtaamneh arobustdistributedclusteringoflargedatasetsonagridofcommoditymachines AT motazalhami arobustdistributedclusteringoflargedatasetsonagridofcommoditymachines AT hanibanisalameh arobustdistributedclusteringoflargedatasetsonagridofcommoditymachines AT alaaeabdallah arobustdistributedclusteringoflargedatasetsonagridofcommoditymachines AT salahtaamneh robustdistributedclusteringoflargedatasetsonagridofcommoditymachines AT motazalhami robustdistributedclusteringoflargedatasetsonagridofcommoditymachines AT hanibanisalameh robustdistributedclusteringoflargedatasetsonagridofcommoditymachines AT alaaeabdallah robustdistributedclusteringoflargedatasetsonagridofcommoditymachines |
_version_ |
1721288783974891520 |
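
The description field above summarizes a distributed k-means algorithm in which workers process partitions of the data and a leader combines their results, and it reports that losing some messages had little effect. The following is a minimal single-process sketch of that general idea, not the authors' implementation: it does not model the actor system, the gossip protocol, or the replication techniques the record mentions. All names (`assign_and_sum`, `merge`, `distributed_kmeans`, `loss_rate`) are hypothetical, and message loss is only simulated by randomly discarding a worker's partial result.

```python
import random


def assign_and_sum(points, centroids):
    """Worker step (illustrative): assign each local point to its nearest
    centroid and return per-cluster coordinate sums and point counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Squared Euclidean distance is enough to pick the nearest centroid.
        nearest = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def merge(partials, centroids):
    """Leader step (illustrative): combine the partial results that arrived.
    A cluster with no reported points keeps its previous centroid."""
    k, dim = len(centroids), len(centroids[0])
    merged = []
    for c in range(k):
        n = sum(counts[c] for _, counts in partials)
        if n == 0:
            merged.append(list(centroids[c]))
        else:
            merged.append(
                [sum(sums[c][d] for sums, _ in partials) / n for d in range(dim)]
            )
    return merged


def distributed_kmeans(partitions, centroids, iterations=10, loss_rate=0.0, seed=0):
    """Run k-means over data partitions. Each partition stands in for one
    worker; loss_rate is the assumed probability that a worker's message
    to the leader is dropped in a given iteration."""
    rng = random.Random(seed)
    for _ in range(iterations):
        partials = []
        for part in partitions:
            result = assign_and_sum(part, centroids)
            if rng.random() >= loss_rate:  # message delivered
                partials.append(result)
        centroids = merge(partials, centroids)
    return centroids
```

Calling `distributed_kmeans(partitions, initial_centroids, loss_rate=0.1)` runs the same loop while discarding roughly one in ten worker messages per iteration, which gives a rough way to explore how tolerant the centroid updates are to dropped partial results in this simplified setting.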