A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines

Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destinatio...


Bibliographic Details
Main Authors: Salah Taamneh, Mo’taz Al-Hami, Hani Bani-Salameh, Alaa E. Abdallah
Format: Article
Language: English
Published: MDPI AG 2021-07-01
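Series: Data
Subjects: k-means clustering; distributed k-means algorithm; actor model; active replication; passive replication; peer-to-peer network
Online Access: https://www.mdpi.com/2306-5729/6/7/73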
id doaj-65f8762e51e84cb498cecd3b4ba37c53
record_format Article
spelling doaj-65f8762e51e84cb498cecd3b4ba37c53
2021-07-23T13:36:50Z
eng
MDPI AG
Data
2306-5729
2021-07-01
67373
10.3390/data6070073
A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines
Salah Taamneh (0)
Mo’taz Al-Hami (1)
Hani Bani-Salameh (2)
Alaa E. Abdallah (3)
Department of Computer Science, The Hashemite University, Zarqa 13133, Jordan
Department of Computer Information Systems, The Hashemite University, Zarqa 13133, Jordan
Department of Software Engineering, The Hashemite University, Zarqa 13133, Jordan
Department of Computer Science, The Hashemite University, Zarqa 13133, Jordan
Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destination. As a result, fault tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed on a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algorithm with no fault-tolerant mechanisms achieved up to a 34% improvement over the Hadoop-based k-means algorithm, while the robust one achieved up to a 12% improvement. The experiments also showed that the overhead, using such techniques, was negligible. Moreover, the results indicated that losing up to 10% of the messages had no real impact on the overall performance.
https://www.mdpi.com/2306-5729/6/7/73
k-means clustering
distributed k-means algorithm
actor model
active replication
passive replication
peer-to-peer network
collection DOAJ
language English
format Article
sources DOAJ
author Salah Taamneh
Mo’taz Al-Hami
Hani Bani-Salameh
Alaa E. Abdallah
author_facet Salah Taamneh
Mo’taz Al-Hami
Hani Bani-Salameh
Alaa E. Abdallah
author_sort Salah Taamneh
title A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines
title_sort robust distributed clustering of large data sets on a grid of commodity machines
publisher MDPI AG
series Data
issn 2306-5729
publishDate 2021-07-01
description Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destination. As a result, fault tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed on a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algorithm with no fault-tolerant mechanisms achieved up to a 34% improvement over the Hadoop-based k-means algorithm, while the robust one achieved up to a 12% improvement. The experiments also showed that the overhead, using such techniques, was negligible. Moreover, the results indicated that losing up to 10% of the messages had no real impact on the overall performance.
topic k-means clustering
distributed k-means algorithm
actor model
active replication
passive replication
peer-to-peer network
url https://www.mdpi.com/2306-5729/6/7/73
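The description field of this record mentions a leader node, replicated workers, and tolerance of lost messages. As a rough illustration only, the following Python sketch shows one k-means iteration in which each data partition is handed to several replica workers and the leader keeps the first reply that arrives. It is not the authors' implementation: the function names (`nearest`, `worker_step`, `leader_step`), the `replicas` and `failed` parameters, and the choice to skip a partition whose replies are all lost are assumptions made for this example, and the gossip protocol, actor model, and passive replication of the leader are not modeled.

```python
from typing import FrozenSet, List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Tuple[List[List[float]], List[int]]  # per-centroid sums and counts


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def worker_step(partition: Sequence[Point], centroids: Sequence[Point]) -> Partial:
    """Hypothetical worker: partial sums and counts for its own partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        j = nearest(point, centroids)
        counts[j] += 1
        for d, value in enumerate(point):
            sums[j][d] += value
    return sums, counts


def leader_step(
    partitions: Sequence[Sequence[Point]],
    centroids: Sequence[Point],
    replicas: int = 2,
    failed: FrozenSet[Tuple[int, int]] = frozenset(),
) -> List[Point]:
    """One k-means iteration with each partition sent to `replicas` workers.

    `failed` holds (partition_index, replica_index) pairs whose reply is lost;
    it stands in for unreachable nodes or dropped messages (an assumption).
    """
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for i, partition in enumerate(partitions):
        # Active replication: every replica computes the same partial result,
        # and the leader uses the first reply that actually arrives.
        replies = [
            worker_step(partition, centroids)
            for r in range(replicas)
            if (i, r) not in failed
        ]
        if not replies:
            continue  # assumption: a fully lost partition is skipped this round
        sums, counts = replies[0]
        for j in range(len(centroids)):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / total_counts[j] for s in total_sums[j])
        if total_counts[j]
        else tuple(centroids[j])
        for j in range(len(centroids))
    ]
```

Calling `leader_step` repeatedly until the centroids stop changing gives the usual k-means loop. Losing one replica of a partition does not change the result, because the surviving replica returns the same partial sums.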