Evaluating Clustering Techniques over Big Data in Distributed Infrastructures

Clustering is the process of grouping a set of objects so that objects in the same group are more similar, in some sense, to each other than to those in other groups. It is used in many fields, including machine learning, image recognition, pattern recognition and knowledge discovery. In th...

Bibliographic Details
Main Author: Shetty, Kartik
Other Authors: Mohamed Y. Eltabakh, Advisor
Format: Others
Published: Digital WPI 2018
Subjects: clustering, hadoop
Online Access: https://digitalcommons.wpi.edu/etd-theses/1226
https://digitalcommons.wpi.edu/cgi/viewcontent.cgi?article=2225&context=etd-theses
id ndltd-wpi.edu-oai-digitalcommons.wpi.edu-etd-theses-2225
record_format oai_dc
spelling ndltd-wpi.edu-oai-digitalcommons.wpi.edu-etd-theses-2225 2019-03-22T05:48:40Z Evaluating Clustering Techniques over Big Data in Distributed Infrastructures Shetty, Kartik Clustering is the process of grouping a set of objects so that objects in the same group are more similar, in some sense, to each other than to those in other groups. It is used in many fields, including machine learning, image recognition, pattern recognition and knowledge discovery. In this era of Big Data, we can leverage the computing power of a distributed environment to cluster large datasets. Clustering can be achieved through various algorithms, but in general they have high time complexities. For large datasets, scalability and the parameters of the environment in which the algorithm runs become issues that need to be addressed. A brute-force implementation is therefore not scalable over large datasets even in a distributed environment, which calls for an approximation technique or optimization to make it scalable. We study three clustering techniques, CURE, DBSCAN and k-means, over a distributed environment such as Hadoop. For each of these algorithms we examine its performance trade-offs and bottlenecks and then propose enhancements, optimizations or an approximation technique to make it scalable in Hadoop. Finally, we evaluate each technique's performance and suitability for datasets of different sizes and distributions. 2018-04-25T07:00:00Z text application/pdf https://digitalcommons.wpi.edu/etd-theses/1226 https://digitalcommons.wpi.edu/cgi/viewcontent.cgi?article=2225&context=etd-theses Masters Theses (All Theses, All Years) Digital WPI Mohamed Y. Eltabakh, Advisor Dmitry Korkin, Reader clustering hadoop
collection NDLTD
format Others
sources NDLTD
topic clustering
hadoop
spellingShingle clustering
hadoop
Shetty, Kartik
Evaluating Clustering Techniques over Big Data in Distributed Infrastructures
description Clustering is the process of grouping a set of objects so that objects in the same group are more similar, in some sense, to each other than to those in other groups. It is used in many fields, including machine learning, image recognition, pattern recognition and knowledge discovery. In this era of Big Data, we can leverage the computing power of a distributed environment to cluster large datasets. Clustering can be achieved through various algorithms, but in general they have high time complexities. For large datasets, scalability and the parameters of the environment in which the algorithm runs become issues that need to be addressed. A brute-force implementation is therefore not scalable over large datasets even in a distributed environment, which calls for an approximation technique or optimization to make it scalable. We study three clustering techniques, CURE, DBSCAN and k-means, over a distributed environment such as Hadoop. For each of these algorithms we examine its performance trade-offs and bottlenecks and then propose enhancements, optimizations or an approximation technique to make it scalable in Hadoop. Finally, we evaluate each technique's performance and suitability for datasets of different sizes and distributions.
author2 Mohamed Y. Eltabakh, Advisor
author_facet Mohamed Y. Eltabakh, Advisor
Shetty, Kartik
author Shetty, Kartik
author_sort Shetty, Kartik
title Evaluating Clustering Techniques over Big Data in Distributed Infrastructures
title_short Evaluating Clustering Techniques over Big Data in Distributed Infrastructures
title_full Evaluating Clustering Techniques over Big Data in Distributed Infrastructures
title_fullStr Evaluating Clustering Techniques over Big Data in Distributed Infrastructures
title_full_unstemmed Evaluating Clustering Techniques over Big Data in Distributed Infrastructures
title_sort evaluating clustering techniques over big data in distributed infrastructures
publisher Digital WPI
publishDate 2018
url https://digitalcommons.wpi.edu/etd-theses/1226
https://digitalcommons.wpi.edu/cgi/viewcontent.cgi?article=2225&context=etd-theses
work_keys_str_mv AT shettykartik evaluatingclusteringtechniquesoverbigdataindistributedinfrastructures
_version_ 1719006344817672192