Evaluating Clustering Techniques over Big Data in Distributed Infrastructures

Clustering is the process of grouping a set of objects so that objects in the same group are more similar, in some sense, to each other than to those in other groups. It is used in many fields, including machine learning, image recognition, pattern recognition and knowledge discovery. In th...

Bibliographic Details
Main Author:	Shetty, Kartik
Other Authors:	Mohamed Y. Eltabakh, Advisor
Format:	Others
Published:	Digital WPI 2018
Subjects:	clustering hadoop
Online Access:	https://digitalcommons.wpi.edu/etd-theses/1226 https://digitalcommons.wpi.edu/cgi/viewcontent.cgi?article=2225&context=etd-theses

id	ndltd-wpi.edu-oai-digitalcommons.wpi.edu-etd-theses-2225
record_format	oai_dc
spelling	ndltd-wpi.edu-oai-digitalcommons.wpi.edu-etd-theses-2225 2019-03-22T05:48:40Z Evaluating Clustering Techniques over Big Data in Distributed Infrastructures Shetty, Kartik Clustering is the process of grouping a set of objects so that objects in the same group are more similar, in some sense, to each other than to those in other groups. It is used in many fields, including machine learning, image recognition, pattern recognition and knowledge discovery. In this era of Big Data, we can leverage the computing power of a distributed environment to cluster large datasets. Clustering can be achieved through various algorithms, but in general they have high time complexities. For large datasets, scalability and the parameters of the environment in which the algorithm runs become issues that need to be addressed. A brute-force implementation is therefore not scalable over large datasets even in a distributed environment, which calls for an approximation technique or optimization to make it scalable. We study three clustering techniques, CURE, DBSCAN and k-means, over a distributed environment such as Hadoop. For each of these algorithms, we examine its performance trade-offs and bottlenecks and then propose enhancements, optimizations or an approximation technique to make it scalable in Hadoop. Finally, we evaluate its performance and suitability for datasets of different sizes and distributions. 2018-04-25T07:00:00Z text application/pdf https://digitalcommons.wpi.edu/etd-theses/1226 https://digitalcommons.wpi.edu/cgi/viewcontent.cgi?article=2225&context=etd-theses Masters Theses (All Theses, All Years) Digital WPI Mohamed Y. Eltabakh, Advisor Dmitry Korkin, Reader clustering hadoop
collection	NDLTD
format	Others
sources	NDLTD
topic	clustering hadoop
spellingShingle	clustering hadoop Shetty, Kartik Evaluating Clustering Techniques over Big Data in Distributed Infrastructures
description	Clustering is the process of grouping a set of objects so that objects in the same group are more similar, in some sense, to each other than to those in other groups. It is used in many fields, including machine learning, image recognition, pattern recognition and knowledge discovery. In this era of Big Data, we can leverage the computing power of a distributed environment to cluster large datasets. Clustering can be achieved through various algorithms, but in general they have high time complexities. For large datasets, scalability and the parameters of the environment in which the algorithm runs become issues that need to be addressed. A brute-force implementation is therefore not scalable over large datasets even in a distributed environment, which calls for an approximation technique or optimization to make it scalable. We study three clustering techniques, CURE, DBSCAN and k-means, over a distributed environment such as Hadoop. For each of these algorithms, we examine its performance trade-offs and bottlenecks and then propose enhancements, optimizations or an approximation technique to make it scalable in Hadoop. Finally, we evaluate its performance and suitability for datasets of different sizes and distributions.
author2	Mohamed Y. Eltabakh, Advisor
author_facet	Mohamed Y. Eltabakh, Advisor Shetty, Kartik
author	Shetty, Kartik
author_sort	Shetty, Kartik
title	Evaluating Clustering Techniques over Big Data in Distributed Infrastructures
title_sort	evaluating clustering techniques over big data in distributed infrastructures
publisher	Digital WPI
publishDate	2018
url	https://digitalcommons.wpi.edu/etd-theses/1226 https://digitalcommons.wpi.edu/cgi/viewcontent.cgi?article=2225&context=etd-theses
work_keys_str_mv	AT shettykartik evaluatingclusteringtechniquesoverbigdataindistributedinfrastructures
_version_	1719006344817672192

Similar Items