SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA

Learning is the process of generating useful information from a huge volume of data. Learning can be either supervised (e.g. classification) or unsupervised (e.g. clustering). Clustering is the process of grouping a set of physical objects into classes of similar objects. Objects in...

Bibliographic Details
Main Authors: S. Anitha Elavarasi, J. Akilandeswari
Format: Article
Language: English
Published: ICT Academy of Tamil Nadu 2014-01-01
Series: ICTACT Journal on Soft Computing
Subjects: Clustering; Categorical Data; Time Complexity; Similarity Measure; Data Mining Tools
Online Access: http://ictactjournals.in/paper/7_Paper_715_722.pdf
id doaj-b605bf758868465cb63b9d604b475920
record_format Article
spelling doaj-b605bf758868465cb63b9d604b475920
2020-11-25T02:01:06Z
eng
ICT Academy of Tamil Nadu
ICTACT Journal on Soft Computing
0976-6561
2229-6956
2014-01-01
4
2
715
722
SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA
S. Anitha Elavarasi 0
J. Akilandeswari 1
0 Department of Computer Science and Engineering, Sona College of Technology, India
1 Department of Information Technology, Sona College of Technology, India
Learning is the process of generating useful information from a huge volume of data. Learning can be either supervised (e.g. classification) or unsupervised (e.g. clustering). Clustering is the process of grouping a set of physical objects into classes of similar objects. Objects in the real world consist of both numerical and categorical data. Categorical data cannot be analyzed like numerical data because they lack an inherent ordering. This paper describes about ten different clustering algorithms, their methodology, and the factors influencing their performance. Each algorithm is evaluated using real-world datasets, and its pros and cons are specified. The various similarity/dissimilarity measures applied to categorical data and their performance are also discussed. Time complexity describes the amount of time an algorithm takes to perform its elementary operations. The time complexity of the various algorithms is discussed, and their performance on real-world data such as mushroom, zoo, soya bean, cancer, vote, car and iris is measured. In this survey, cluster accuracy and error rate are evaluated for four different clustering algorithms (K-modes, fuzzy K-modes, ROCK and Squeezer), two different similarity measures (DISC and Overlap), and DILCA applied to hierarchical and partitional algorithms.
http://ictactjournals.in/paper/7_Paper_715_722.pdf
Clustering
Categorical Data
Time Complexity
Similarity Measure
Data Mining Tools
collection DOAJ
language English
format Article
sources DOAJ
author S. Anitha Elavarasi
J. Akilandeswari
spellingShingle S. Anitha Elavarasi
J. Akilandeswari
SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA
ICTACT Journal on Soft Computing
Clustering
Categorical Data
Time Complexity
Similarity Measure
Data Mining Tools
author_facet S. Anitha Elavarasi
J. Akilandeswari
author_sort S. Anitha Elavarasi
title SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA
title_short SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA
title_full SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA
title_fullStr SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA
title_full_unstemmed SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA
title_sort survey on clustering algorithm and similarity measure for categorical data
publisher ICT Academy of Tamil Nadu
series ICTACT Journal on Soft Computing
issn 0976-6561
2229-6956
publishDate 2014-01-01
description Learning is the process of generating useful information from a huge volume of data. Learning can be either supervised (e.g. classification) or unsupervised (e.g. clustering). Clustering is the process of grouping a set of physical objects into classes of similar objects. Objects in the real world consist of both numerical and categorical data. Categorical data cannot be analyzed like numerical data because they lack an inherent ordering. This paper describes about ten different clustering algorithms, their methodology, and the factors influencing their performance. Each algorithm is evaluated using real-world datasets, and its pros and cons are specified. The various similarity/dissimilarity measures applied to categorical data and their performance are also discussed. Time complexity describes the amount of time an algorithm takes to perform its elementary operations. The time complexity of the various algorithms is discussed, and their performance on real-world data such as mushroom, zoo, soya bean, cancer, vote, car and iris is measured. In this survey, cluster accuracy and error rate are evaluated for four different clustering algorithms (K-modes, fuzzy K-modes, ROCK and Squeezer), two different similarity measures (DISC and Overlap), and DILCA applied to hierarchical and partitional algorithms.
topic Clustering
Categorical Data
Time Complexity
Similarity Measure
Data Mining Tools
url http://ictactjournals.in/paper/7_Paper_715_722.pdf
work_keys_str_mv AT sanithaelavarasi surveyonclusteringalgorithmandsimilaritymeasureforcategoricaldata
AT jakilandeswari surveyonclusteringalgorithmandsimilaritymeasureforcategoricaldata
_version_ 1724958736882073600