SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA
Learning is the process of generating useful information from a huge volume of data. Learning can be either supervised (e.g. classification) or unsupervised (e.g. clustering). Clustering is the process of grouping a set of physical objects into classes of similar objects. Objects in...
Main Authors: | S. Anitha Elavarasi, J. Akilandeswari |
---|---|
Format: | Article |
Language: | English |
Published: | ICT Academy of Tamil Nadu, 2014-01-01 |
Series: | ICTACT Journal on Soft Computing |
Subjects: | Clustering; Categorical Data; Time Complexity; Similarity Measure; Data Mining Tools |
Online Access: | http://ictactjournals.in/paper/7_Paper_715_722.pdf |
id |
doaj-b605bf758868465cb63b9d604b475920 |
---|---|
record_format |
Article |
spelling |
doaj-b605bf758868465cb63b9d604b475920; 2020-11-25T02:01:06Z; eng; ICT Academy of Tamil Nadu; ICTACT Journal on Soft Computing; 0976-6561; 2229-6956; 2014-01-01; 4; 2; 715; 722; SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA; S. Anitha Elavarasi (Department of Computer Science and Engineering, Sona College of Technology, India); J. Akilandeswari (Department of Information Technology, Sona College of Technology, India); http://ictactjournals.in/paper/7_Paper_715_722.pdf; Clustering; Categorical Data; Time Complexity; Similarity Measure; Data Mining Tools |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
S. Anitha Elavarasi; J. Akilandeswari |
author_sort |
S. Anitha Elavarasi |
title |
SURVEY ON CLUSTERING ALGORITHM AND SIMILARITY MEASURE FOR CATEGORICAL DATA |
publisher |
ICT Academy of Tamil Nadu |
series |
ICTACT Journal on Soft Computing |
issn |
0976-6561; 2229-6956 |
publishDate |
2014-01-01 |
description |
Learning is the process of generating useful information from a huge volume of data. Learning can be either supervised (e.g. classification) or unsupervised (e.g. clustering). Clustering is the process of grouping a set of physical objects into classes of similar objects. Real-world objects consist of both numerical and categorical data. Categorical data cannot be analyzed in the same way as numerical data because categorical values have no inherent ordering. This paper describes ten different clustering algorithms, their methodologies, and the factors influencing their performance. Each algorithm is evaluated using real-world datasets, and its pros and cons are specified. The various similarity and dissimilarity measures applied to categorical data, and their performance, are also discussed. Time complexity describes the amount of time an algorithm takes to perform its elementary operations. The time complexity of the various algorithms is discussed, and their performance is measured on real-world datasets such as mushroom, zoo, soybean, cancer, vote, car and iris. In this survey, cluster accuracy and error rate are evaluated for four clustering algorithms (K-modes, fuzzy K-modes, ROCK and Squeezer), two similarity measures (DISC and Overlap), and DILCA applied to hierarchical and partitional algorithms. |
topic |
Clustering; Categorical Data; Time Complexity; Similarity Measure; Data Mining Tools |
url |
http://ictactjournals.in/paper/7_Paper_715_722.pdf |