A Method for Entity Resolution in High Dimensional Data Using Ensemble Classifiers

To improve the utilization of high-dimensional data features, an ensemble learning method based on feature selection is developed for entity resolution. Entity resolution is treated as a binary classification problem, and an optimization model is designed to maximize each classifier’s classif...

Bibliographic Details
Main Authors: Liu Yi, Diao Xing-chun, Cao Jian-jun, Zhou Xing, Shang Yu-ling
Format: Article
Language: English
Published: Hindawi Limited 2017-01-01
Series: Mathematical Problems in Engineering
Online Access: http://dx.doi.org/10.1155/2017/4953280
Affiliations:
Liu Yi, Diao Xing-chun, Zhou Xing, Shang Yu-ling: PLA University of Science and Technology, Nanjing, Jiangsu 210007, China
Cao Jian-jun: Nanjing Telecommunication Technology Institute, Nanjing, Jiangsu 210007, China
ISSN: 1024-123X, 1563-5147
Description: To improve the utilization of high-dimensional data features, an ensemble learning method based on feature selection is developed for entity resolution. Entity resolution is treated as a binary classification problem, and an optimization model is designed to maximize each classifier’s classification accuracy and the dissimilarity between classifiers while minimizing the number of selected features. A modified multiobjective ant colony optimization algorithm solves the model for each base classifier: two pheromone matrices are maintained, their values are aggregated by the weighted product method, and the Fisher discriminant rate of each feature of the records’ similarity vectors serves as heuristic information. A solution, called the complementary subset, is selected from the Pareto archive according to the descending order of the three objectives and is used to train the given base classifier. After all base classifiers are trained, their classification outputs are aggregated by max-wins voting to obtain the ensemble’s final result. A simulation experiment on three classical datasets shows that the method is effective and performs better than the two other methods it is compared with.
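
Three building blocks named in the description can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the record does not give the formulas, so the Fisher discriminant rate is assumed to take the common two-class form (squared difference of class means over the sum of class variances), the weighted product is assumed to be tau_a^w * tau_b^(1-w), and the function names `fisher_score`, `weighted_product`, and `max_wins_vote` are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Assumed two-class Fisher ratio for one similarity feature.

    values: similarity scores of record pairs on this feature.
    labels: 1 for matching pairs, 0 for non-matching pairs.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    gap = (mean(pos) - mean(neg)) ** 2
    spread = pvariance(pos) + pvariance(neg)
    if spread == 0:
        # Degenerate case: no within-class variance at all.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def weighted_product(tau_a, tau_b, weight):
    """Assumed aggregation of two pheromone values, with weight in [0, 1]."""
    return (tau_a ** weight) * (tau_b ** (1.0 - weight))


def max_wins_vote(predictions):
    """Aggregate base classifier outputs by taking the most frequent label.

    predictions: one list of labels per base classifier, all the same length.
    With an odd number of binary classifiers no ties can occur.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


if __name__ == "__main__":
    # Toy data: one feature that separates matches from non-matches well.
    print(fisher_score([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
    print(weighted_product(4.0, 9.0, 0.5))
    print(max_wins_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]]))
```

In an ant colony search, a score like `fisher_score` would typically bias which features an ant picks, while the aggregated pheromone value reflects what earlier ants learned; how the paper combines the two is not stated in this record.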