A Method for Entity Resolution in High Dimensional Data Using Ensemble Classifiers
To improve the utilization of high-dimensional data features, an ensemble learning method based on feature selection is developed for entity resolution. Entity resolution is treated as a binary classification problem, and an optimization model is designed to maximize each classifier’s classif...
Main Authors: | Liu Yi, Diao Xing-chun, Cao Jian-jun, Zhou Xing, Shang Yu-ling |
---|---|
Format: | Article |
Language: | English |
Published: | Hindawi Limited, 2017-01-01 |
Series: | Mathematical Problems in Engineering |
Online Access: | http://dx.doi.org/10.1155/2017/4953280 |
id |
doaj-a78dc9e00754447fb7775cde0bd5bac3 |
---|---|
record_format |
Article |
spelling |
doaj-a78dc9e00754447fb7775cde0bd5bac3; 2020-11-25T00:37:38Z; eng; Hindawi Limited; Mathematical Problems in Engineering; 1024-123X; 1563-5147; 2017-01-01; 2017; 10.1155/2017/4953280; 4953280; A Method for Entity Resolution in High Dimensional Data Using Ensemble Classifiers; Liu Yi (PLA University of Science and Technology, Nanjing, Jiangsu 210007, China); Diao Xing-chun (PLA University of Science and Technology, Nanjing, Jiangsu 210007, China); Cao Jian-jun (Nanjing Telecommunication Technology Institute, Nanjing, Jiangsu 210007, China); Zhou Xing (PLA University of Science and Technology, Nanjing, Jiangsu 210007, China); Shang Yu-ling (PLA University of Science and Technology, Nanjing, Jiangsu 210007, China); http://dx.doi.org/10.1155/2017/4953280 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Liu Yi, Diao Xing-chun, Cao Jian-jun, Zhou Xing, Shang Yu-ling |
author_sort |
Liu Yi |
title |
A Method for Entity Resolution in High Dimensional Data Using Ensemble Classifiers |
title_sort |
method for entity resolution in high dimensional data using ensemble classifiers |
publisher |
Hindawi Limited |
series |
Mathematical Problems in Engineering |
issn |
1024-123X 1563-5147 |
publishDate |
2017-01-01 |
description |
To improve the utilization of high-dimensional data features, an ensemble learning method based on feature selection is developed for entity resolution. Entity resolution is treated as a binary classification problem, and an optimization model is designed to maximize each classifier’s classification accuracy and the dissimilarity between classifiers while minimizing the cardinality of the feature subset. A modified multiobjective ant colony optimization algorithm is employed to solve the model for each base classifier: two pheromone matrices are set up, the weighted product method is applied to aggregate the values of the two pheromone matrices, and each feature’s Fisher discriminant rate, computed on the records’ similarity vectors, serves as heuristic information. A solution, called the complementary subset, is selected from the Pareto archive according to the descending order of the three objectives and is used to train the given base classifier. After all base classifiers are trained, their classification outputs are aggregated by the max-wins voting method to obtain the ensemble’s final result. A simulation experiment is carried out on three classical datasets. The results show the effectiveness of the method and better performance than the two other methods it is compared with. |
url |
http://dx.doi.org/10.1155/2017/4953280 |
_version_ |
1725300283848785920 |
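
The abstract names three building blocks without giving formulas: a Fisher discriminant rate per similarity feature, a weighted product that merges two pheromone values, and max-wins voting over base classifiers. The sketch below is one plausible reading of each, not the authors' implementation. The function names, the score form (m1 - m0)^2 / (v1 + v0), the exponent form of the weighted product, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Fisher-style score of one similarity feature (assumed form):
    squared gap between class means divided by the sum of class variances."""
    pos = [v for v, y in zip(values, labels) if y == 1]  # matching record pairs
    neg = [v for v, y in zip(values, labels) if y == 0]  # non-matching record pairs
    gap = (mean(pos) - mean(neg)) ** 2
    spread = pvariance(pos) + pvariance(neg)
    # Assumption: a zero-variance feature scores infinity if the class means
    # differ, and zero otherwise.
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def weighted_product(tau_a, tau_b, weight):
    """Aggregate two pheromone values as tau_a**weight * tau_b**(1 - weight).

    Assumption: a single weight in [0, 1] balances the two matrices."""
    return (tau_a ** weight) * (tau_b ** (1.0 - weight))


def max_wins_vote(predictions):
    """Return the binary label predicted by the most base classifiers.

    Assumption: ties go to label 0 (non-match)."""
    counts = Counter(predictions)
    return 1 if counts[1] > counts[0] else 0


# Toy data: one similarity feature over four record pairs.
similarities = [0.9, 0.8, 0.2, 0.1]
pair_labels = [1, 1, 0, 0]

feature_score = fisher_score(similarities, pair_labels)
merged_pheromone = weighted_product(4.0, 9.0, 0.5)
ensemble_label = max_wins_vote([1, 1, 0])
```

A feature whose similarity values separate matches from non-matches gets a high score and would be favoured as heuristic information. The vote is applied once per record pair after every base classifier has produced its label.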