A Load-balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys

Master's === Feng Chia University === Department of Information Engineering and Computer Science === 101 === Entity resolution (ER) is a long-standing challenge in database management research. To detect records that refer to the same entity across different data sources, an ER solution has to perform similarity computations for all pairs of entities in the dataset. Mos...

Bibliographic Details
Main Authors: Yi-Chun Chiu, 邱羿郡
Other Authors: Ming-Yen Lin
Format: Others
Language: en_US
Published: 2013
Online Access: http://ndltd.ncl.edu.tw/handle/08538221817158453241
id ndltd-TW-101FCU05392071
record_format oai_dc
spelling ndltd-TW-101FCU05392071 2015-10-13T22:57:03Z http://ndltd.ncl.edu.tw/handle/08538221817158453241 A Load-balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys 區塊式多鍵值實體解析之MapReduce負載平衡演算法 Yi-Chun Chiu 邱羿郡 Master's Feng Chia University Department of Information Engineering and Computer Science 101 Ming-Yen Lin 林明言 2013 thesis 65 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === Feng Chia University === Department of Information Engineering and Computer Science === 101 === Entity resolution (ER) is a long-standing challenge in database management research. To detect records that refer to the same entity across different data sources, an ER solution has to perform similarity computations for all pairs of entities in the dataset. Most studies on blocking-based ER assume that each entity is associated with a single blocking key. By grouping entities that share the same blocking key, the number of comparisons in similarity computations can be reduced. In some applications, however, an entity may have multiple blocking keys. When entities have several blocking keys, ER can be more efficient, since two entities can form a similar pair only if they share several common keys. With the rapid growth of data sizes, a blocking-based ER algorithm built on the MapReduce framework is needed to handle the sheer volume of today's data collections. In this thesis, we therefore propose a MapReduce algorithm that solves the ER problem for a huge collection of entities with multiple keys. The algorithm is characterized by combination-based blocking and load-balanced matching. Combination-based blocking uses the multiple keys to filter out unnecessary entity pairs before matching. Load-balanced matching evenly distributes the required similarity computations across all reducers in the matching step. The load-balanced design removes the bottleneck caused by skewed matching computations on a single node in a MapReduce framework. Our experiments on 1.4 million publication records show that the proposed algorithm is efficient and scalable.
author2 Ming-Yen Lin
author_facet Ming-Yen Lin
Yi-Chun Chiu
邱羿郡
author Yi-Chun Chiu
邱羿郡
spellingShingle Yi-Chun Chiu
邱羿郡
A Load-balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys
author_sort Yi-Chun Chiu
title A Load-balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys
title_short A Load-balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys
title_full A Load-balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys
title_fullStr A Load-balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys
title_full_unstemmed A Load-balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys
title_sort load-balanced mapreduce algorithm for blocking-based entity-resolution with multiple keys
publishDate 2013
url http://ndltd.ncl.edu.tw/handle/08538221817158453241
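
The abstract in this record names two techniques, combination-based blocking and load-balanced matching, but does not give their details. The Python sketch below is only one plausible reading and is not the thesis's implementation. It assumes that a blocking key is formed from every combination of `min_shared` keys of an entity, so two entities meet in a block only if they share at least that many keys, and that candidate pairs are spread over reducers round-robin. The function names, the `min_shared` parameter, and the sample records are all hypothetical, and the map and reduce phases are simulated in a single process rather than on a MapReduce cluster.

```python
from collections import defaultdict
from itertools import combinations


def combination_blocks(entities, min_shared):
    """Map-step sketch: file each entity under every size-`min_shared`
    combination of its keys (an assumed reading of combination-based blocking)."""
    blocks = defaultdict(list)
    for entity_id, keys in entities.items():
        for combo in combinations(sorted(set(keys)), min_shared):
            blocks[combo].append(entity_id)
    return blocks


def candidate_pairs(blocks):
    """Collect each unordered pair once, even if it co-occurs in several blocks."""
    pairs = set()
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return sorted(pairs)


def assign_to_reducers(pairs, num_reducers):
    """Round-robin assignment: reducer loads differ by at most one comparison."""
    loads = [[] for _ in range(num_reducers)]
    for index, pair in enumerate(pairs):
        loads[index % num_reducers].append(pair)
    return loads


if __name__ == "__main__":
    # Hypothetical publication records, each with three blocking keys.
    entities = {
        "r1": {"lin", "mapreduce", "2013"},
        "r2": {"lin", "mapreduce", "2012"},
        "r3": {"lin", "blocking", "2013"},
        "r4": {"chen", "hadoop", "2010"},
    }
    blocks = combination_blocks(entities, min_shared=2)
    pairs = candidate_pairs(blocks)
    for reducer_id, load in enumerate(assign_to_reducers(pairs, num_reducers=2)):
        print(reducer_id, load)
```

In this toy input, r2 and r3 share only one key, so they are never compared, while r1 is paired with both. Assigning individual pairs rather than whole blocks to reducers is what keeps one oversized block from overloading a single node in this sketch.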