Set-similarity joins using MapReduce

Master's === Hsuan Chuang University === Master's Program, Department of Information Management === 106 === Data querying and large-scale data analysis are now ubiquitous, and both depend on techniques for comparing data, which this study addresses. We take as our groundwork an algorithm from prior work on set-similarity joins in the MapReduce framework, which is nam...

Full description

Bibliographic Details
Main Authors: WU, DONG-YUAN, 吳東原
Other Authors: Tsai, Yao-Hong
Format: Others
Language: zh-TW
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/4g35s3
id ndltd-TW-106HCU00396005
record_format oai_dc
spelling ndltd-TW-106HCU00396005 2019-05-16T00:15:13Z http://ndltd.ncl.edu.tw/handle/4g35s3 Set-similarity joins using MapReduce 基於MapReduce之快速資料相似度比對法 WU, DONG-YUAN 吳東原 Master's; Hsuan Chuang University; Master's Program, Department of Information Management; 106 Tsai, Yao-Hong 蔡耀弘 2018 Degree thesis ; thesis 66 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === Hsuan Chuang University === Master's Program, Department of Information Management === 106 === Data querying and large-scale data analysis are now ubiquitous, and both depend on techniques for comparing data, which this study addresses. We take as our groundwork an algorithm from prior work on set-similarity joins in the MapReduce framework, referred to here as the RF comparing algorithm. We modify the RF comparing algorithm to address a defect and develop a new, more efficient algorithm called the Prefix Accumulating algorithm. Our solution uses the MapReduce framework to identify similarities between data sets and outputs a table of those similarities. The algorithm has two phases. In the first MapReduce process, we use Prefix Filtering to select, from a large volume of data, the records that could possibly match each other, then group identical candidate pairs and accumulate their common elements. In the second MapReduce process, we verify the latter half of the data for each candidate pair, then combine the union and intersection of the data to calculate similarities. In our experiments, the Prefix Accumulating algorithm ran faster than the RF comparing algorithm. We conclude that the advantage of the Prefix Accumulating algorithm is that it does not need to compare the complete data again after prefix filtering. Its disadvantage is that the more partitions the data is split into, the higher the cost of integrating the data.
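The two-phase idea in the description (filter by prefix, then verify candidate pairs) can be illustrated with a small in-memory sketch. This is not the thesis's Prefix Accumulating algorithm or the RF comparing algorithm, whose details the record does not give. It assumes Jaccard similarity, the standard prefix-length bound for Jaccard, and a rarest-first token ordering; the names `prefix_length` and `similarity_join` are hypothetical, and plain dictionaries stand in for the MapReduce shuffle.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def prefix_length(size, threshold):
    """Prefix length for Jaccard (assumed measure): two sets whose similarity
    is >= threshold must share a token within this many leading tokens."""
    # The small epsilon guards against float noise such as 0.7 * 10 > 7.
    return size - math.ceil(threshold * size - 1e-9) + 1


def similarity_join(records, threshold):
    """Return {(id_a, id_b): jaccard} for pairs with jaccard >= threshold."""
    # Global token order: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for toks in records.values() for tok in set(toks))
    ordered = {
        rid: sorted(set(toks), key=lambda t: (freq[t], t))
        for rid, toks in records.items()
    }

    # Phase 1, "map" step: emit (prefix token, record id) for each record.
    by_token = defaultdict(list)
    for rid, toks in ordered.items():
        for tok in toks[:prefix_length(len(toks), threshold)]:
            by_token[tok].append(rid)

    # Phase 1, "reduce" step: records sharing a prefix token form candidates.
    candidates = set()
    for rids in by_token.values():
        candidates.update(combinations(sorted(rids), 2))

    # Phase 2: verify each candidate pair using intersection and union sizes.
    results = {}
    for a, b in sorted(candidates):
        set_a, set_b = set(ordered[a]), set(ordered[b])
        sim = len(set_a & set_b) / len(set_a | set_b)
        if sim >= threshold:
            results[(a, b)] = sim
    return results
```

In this sketch the verification step recomputes the full intersection for every candidate pair. The description suggests the thesis avoids that by carrying the count of common prefix elements forward from the first phase, which the sketch does not attempt to reproduce.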
author2 Tsai, Yao-Hong
author_facet Tsai, Yao-Hong
WU, DONG-YUAN
吳東原
author WU, DONG-YUAN
吳東原
spellingShingle WU, DONG-YUAN
吳東原
Set-similarity joins using MapReduce
author_sort WU, DONG-YUAN
title Set-similarity joins using MapReduce
publishDate 2018
url http://ndltd.ncl.edu.tw/handle/4g35s3