Set-similarity joins using MapReduce

Master's === Hsuan Chuang University === Master's Program, Department of Information Management === 106 === Data querying and large-scale data analysis are now ubiquitous, and both depend on techniques for comparing data, which this study addresses. We adopted an algorithm from a paper on set-similarity joins in the MapReduce framework, which is nam...


Bibliographic Details
Main Authors:	WU, DONG-YUAN, 吳東原
Other Authors:	Tsai, Yao-Hong
Format:	Others
Language:	zh-TW
Published:	2018
Online Access:	http://ndltd.ncl.edu.tw/handle/4g35s3

id	ndltd-TW-106HCU00396005
record_format	oai_dc
spelling	ndltd-TW-106HCU00396005 2019-05-16T00:15:13Z http://ndltd.ncl.edu.tw/handle/4g35s3 Set-similarity joins using MapReduce 基於MapReduce之快速資料相似度比對法 WU, DONG-YUAN 吳東原 Master's, Hsuan Chuang University, Master's Program, Department of Information Management, 106 Tsai, Yao-Hong 蔡耀弘 2018 thesis 66 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === Hsuan Chuang University === Master's Program, Department of Information Management === 106 === Data querying and large-scale data analysis are now ubiquitous, and both depend on techniques for comparing data, which this study addresses. As our groundwork we adopted the RF comparing algorithm, taken from a paper on set-similarity joins in the MapReduce framework. We modified the RF comparing algorithm to address a defect and developed a new, more efficient algorithm named the Prefix Accumulating algorithm. Our solution uses the MapReduce framework to identify similarities between data sets and outputs a table of those similarities. The algorithm has two phases. In the first MapReduce process, we use Prefix Filtering to pick out, from a large amount of data, the records that could possibly match each other, then group identical candidate pairs to accumulate their common elements. In the second MapReduce process, we verify the latter half of the data for each candidate pair, then combine the union and intersection of the data to calculate the similarities. In our experiments, the Prefix Accumulating algorithm ran faster than the RF comparing algorithm. We conclude that the advantage of the Prefix Accumulating algorithm is that it does not need to compare the complete data again after prefix filtering. Its disadvantage is that the more partitions the data is split into, the higher the cost of integrating the data.
author2	Tsai, Yao-Hong
author_facet	Tsai, Yao-Hong WU, DONG-YUAN 吳東原
author	WU, DONG-YUAN 吳東原
spellingShingle	WU, DONG-YUAN 吳東原 Set-similarity joins using MapReduce
author_sort	WU, DONG-YUAN
title	Set-similarity joins using MapReduce
title_short	Set-similarity joins using MapReduce
title_full	Set-similarity joins using MapReduce
title_fullStr	Set-similarity joins using MapReduce
title_full_unstemmed	Set-similarity joins using MapReduce
title_sort	set-similarity joins using mapreduce
publishDate	2018
url	http://ndltd.ncl.edu.tw/handle/4g35s3
work_keys_str_mv	AT wudongyuan setsimilarityjoinsusingmapreduce AT wúdōngyuán setsimilarityjoinsusingmapreduce AT wudongyuan jīyúmapreducezhīkuàisùzīliàoxiāngshìdùbǐduìfǎ AT wúdōngyuán jīyúmapreducezhīkuàisùzīliàoxiāngshìdùbǐduìfǎ
_version_	1719162298483867648
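
The prefix-filtering step mentioned in the description can be illustrated with a minimal sketch. This is not the thesis's Prefix Accumulating algorithm or the RF comparing algorithm; it only shows the general, well-known prefix-filtering principle in a single Python process that imitates the map, shuffle, and reduce steps. Jaccard similarity is an assumption here (the abstract mentions only union and intersection), and the names `prefix_length` and `similarity_join` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def prefix_length(size, threshold):
    # Standard prefix-filter bound for Jaccard: |x| - ceil(t * |x|) + 1.
    # The small epsilon guards against float products such as 7.000000000000001.
    return size - math.ceil(threshold * size - 1e-9) + 1


def similarity_join(records, threshold):
    """records: dict of record id -> set of tokens.
    Returns {(id_a, id_b): jaccard} for pairs with jaccard >= threshold."""
    # Global token order (assumed): rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for toks in records.values() for tok in toks)
    ordered = {
        rid: sorted(toks, key=lambda tok: (freq[tok], tok))
        for rid, toks in records.items()
    }

    # "Map" + "shuffle": emit (prefix token, record id) and group by token.
    groups = defaultdict(list)
    for rid, toks in ordered.items():
        for tok in toks[:prefix_length(len(toks), threshold)]:
            groups[tok].append(rid)

    # "Reduce": records that share a prefix token become candidate pairs.
    candidates = set()
    for rids in groups.values():
        candidates.update(combinations(sorted(rids), 2))

    # Verification: compute the exact Jaccard similarity of each candidate pair.
    result = {}
    for a, b in candidates:
        inter = len(records[a] & records[b])
        union = len(records[a] | records[b])
        if union and inter / union >= threshold:
            result[(a, b)] = inter / union
    return result
```

Only pairs that share at least one prefix token are verified, so most dissimilar pairs are never compared in full. This sketch verifies candidates against the complete sets, which is the extra comparison the abstract says the Prefix Accumulating algorithm avoids.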


Similar Items