Performance Improvement for Big Data Iterative Computing-The Case Study of Spark Program

Bibliographic Details
Main Authors: CHIU, TSE-KAI, 邱則凱
Other Authors: WEN, YEAN-FU
Format: Others
Language: zh-TW
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/993w36
Description
Summary: Master's thesis === National Taipei University === Graduate Institute of Information Management === 104 === The Big Data era has brought many big data analysis tools. Spark, whose in-memory processing suits iterative and interactive data mining, is among the most popular of these tools, and its data processing performance is better than Hadoop's. However, Spark has some disadvantages: large datasets cause cross-node data transfer, which increases computation time. If Spark executes fewer Shuffle operations, its performance improves. This study modifies program code to enhance the performance of iterative applications. It conducts three empirical studies with diverse datasets and iteration counts to find the most suitable code modifications. The finding is that, where a program uses several RDD Shuffle operations, a strategy of downloading a smaller RDD can replace those Shuffle operations. The simulation results show that execution time improves by up to 30%.
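The record does not include the thesis's code, so the following is only an illustrative sketch of the general idea in the summary: joining a large dataset with a small one without shuffling the large one. It is a plain-Python model, not Spark code. The function names (`shuffle_join`, `map_side_join`), the list-of-lists partition layout, and the record counters are all assumptions made for illustration. Integer keys are used so that `hash` is deterministic.

```python
from collections import defaultdict


def shuffle_join(large_parts, small_parts, num_partitions):
    """Model a shuffle join: both datasets are re-bucketed by key hash.

    Returns the joined records and the number of records whose target
    bucket differs from the partition they started in.
    """
    buckets = [([], []) for _ in range(num_partitions)]
    moved = 0
    for side, parts in enumerate((large_parts, small_parts)):
        for src, part in enumerate(parts):
            for key, value in part:
                dst = hash(key) % num_partitions
                buckets[dst][side].append((key, value))
                if dst != src:
                    moved += 1
    joined = []
    for left, right in buckets:
        lookup = defaultdict(list)
        for key, value in right:
            lookup[key].append(value)
        for key, value in left:
            for other in lookup.get(key, []):
                joined.append((key, (value, other)))
    return joined, moved


def map_side_join(large_parts, small_parts):
    """Model the alternative: copy the small dataset to every partition.

    The large dataset stays where it is; only the small dataset is sent,
    once per partition of the large dataset (hypothetical cost model).
    """
    lookup = defaultdict(list)
    for part in small_parts:
        for key, value in part:
            lookup[key].append(value)
    copied = sum(len(part) for part in small_parts) * len(large_parts)
    joined = []
    for part in large_parts:
        for key, value in part:
            for other in lookup.get(key, []):
                joined.append((key, (value, other)))
    return joined, copied


if __name__ == "__main__":
    # Hypothetical data: 1000 records in 4 partitions, 10 distinct keys.
    large = [[(i % 10, i) for i in range(p * 250, (p + 1) * 250)] for p in range(4)]
    small = [[(k, f"label{k}") for k in range(10)]]
    by_shuffle, moved = shuffle_join(large, small, 4)
    by_copy, copied = map_side_join(large, small)
    print(sorted(by_shuffle) == sorted(by_copy), moved, copied)
```

Both functions produce the same joined records; they differ only in how many records are counted as transferred. The model ignores memory limits, so it applies only when the smaller dataset fits on each worker.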