Maximize Data Processing Throughput on Cloud via Exploiting Data Locality and Minimizing Data Transfer Delay


Bibliographic Details
Main Authors: Wei, Wei Che, 魏偉哲
Other Authors: Chou, Chi Yuan
Format: Others
Language: en_US
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/25286565634431082800
Description
Summary: Master's === National Tsing Hua University === Department of Computer Science === 104 === In recent years, data has been growing at a rapid pace. With this trend, big data has become a significant field of knowledge, and the need for large-scale storage and computing clusters has grown as well. Because not every user has enough funds to maintain a large number of machines, more and more companies, such as Amazon and Microsoft, have begun to build cloud platforms that offer many services on demand. However, these cloud providers usually keep the different kinds of services independent of one another in order to price each service individually. For example, they provide a storage service, a virtual machine service, and a simple cluster service, all separate. In a typical use case, a user needs to store data in a highly reliable and scalable storage system and build a computing cluster on top of it to analyze that data. Using the storage service and the computing cluster service separately is inconvenient in such a situation. Therefore, we develop a service that integrates these two kinds of services, and we propose a data pipeline scheduling service for this scenario that handles multiple jobs on Amazon Web Services. Besides providing a simple way to use Elastic MapReduce, the computing cluster service provided by Amazon, our service also achieves a clear performance improvement over the basic use case proposed by Amazon.
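The abstract does not detail the scheduling algorithm, but the core idea it names, exploiting data locality to avoid transfer delay, can be illustrated with a minimal, hypothetical sketch: each job reads one input block, and the scheduler places a job on a node that already holds a replica of its block, falling back to a network transfer only when no replica exists. All names here (`schedule`, `node-0`, the block/node identifiers) are illustrative assumptions, not part of the thesis.

```python
def schedule(jobs, block_locations):
    """Assign each job to a node, preferring data-local placement.

    jobs: dict mapping job id -> id of its input data block
    block_locations: dict mapping block id -> set of node names
        that hold a replica of that block
    Returns: dict mapping job id -> (node, transferred), where
        transferred is True when the job must fetch its block
        over the network (a data transfer delay).
    """
    assignment = {}
    for job, block in jobs.items():
        replicas = block_locations.get(block, set())
        if replicas:
            # Data-local: run on a node that already holds the block
            # (sorted only to make the choice deterministic here).
            node = sorted(replicas)[0]
            assignment[job] = (node, False)
        else:
            # No local replica: fall back to a default node and
            # pay the transfer cost. "node-0" is a placeholder.
            assignment[job] = ("node-0", True)
    return assignment

jobs = {"j1": "b1", "j2": "b2", "j3": "b3"}
locations = {"b1": {"node-1"}, "b2": {"node-2", "node-3"}}
print(schedule(jobs, locations))
```

In this toy run, jobs j1 and j2 are placed data-locally while j3, whose block has no replica, incurs a transfer; a real scheduler in the integrated service would also account for node load and job ordering, which this sketch omits.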