A Cross-Platform Environment for Big Data Analytics Using RHadoop

Bibliographic Details
Main Authors: Ho, Li-Fung, 何立峰
Other Authors: Wang, Su-Hua
Format: Others
Language: zh-TW
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/85246435709374365839
Description
Summary: Master's === Chung Hua University === Department of Information Management === 104 === The quantity of data has increased dramatically with advances in information technology, and traditional data analysis methods are no longer sufficient to handle it. This is why big data analysis has become one of the hottest fields. However, most commercial software is expensive, so people turn to open-source software for big data analysis. Open-source programs, in turn, demand considerable skill and do not offer all the functions users need, so users must have professional expertise to use them effectively. R is one of the most commonly used open-source languages for data analysis. It satisfies the analysis needs of most fields of study, but it has two major limitations. First, users need an advanced level of programming ability to work with R, which makes it difficult for those who are inexperienced in programming. Second, R computes only as fast as the machine on which it is installed, which is not enough to meet the real-time demands of big data. Apache Hadoop, a distributed computing platform, provides a well-suited operating environment, which makes it the top open-source choice for big data. However, users still face two obstacles. First, Hadoop provides only limited machine learning and analysis capability, which is not enough for typical analysis needs. Second, it runs only in a Linux environment, which often deters those who are not familiar with Linux. Given these inherent restrictions of R and Hadoop for big data, this study combined R and Hadoop on a Linux system using RHadoop so that the two complement each other. SSH, a cross-platform communication technique, was introduced on the Windows side to develop a cross-platform environment framework for big data analysis. 
This removed the difficulty of working with Linux while allowing users to build R-based analysis scripts simply by choosing options. SSH then automatically invoked R, combined with Hadoop on the Linux side, to run the analysis and returned the results to users, removing the requirement that users also be programmers. Finally, analysis method management functions were provided to make the framework's set of analysis methods extensible. To validate the feasibility of the cross-platform big data analysis environment framework, virtual machines were set up to test it. The test results suggested that the framework could build analysis scripts on the Windows side and perform the analysis on the Linux side through SSH, while the method management functions allowed the analysis methods under the framework to be managed and extended. This cross-platform environment can provide a complete setting for big data analysis and lower the technical barriers for analysts. It should help those who would like to conduct big data analysis, but who have difficulty programming or have a limited budget, to catch the big data wave.
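
The workflow described in the summary (choose options, generate an R script, run it on the Linux side over SSH, and manage the available methods) might be sketched roughly as follows. This is a minimal illustration in Python, not the thesis's implementation: the method registry, the function names, the `group_sum` template, and the host, user, and path values are all hypothetical. The R template assumes the `rmr2` package from the RHadoop project is installed on the Linux machine.

```python
import shlex
from string import Template

# Hypothetical method registry: method name -> R script template.
# The rmr2 calls in the template (to.dfs, mapreduce, keyval, from.dfs)
# are illustrative only and are not taken from the thesis.
METHODS = {
    "group_sum": Template(
        "library(rmr2)\n"
        "input <- to.dfs(1:$n)\n"
        "out <- mapreduce(\n"
        "  input = input,\n"
        "  map = function(k, v) keyval(v %% $groups, v),\n"
        "  reduce = function(k, vv) keyval(k, sum(vv)))\n"
        "print(from.dfs(out))\n"
    ),
}


def register_method(name, r_template):
    """Add a new analysis method (a simple form of method management)."""
    METHODS[name] = Template(r_template)


def build_script(method, **options):
    """Fill the chosen method's template with the user's options."""
    if method not in METHODS:
        raise KeyError(f"unknown analysis method: {method}")
    return METHODS[method].substitute(options)


def build_remote_commands(user, host, local_script, remote_script):
    """Return the scp and ssh argument lists; nothing is executed here."""
    target = f"{user}@{host}"
    copy_cmd = ["scp", local_script, f"{target}:{remote_script}"]
    run_cmd = ["ssh", target, "Rscript " + shlex.quote(remote_script)]
    return copy_cmd, run_cmd


# Example: the user picks a method and options instead of writing R code.
script = build_script("group_sum", n=100, groups=3)
copy_cmd, run_cmd = build_remote_commands(
    "analyst", "hadoop-master", "analysis.R", "/tmp/analysis.R"
)
```

The two argument lists could then be passed to `subprocess.run(..., capture_output=True, text=True)` so that the output printed by R on the Linux side is collected and shown to the user on the Windows side.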