Incremental Sequential Pattern Mining on Spark



Bibliographic Details
Main Authors: Kuo-Hung Hsu, 許國宏
Other Authors: Shiow-Yang Wu
Format: Others
Language: zh-TW
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/r3udqf
Description
Summary: Master's thesis === National Dong Hwa University === Department of Computer Science and Information Engineering === 106 (ROC academic year) === Sequential pattern mining has been studied in data mining research for years. Its purpose is to discover sequential patterns in large datasets. Since datasets have been growing ever faster in recent years, distributed computing architectures for handling large datasets are becoming more and more important. MapReduce is a computing model that greatly simplifies distributed computing tasks, and Hadoop is a widely used distributed computing platform based on the MapReduce framework. Our laboratory previously developed algorithms for static sequential pattern mining on Hadoop MapReduce. As database updates become more frequent day by day, static sequential pattern mining can no longer meet the challenge, and a dynamic sequential pattern mining method is in urgent demand. We therefore set out to develop an incremental sequential pattern mining algorithm for analyzing large dynamic datasets.

On the other hand, Hadoop distributed computing depends heavily on disk I/O, so programs running on top of it cannot handle frequent database updates gracefully. Much research has been conducted to address this deficiency; Spark is one result. It is similar to MapReduce but employs in-memory computing to reduce disk I/O, and it is reported that Spark runs at least 10 times faster than Hadoop MapReduce. We therefore implement our algorithm on Spark.

Our incremental sequential pattern mining algorithm splits the data for the last and current mining periods into three segments: obsolete data, counted data, and new data. The difference between them is whether the data is still valid in the current mining period. The obsolete and counted segments were already mined in the last period, while the new segment is yet to be processed. The current mining result is derived from the last mining result by removing the pattern counts contributed by the obsolete segment and adding the counts from the new segment. The idea is to save the pattern-computing workload of the obsolete and counted segments.
The larger the counted proportion, the more saving can be achieved, which leads to better performance. To verify the performance of our incremental sequential pattern mining algorithm on Spark, we implement it on a Spark cluster and conduct extensive experiments, using the IBM Quest Synthetic Data Generator to generate large datasets. Experiment results show that our algorithm achieves better performance than the traditional method in most respects.
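The segment-based update described in the summary can be illustrated with a minimal sketch in plain Python (Spark-free for brevity; the candidate pattern set, sequence data, and function names are illustrative assumptions, not the thesis implementation). A sequence supports a pattern if the pattern occurs as a subsequence; the current period's supports are derived from the previous result by subtracting the obsolete segment's contribution and adding the new segment's, so the counted segment is never re-scanned:

```python
from collections import Counter

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence as a (not necessarily contiguous) subsequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def pattern_supports(sequences, patterns):
    """Support count of each candidate pattern over a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for p in patterns:
            if is_subsequence(p, seq):
                counts[p] += 1
    return counts

def incremental_update(prev_counts, obsolete, new, patterns):
    """Derive the current period's supports from the previous result:
    subtract supports contributed by obsolete sequences, add supports
    from new sequences; the counted segment is never re-scanned."""
    counts = Counter(prev_counts)
    counts.subtract(pattern_supports(obsolete, patterns))  # drop obsolete contribution
    counts.update(pattern_supports(new, patterns))         # add new contribution
    return +counts  # unary + discards patterns whose support fell to zero
```

Under this sketch, only the obsolete and new segments are scanned each period; the larger the shared (counted) segment between two periods, the less work the update performs, which mirrors the savings claimed above.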