Fast Algorithms for Mining Frequent Itemsets


Bibliographic Details
Main Authors: Yu-Chiang Li, 李育強
Other Authors: Chin-Chen Chang
Format: Others
Language: en_US
Published: 2007
Online Access: http://ndltd.ncl.edu.tw/handle/95046846857559927496
id ndltd-TW-095CCU05392017
record_format oai_dc
collection NDLTD
language en_US
format Others
sources NDLTD
description Doctoral dissertation === National Chung Cheng University === Graduate Institute of Computer Science and Information Engineering === 95 === Recent developments in information science have resulted in a surprisingly fast accumulation of data. Accordingly, efficient management of these massive bodies of data, rapid discovery of useful information, and effective data-driven decisions are crucial. Data mining techniques have made routine the once impossible task of gathering hidden (but potentially useful) information. Such techniques have been widely applied in numerous areas and represent an important field of research. Their main task is mining association rules, in particular discovering frequent itemsets. Many studies have established that the pattern-growth method outperforms Apriori-like candidate generation methods. The performance of the pattern-growth method depends on the number of tree nodes. Accordingly, this dissertation presents a new FP-tree structure, the NFP-tree, and develops an efficient approach for mining frequent itemsets, called the NFP-growth approach. The NFP-tree employs two counters in each tree node to reduce the number of tree nodes. Because the header table of the NFP-tree is smaller than that of the FP-tree, the total number of nodes in all conditional trees can be reduced. Itemset share measures the importance of itemsets for mining association rules. The value of the itemset share provides useful information, such as the total profit or the total quantity purchased by customers, associated with an itemset in a database. Mining the share-confidence framework is called share mining. Share-frequent itemsets do not satisfy the downward closure property, and existing algorithms for discovering them are inefficient. Therefore, this dissertation proposes a novel Fast Share Measure (FSM) algorithm to efficiently generate all share-frequent itemsets. Instead of the downward closure property, FSM satisfies the level closure property.
Furthermore, this dissertation develops the Enhanced FSM (EFSM), the Support-counted FSM (SuFSM), and the Share-counted FSM (ShFSM) algorithms to speed up the discovery of all share-frequent itemsets. SuFSM and ShFSM are based on EFSM, which prunes candidates more efficiently than FSM and thereby significantly improves performance. However, SuFSM and ShFSM waste computation time on the join and prune steps of candidate generation in each pass and produce too many useless candidates. Therefore, this dissertation proposes the Direct Candidate Generation (DCG) algorithm, which omits the join and prune steps in each pass to further reduce the running time. Utility mining is a generalized form of the share mining model. Since the Apriori pruning strategy cannot identify high utility itemsets, developing an efficient algorithm is crucial for utility mining. Therefore, this dissertation proposes the Isolated Items Discarding Strategy (IIDS), which can be applied to any existing level-wise utility mining method to reduce the number of candidates and improve performance. The most efficient known methods for share mining are ShFSM and DCG, which also work adequately for utility mining. Applying IIDS to ShFSM and DCG yields FUM and DCG+, respectively. Data mining mechanisms have been widely applied in businesses and manufacturing companies across many industries. Sharing data and mined rules has become a mutually beneficial trend among business partners that increases productivity for all parties involved. However, it can also increase the risk of unexpected information leaks when data are released. To conceal restrictive itemsets, a sanitization process transforms the source database into a released database from which the counterpart cannot extract sensitive rules.
The transformation also conceals some non-restrictive information, an unwanted side effect called the “misses cost.” The problem of optimal sanitization, which conceals all restrictive itemsets while minimizing the misses cost, is NP-hard. To address this challenging problem, this dissertation proposes the Maximum Item Conflict First (MICF) algorithm, which has a low sanitization rate and thereby achieves a low misses cost. All of the proposed methods have been extensively evaluated. Simulation results reveal that the NFP-growth algorithm is superior to the FP-growth algorithm on dense datasets and real datasets. In the share mining experiments, the FSM algorithm outperforms the ZSP algorithm by two to three orders of magnitude when the minimum share threshold is between 0.2% and 2%. EFSM, SuFSM, ShFSM, and DCG perform significantly better than ZSP and FSM, and DCG performs best among these four algorithms on all three experimental datasets. For utility mining, on both synthetic and real datasets, experimental results reveal that FUM and DCG+ are more efficient than ShFSM and DCG, respectively; therefore, IIDS is an effective strategy for utility mining. For hiding sensitive patterns, experimental results demonstrate that MICF is effective, has a low sanitization rate, and generally achieves a significantly lower misses cost than the MinFIA, MaxFIA, IGA, and Algo2b methods on several real and artificial datasets.
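
The abstract's remark that share-frequent itemsets lack the downward closure property can be illustrated with a small sketch. It assumes the usual formulation of the share measure, which the record itself does not spell out: the share of an itemset is the total quantity of its items in the transactions containing the whole itemset, divided by the total quantity in the database. The toy database, the function names, and the exhaustive enumeration are all hypothetical; this is not the FSM algorithm or any other method from the dissertation.

```python
from itertools import combinations

# Hypothetical toy database: each transaction maps item -> purchased quantity.
transactions = [
    {"A": 1, "B": 5},
    {"A": 1, "B": 5},
    {"C": 8},
]


def total_measure_value(db):
    """Sum of all item quantities over the whole database."""
    return sum(sum(t.values()) for t in db)


def itemset_share(itemset, db):
    """Share of an itemset (assumed definition): quantity of its items in
    the transactions that contain every item of the itemset, divided by
    the total quantity in the database."""
    items = frozenset(itemset)
    local = sum(
        sum(t[i] for i in items)
        for t in db
        if items.issubset(t)  # iterating a dict yields its keys
    )
    return local / total_measure_value(db)


def share_frequent_itemsets(db, min_share):
    """Brute-force enumeration of all itemsets whose share >= min_share.
    Exponential in the number of items; for illustration only."""
    all_items = sorted({i for t in db for i in t})
    result = {}
    for k in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, k):
            share = itemset_share(candidate, db)
            if share >= min_share:
                result[candidate] = share
    return result


frequent = share_frequent_itemsets(transactions, min_share=0.45)
for itemset, share in frequent.items():
    print(itemset, share)
```

With a minimum share of 0.45, the itemset ("A", "B") qualifies while its subset ("A",) does not, so discarding every superset of an infrequent itemset, as Apriori does for support, would lose valid results here.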
author2 Chin-Chen Chang
author_facet Chin-Chen Chang
Yu-Chiang Li
李育強
author Yu-Chiang Li
李育強
author_sort Yu-Chiang Li
title Fast Algorithms for Mining Frequent Itemsets
publishDate 2007
url http://ndltd.ncl.edu.tw/handle/95046846857559927496
spelling ndltd-TW-095CCU05392017 2015-10-13T14:08:36Z http://ndltd.ncl.edu.tw/handle/95046846857559927496 Fast Algorithms for Mining Frequent Itemsets 探勘頻繁項目集合之快速演算法研究 Yu-Chiang Li 李育強 Doctoral dissertation National Chung Cheng University Graduate Institute of Computer Science and Information Engineering 95 Chin-Chen Chang 張真誠 2007 thesis 143 en_US