A Meaningful Candidate Approach to Mining Bi-Directional Traversal Patterns on the WWW

Bibliographic Details
Main Authors: Jiun-rung Chen, 陳俊榮
Other Authors: Ye-in Chang
Format: Others
Language: en_US
Published: 2004
Online Access: http://ndltd.ncl.edu.tw/handle/16845764293848006654
Description
Summary: Master's thesis === National Sun Yat-sen University === Department of Computer Science and Engineering === 92 === Since the World Wide Web (WWW) appeared, more and more useful information has become available on it. To help find this information, one application of data mining techniques to the WWW, referred to as Web mining, has become a research area of increasing importance. Mining traversal patterns is one of the important topics in Web mining. It focuses on finding the Web page sequences that users browse frequently. Although algorithms for mining association rules (e.g., the Apriori and DHP algorithms) can be applied to mine traversal patterns, they do not utilize the property of Web transactions and generate too many invalid candidate patterns, so they do not perform well. Wu et al. proposed SpeedTracer, an algorithm for mining traversal patterns that utilizes the property of Web transactions, i.e., the continuous property of traversal patterns in the Web structure. Although SpeedTracer decreases the number of candidate patterns generated in the mining process, it does not efficiently utilize the property of Web transactions to decrease the number of checks made when checking the subsets of each candidate pattern. In this thesis, we design three algorithms for mining traversal patterns that improve on SpeedTracer. The first algorithm, SpeedTracer*-I, utilizes the property of Web transactions to generate and count all candidate patterns directly from user sessions. It also utilizes this property to improve the checking step performed when candidate patterns are generated. Building on SpeedTracer*-I, we then propose the SpeedTracer*-II and SpeedTracer*-III algorithms, which improve the performance of SpeedTracer*-I by decreasing the number of database scans. In the SpeedTracer*-II algorithm, given a parameter n, we first apply the SpeedTracer*-I algorithm to find Ln, the frequent patterns of length n, and then use Ln to generate all candidate sets Ck, where k > n. After generating all candidate patterns, we scan the database once to count them, and the frequent patterns can then be determined. In the SpeedTracer*-III algorithm, given a parameter n, we also first apply the SpeedTracer*-I algorithm to find Ln, and then generate and count Ck directly from user sessions based on Ln, where k > n. The simulation results show that the SpeedTracer*-I algorithm requires less processing time than the SpeedTracer algorithm. They also show that the SpeedTracer*-II and SpeedTracer*-III algorithms outperform the SpeedTracer and SpeedTracer*-I algorithms, because the former two need fewer database scans than the latter two. Moreover, the simulation results show that all of our proposed algorithms provide better performance than Apriori-like algorithms (e.g., the FS and FDLP algorithms) in terms of processing time.
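The summary does not give the algorithms in detail, so the following Python sketch only illustrates the general idea it describes: candidate patterns are taken directly from user sessions as contiguous page sequences and counted level by level. It is not the thesis's SpeedTracer*-I implementation. The function names, the `min_support` and `max_len` parameters, and the choice to count a pattern at most once per session are all assumptions made for illustration. The check on the prefix and suffix reflects the fact that a contiguous pattern of length k has only two contiguous sub-patterns of length k - 1, which is one plausible reading of how the continuous property can reduce subset checks.

```python
from collections import Counter
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Pattern = Tuple[str, ...]


def contiguous_patterns(session: Sequence[str], k: int) -> Iterator[Pattern]:
    """Yield each distinct contiguous length-k page sequence in one session."""
    seen = set()
    for i in range(len(session) - k + 1):
        pattern = tuple(session[i:i + k])
        if pattern not in seen:  # assumption: count once per session
            seen.add(pattern)
            yield pattern


def mine_contiguous_patterns(
    sessions: List[Sequence[str]], min_support: int, max_len: int
) -> Dict[Pattern, int]:
    """Level-wise mining of frequent contiguous patterns (illustrative only)."""
    frequent: Dict[Pattern, int] = {}
    previous: Optional[Dict[Pattern, int]] = None
    for k in range(1, max_len + 1):
        counts: Counter = Counter()
        for session in sessions:
            for pattern in contiguous_patterns(session, k):
                # A contiguous k-pattern has exactly two contiguous
                # (k-1)-sub-patterns: its prefix and its suffix.
                if previous is None or (
                    pattern[:-1] in previous and pattern[1:] in previous
                ):
                    counts[pattern] += 1
        level = {p: c for p, c in counts.items() if c >= min_support}
        if not level:
            break  # no frequent patterns at this length, so stop
        frequent.update(level)
        previous = level
    return frequent
```

For example, with the hypothetical sessions `[["A", "B", "C"], ["A", "B", "D"], ["B", "C", "A", "B"]]`, `min_support=2`, and `max_len=3`, the sketch reports the single pages A, B, and C and the sequences A→B and B→C as frequent. This version still makes one pass over the sessions per pattern length, which is the cost the summary says SpeedTracer*-II and SpeedTracer*-III reduce.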