Summary: | Master's === National Sun Yat-sen University === Graduate Institute of Computer Science and Engineering === 92 === Since the World Wide Web (WWW) appeared, more and more useful information has
been available on the WWW. To help find this information, Web mining, the
application of data mining techniques to the WWW, has become an increasingly
important research area. Mining traversal patterns, one of the important
topics in Web mining, focuses on finding the Web page sequences that are
frequently browsed by users. Although the algorithms for mining association rules
(e.g., Apriori and DHP algorithms) can be applied to mine traversal patterns, they
do not exploit the property of Web transactions and therefore generate too many
invalid candidate patterns, which results in poor performance. Wu et al. proposed
an algorithm for mining traversal patterns, SpeedTracer, which utilizes the property
of Web transactions, i.e., the continuous property of the traversal patterns in the Web
structure. Although SpeedTracer decreases the number of candidate patterns generated
in the mining process, it does not efficiently use the property of Web transactions to
reduce the number of checks performed when examining the subsets of each candidate pattern.
In this thesis, we design three algorithms for mining traversal patterns that improve
on the SpeedTracer algorithm. The first algorithm, SpeedTracer*-I, uses the
property of Web transactions to directly generate and count all candidate patterns
from user sessions. Moreover, it uses this property to improve the checking step
when candidate patterns are generated. Next, based on the SpeedTracer*-I algorithm,
we propose the SpeedTracer*-II and SpeedTracer*-III algorithms. These two
algorithms improve the performance of the SpeedTracer*-I algorithm by decreasing
the number of database scans. In the SpeedTracer*-II algorithm,
given a parameter n, we apply the SpeedTracer*-I algorithm to find Ln first, and
use Ln to generate all Ck, where k > n. After generating all candidate patterns, we
scan the database once to count them, from which the frequent patterns
are determined. In the SpeedTracer*-III algorithm, given a parameter n, we also
apply the SpeedTracer*-I algorithm to find Ln first, and directly generate and count
Ck from user sessions based on Ln, where k > n. The simulation results show that
the SpeedTracer*-I algorithm requires less processing time than the SpeedTracer
algorithm. The simulation results also show
that the SpeedTracer*-II and SpeedTracer*-III algorithms outperform the SpeedTracer and
SpeedTracer*-I algorithms, because the former two algorithms require fewer database
scans than the latter two. Moreover, from our simulation results,
we show that all of our proposed algorithms provide better performance than
Apriori-like algorithms (e.g., FS and FDLP algorithms) in terms of processing
time.
|
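
As a rough illustration of the continuous property described in the abstract, the following Python sketch counts contiguous page sequences directly from user sessions, and skips a length-k pattern unless both of its contiguous length-(k-1) sub-patterns (its prefix and its suffix) were frequent at the previous level. It is not the SpeedTracer or SpeedTracer*-I algorithm from the thesis. The function names, the sample sessions, and the choice to count a pattern at most once per session are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Set, Tuple

Pattern = Tuple[str, ...]


def count_contiguous(
    sessions: Sequence[Sequence[str]],
    k: int,
    frequent_prev: Optional[Set[Pattern]] = None,
) -> Counter:
    """Count contiguous length-k page sequences over all sessions.

    Assumption: a pattern contributes at most 1 to its support per session.
    """
    counts: Counter = Counter()
    for session in sessions:
        seen: Set[Pattern] = set()
        for i in range(len(session) - k + 1):
            pattern = tuple(session[i:i + k])
            if frequent_prev is not None and k > 1:
                # A contiguous pattern has only two contiguous (k-1)-length
                # sub-patterns: its prefix and its suffix.
                if pattern[:-1] not in frequent_prev or pattern[1:] not in frequent_prev:
                    continue
            seen.add(pattern)
        counts.update(seen)
    return counts


def frequent_patterns(
    sessions: Sequence[Sequence[str]], min_support: int
) -> Dict[Pattern, int]:
    """Level-wise search: one pass over the sessions per pattern length."""
    result: Dict[Pattern, int] = {}
    frequent_prev: Optional[Set[Pattern]] = None
    k = 1
    while True:
        counts = count_contiguous(sessions, k, frequent_prev)
        level = {p: c for p, c in counts.items() if c >= min_support}
        if not level:
            break
        result.update(level)
        frequent_prev = set(level)
        k += 1
    return result


if __name__ == "__main__":
    # Hypothetical sessions; each list is one user's page sequence.
    sample_sessions = [["A", "B", "C"], ["A", "B", "D"], ["B", "C"]]
    for pattern, support in sorted(frequent_patterns(sample_sessions, 2).items()):
        print(pattern, support)
```

This sketch scans the sessions once per pattern length, which is the kind of repeated scanning that the SpeedTracer*-II and SpeedTracer*-III algorithms are described as reducing. It does not attempt to reproduce how they do so.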