Using the GAAC Clustering Method to Improve the Ranking of Chinese Retrieval Systems

Bibliographic Details
Main Authors: Chien-Jung Chou, 周建榮
Other Authors: 魏世杰
Format: Others
Language: zh-TW
Published: 2009
Online Access: http://ndltd.ncl.edu.tw/handle/04425669163524488237
Description
Summary: Master's thesis === Tamkang University === Master's Program, Department of Information Management === 97 === Traditional desktop search tools such as Google Desktop Search or a TFIDF vector space retrieval system return a document ranking that users must still spend time filtering to find the documents they want. To improve the ranking, this work proposes a two-stage clustering scheme. Based on the returned snippets, the first stage divides the documents into two groups: those containing all keywords in the query, and those containing only some or none of them. The first group is ranked ahead of the second. In the second stage, Group-Average Agglomerative Clustering (GAAC) is applied to the first group to form hierarchical clusters, each with a combination similarity above a given threshold. Based on the GAAC result, non-singleton clusters are ordered from high to low by their last combination similarity. Within each cluster, the two subclusters joined by the last merge are likewise ordered from high to low by their own last combination similarity. Singleton clusters, which have a combination similarity of 0, are placed after the non-singleton clusters in their initial snippet order. The test dataset is a standard Chinese news collection of 49210 documents and 42 query topics. An original document ranking is obtained from Google Desktop Search and from a TFIDF vector space retrieval system, respectively. The snippets are then tokenized and filtered to extract representative keywords and form snippet vectors, and passed through the two-stage clustering scheme to adjust their ranking. The results show that the two-stage clustering scheme improves the document ranking and that the processing time is short.
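
The two-stage scheme described in the summary can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes snippets are already tokenized into keyword lists, uses plain term-frequency vectors scaled to unit length (the thesis's actual weighting and keyword filtering are not given here), and takes the group-average similarity to be the mean dot product over all distinct snippet pairs in a merged cluster. All function names and the sample data are hypothetical.

```python
import math
from itertools import combinations

def split_by_query(snippets, query_terms):
    """Stage 1: indices of snippets with every query term, then the rest."""
    full, partial = [], []
    for idx, tokens in enumerate(snippets):
        (full if set(query_terms) <= set(tokens) else partial).append(idx)
    return full, partial

def unit_vector(tokens, vocab):
    """Term-frequency vector scaled to unit length (weighting is an assumption)."""
    counts = [tokens.count(t) for t in vocab]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def merged_similarity(a, b):
    """Group-average similarity of a merged with b, plus the summed vector.

    For unit vectors, |sum|^2 - n equals twice the sum of all pairwise
    dot products, so dividing by n * (n - 1) gives their average.
    """
    n = len(a["members"]) + len(b["members"])
    s = [x + y for x, y in zip(a["sum"], b["sum"])]
    return (sum(x * x for x in s) - n) / (n * (n - 1)), s

def gaac_order(ids, vectors, threshold):
    """Stage 2: merge while the best similarity is above the threshold,
    then order clusters by their last combination similarity."""
    clusters = [{"members": [i], "sum": v, "sim": 0.0} for i, v in zip(ids, vectors)]
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: merged_similarity(clusters[p[0]], clusters[p[1]])[0])
        sim, vec_sum = merged_similarity(clusters[i], clusters[j])
        if sim <= threshold:
            break
        # Inside the new cluster, the subcluster with the higher last
        # combination similarity goes first (ties keep the existing order).
        first, second = sorted((clusters[i], clusters[j]), key=lambda c: -c["sim"])
        merged = {"members": first["members"] + second["members"],
                  "sum": vec_sum, "sim": sim}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    grouped = sorted((c for c in clusters if len(c["members"]) > 1),
                     key=lambda c: -c["sim"])
    singles = sorted((c for c in clusters if len(c["members"]) == 1),
                     key=lambda c: c["members"][0])  # initial snippet order
    return [m for c in grouped + singles for m in c["members"]]

def rerank(snippets, query_terms, threshold):
    """Return snippet indices in the adjusted ranking order."""
    full, partial = split_by_query(snippets, query_terms)
    vocab = sorted({t for i in full for t in snippets[i]})
    vectors = [unit_vector(snippets[i], vocab) for i in full]
    return gaac_order(full, vectors, threshold) + partial

# Hypothetical tokenized snippets, listed in their original ranking order.
snippets = [
    ["taipei", "metro", "fare"],
    ["taipei", "election"],
    ["metro", "strike"],
    ["taipei", "metro", "fare", "rise"],
    ["taipei", "weather", "rain"],
    ["taipei", "weather", "rain", "typhoon", "alert"],
]
ranking = rerank(snippets, ["taipei"], threshold=0.6)
```

With this sample data and a threshold of 0.6, `ranking` is `[0, 3, 4, 5, 1, 2]`: the two metro-fare snippets form the tightest cluster and come first, the two weather snippets follow, the singleton election snippet is placed behind the clusters, and the snippet lacking the query keyword comes last. The exhaustive pair search on every merge keeps the sketch short but scales poorly; for more than a small number of returned snippets, a priority-queue variant would be the usual choice.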