Document Identification Reassignment for Inverted File Compression

Bibliographic Details
Main Author: 戴憲文
Other Authors: 單智君
Format: Others
Language: en_US
Published: 2000
Online Access: http://ndltd.ncl.edu.tw/handle/14763100354300066997
Description
Summary: Master's === National Chiao Tung University === Department of Computer Science and Information Engineering === 88 === The inverted file is one of the most popular mechanisms for speeding up document search in an Information Retrieval System (IRS). However, the inverted file is usually enormous. Compressing it has therefore become one of the most effective ways to reduce both the space cost and the amount of data that must be processed. Traditionally, the d-gap technique proposed by Moffat is applied to an inverted file to replace document identifications (document IDs) with smaller numbers, namely the gaps between successive IDs in each inverted list. These smaller numbers can then be encoded effectively with a prefix code to reduce the size of the inverted file. In this thesis, we propose two improvements to the compression procedure that increase the compression rate of the inverted file. First, exploiting the clustering property of documents, we propose a document ID reassignment procedure that reduces the gaps in the original inverted file. In this procedure, a relation graph is constructed from the original inverted lists to represent the relations among documents. A heuristic TSP algorithm then traverses the graph to obtain a new order of document IDs. After document ID reassignment, a prefix code is applied to the new inverted file. Second, for the encoding step, we propose two division-based notations that represent a gap as a quotient and a remainder, and then encode these gaps with a prefix code. The simulation results show that document ID reassignment improves the compression rate by about 15%. In addition, the division-based encoding technique improves the compression rate by about a further 3%.
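
To make the first idea in the summary more concrete, the following is a minimal Python sketch of d-gap coding and a greedy document ID reassignment. It is an illustration under stated assumptions, not the procedure from the thesis: the edge weight (number of terms two documents share), the nearest-neighbour tour standing in for the "heuristic TSP algorithm", and the use of Elias gamma as the prefix code are all choices made here for the example, and every function name is hypothetical. The division-based notations are not covered.

```python
from collections import defaultdict
from itertools import combinations


def d_gaps(postings):
    """Replace a sorted list of document IDs by the gaps between successive IDs."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def gamma_bits(n):
    """Length in bits of the Elias gamma code of a positive integer n.

    Assumption: gamma stands in for the unspecified prefix code.
    """
    return 2 * (n.bit_length() - 1) + 1


def index_bits(inverted):
    """Total gamma-coded size of all d-gaps in the inverted file."""
    return sum(gamma_bits(g)
               for plist in inverted.values()
               for g in d_gaps(sorted(plist)))


def reassign(inverted):
    """Return a mapping old ID -> new ID from a greedy nearest-neighbour tour.

    Assumption: the relation graph weights each document pair by the number
    of inverted lists in which both documents appear.
    """
    weight = defaultdict(int)
    docs = set()
    for plist in inverted.values():
        docs.update(plist)
        for a, b in combinations(sorted(plist), 2):
            weight[(a, b)] += 1

    def w(a, b):
        return weight.get((min(a, b), max(a, b)), 0)

    current = min(docs)
    order, remaining = [current], docs - {current}
    while remaining:
        # Visit the most strongly related unvisited document next;
        # ties are broken in favour of the smaller original ID.
        nxt = max(remaining, key=lambda d: (w(current, d), -d))
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return {old: new for new, old in enumerate(order, start=1)}


def apply_mapping(inverted, mapping):
    """Rewrite every inverted list with the new document IDs."""
    return {term: sorted(mapping[d] for d in plist)
            for term, plist in inverted.items()}


# Toy inverted file: each term occurs in every third document.
toy = {"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]}
print(index_bits(toy), index_bits(apply_mapping(toy, reassign(toy))))
```

On this toy input the tour places documents that share a term next to each other, so most gaps shrink to 1 and the gamma-coded size falls from 25 bits to 17 bits. The figures describe only this constructed example and say nothing about the 15% improvement reported in the summary.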