Document Identification Reassignment for Inverted File Compression

Master's === National Chiao Tung University === Department of Computer Science and Information Engineering === 88 === The inverted file is one of the most popular mechanisms for speeding up document search in an Information Retrieval System (IRS). However, the inverted file is usually enormous. Therefore, compressing the inverted file becomes one of the...

Bibliographic Details
Main Author: 戴憲文
Other Authors: 單智君
Format: Others
Language: en_US
Published: 2000
Online Access: http://ndltd.ncl.edu.tw/handle/14763100354300066997
id ndltd-TW-088NCTU0392045
record_format oai_dc
spelling ndltd-TW-088NCTU0392045 2015-10-13T10:59:52Z http://ndltd.ncl.edu.tw/handle/14763100354300066997 Document Identification Reassignment for Inverted File Compression 改進轉置檔壓縮之文件辨識碼重編技術 戴憲文 Master's National Chiao Tung University Department of Computer Science and Information Engineering 88 單智君 2000 thesis 88 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Chiao Tung University === Department of Computer Science and Information Engineering === 88 === The inverted file is one of the most popular mechanisms for speeding up document search in an Information Retrieval System (IRS). However, the inverted file is usually enormous. Therefore, compressing the inverted file becomes one of the most effective ways to reduce both the space cost and the amount of data that must be processed. Traditionally, the d-gap technique proposed by Moffat is applied to an inverted file to replace document identifications (document IDs) with smaller numbers, namely the differences between successive IDs in each inverted list. These smaller numbers can then be effectively encoded by a prefix code to reduce the size of the inverted file. In this thesis, we propose two improvements to the compression procedure to increase the compression rate of the inverted file. First, owing to the clustering property of documents, a document ID reassignment procedure is proposed to reduce the gaps in the original inverted file. In this procedure, a relation graph is constructed from the original inverted lists to represent the relations among documents. Then, a heuristic TSP algorithm is used to traverse the graph and obtain a new order of document IDs. After document ID reassignment, a prefix code is applied to the new inverted file. Second, for encoding techniques, we propose two division-based notations that represent a gap as a quotient and a remainder, and then encode these values by a prefix code. The simulation results show that document ID reassignment improves the compression rate by about 15%. In addition, the division-based encoding technique further increases the compression rate by about 3%.
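The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract names neither the prefix code nor the TSP heuristic, so the Elias gamma code, the co-occurrence edge weights, and the greedy nearest-neighbour tour used here are assumptions, and all function names (`d_gaps`, `gamma_bits`, `index_bits`, `reassign`, `remap`) are hypothetical.

```python
from itertools import combinations


def d_gaps(postings):
    """Replace a sorted list of document IDs by the gaps between successive IDs."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def gamma_bits(n):
    """Bit length of the Elias gamma code of a positive integer (assumed prefix code)."""
    return 2 * (n.bit_length() - 1) + 1


def index_bits(inverted):
    """Total gamma-coded size, in bits, of all d-gaps in an inverted file."""
    return sum(gamma_bits(g)
               for plist in inverted.values()
               for g in d_gaps(sorted(plist)))


def reassign(inverted):
    """Return an old-ID -> new-ID mapping from a greedy tour of the relation graph.

    Edge weight = number of inverted lists two documents share (an assumption).
    The tour repeatedly moves to the unvisited document most related to the
    current one, a simple nearest-neighbour TSP heuristic.
    """
    docs = sorted({d for plist in inverted.values() for d in plist})
    weight = {}
    for plist in inverted.values():
        for a, b in combinations(sorted(plist), 2):
            weight[(a, b)] = weight.get((a, b), 0) + 1

    def related(a, b):
        return weight.get((min(a, b), max(a, b)), 0)

    tour, remaining = [docs[0]], set(docs[1:])
    while remaining:
        last = tour[-1]
        # Ties are broken in favour of the smallest original ID.
        nxt = max(sorted(remaining), key=lambda d: related(last, d))
        tour.append(nxt)
        remaining.remove(nxt)
    return {old: new for new, old in enumerate(tour, start=1)}


def remap(inverted, mapping):
    """Rewrite every inverted list with the reassigned document IDs."""
    return {term: sorted(mapping[d] for d in plist)
            for term, plist in inverted.items()}
```

For the toy index `{"a": [1, 3, 5], "b": [2, 4, 6], "c": [1, 5]}`, `index_bits` gives 22 bits before reassignment and 12 bits after `remap(index, reassign(index))`, because documents that share terms receive adjacent IDs and their gaps shrink to 1. Real collections will not show the same ratio.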
author2 單智君
author_facet 單智君
戴憲文
author 戴憲文
spellingShingle 戴憲文
Document Identification Reassignment for Inverted File Compression
author_sort 戴憲文
title Document Identification Reassignment for Inverted File Compression
title_short Document Identification Reassignment for Inverted File Compression
title_full Document Identification Reassignment for Inverted File Compression
title_fullStr Document Identification Reassignment for Inverted File Compression
title_full_unstemmed Document Identification Reassignment for Inverted File Compression
title_sort document identification reassignment for inverted file compression
publishDate 2000
url http://ndltd.ncl.edu.tw/handle/14763100354300066997
work_keys_str_mv AT dàixiànwén documentidentificationreassignmentforinvertedfilecompression
AT dàixiànwén gǎijìnzhuǎnzhìdàngyāsuōzhīwénjiànbiànshímǎzhòngbiānjìshù
_version_ 1716835362648621056