Inverted File Design for Large-Scale Information Retrieval System

Bibliographic Details
Main Authors: Cher-Sheng Cheng, 鄭哲聖
Other Authors: Jean Jyh-Jiun Shann
Format: Others
Language: en_US
Published: 2005
Online Access: http://ndltd.ncl.edu.tw/handle/35663252020207351257
id ndltd-TW-093NCTU5392126
record_format oai_dc
collection NDLTD
language en_US
format Others
sources NDLTD
description PhD === National Chiao Tung University === Department of Computer Science and Information Engineering === 93 === This dissertation investigates a variety of techniques for improving efficiency in information retrieval (IR). Information retrieval systems (IRSs) are widely used in applications such as search engines, digital libraries, and genomic sequence analysis. To search vast amounts of data efficiently, an IRS uses a compressed inverted file to locate the desired data quickly. An inverted file contains a posting list for each distinct term in the collection. The query processing time of a large-scale IRS is dominated by the time needed to read and decompress the posting list of each query term. Moreover, adding a document to the collection appends one document identifier to the posting list of each term appearing in the document, so the length of a posting list grows with the size of the document collection. This implies that the time needed to process posting lists increases as the collection grows. Efficient approaches for reducing the time needed to read, decompress, and merge posting lists are therefore the key issues in designing a large-scale IRS. This dissertation studies four research topics.

(1) Efficient coding method for inverted file size reduction. The first topic proposes a novel size reduction method for compressing inverted files. Compressing an inverted file can greatly improve query performance by reducing disk I/O, but it adds decompression time. The objective of this topic is to develop a method that combines a high compression ratio with fast decompression. The foundation is interpolative coding, which compresses document identifiers with a recursive process that exploits their clustering property and yields superior compression. However, interpolative coding is computationally expensive because its implementation requires a stack.
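The recursive structure that makes interpolative coding effective, and that makes a stack necessary, can be sketched as follows. This is an illustrative simplification of binary interpolative coding, not the dissertation's optimized coder: it records, for each document identifier, the bounded range it must fall in and the number of bits a fixed-length binary code for that range would cost.

```python
from math import ceil, log2

def interpolative_encode(ids, lo, hi, out):
    """Recursively emit (value, left, right, bits) for a sorted docID list.

    The middle element is coded first, relative to the tightest bounds
    implied by its position; both halves are then coded recursively.
    Runs of consecutive IDs (clusters) shrink the bounds to a single
    value and cost zero bits.
    """
    if not ids:
        return
    mid = len(ids) // 2
    x = ids[mid]
    # x must lie in [lo + mid, hi - (number of elements to its right)]
    left = lo + mid
    right = hi - (len(ids) - mid - 1)
    bits = 0 if left == right else ceil(log2(right - left + 1))
    out.append((x, left, right, bits))
    interpolative_encode(ids[:mid], lo, x - 1, out)
    interpolative_encode(ids[mid + 1:], x + 1, hi, out)
```

The recursion on both halves is exactly what a decoder must mirror with an explicit stack, which motivates the recursion elimination and loop unwinding proposed here.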
The key idea of the proposed method is to speed up the coding and decoding processes of interpolative coding through recursion elimination and loop unwinding. Experimental results show that the method provides fast decoding and excellent compression.

(2) Two-level skipped inverted file for redundant decoding elimination. The second topic proposes a two-level skipped inverted file, in which a two-level skipped index is created on each compressed posting list to reduce decompression time. A two-level skipped index can greatly reduce decompression time by skipping over unnecessary portions of the list, but well-known skipping mechanisms cannot implement it efficiently because of their high storage overheads. The objective of this topic is to develop a space-economical two-level skipped inverted file that eliminates redundant decoding and allows fast query evaluation. To this end, we propose a novel skipping mechanism based on block size calculation, which creates a skipped index on each compressed posting list with very little or no storage overhead, particularly when the posting list is divided into very small blocks. Combining this mechanism with well-known skipping mechanisms yields a two-level skipped index with very little storage overhead. Experimental results show that such an index allows extremely fast processing of both conjunctive Boolean queries and ranked queries.

(3) Document identifier assignment algorithm design for inverted file optimization. The third topic proposes a document identifier assignment (DIA) algorithm for fast query evaluation. We observe that a good DIA makes the document identifiers in the posting lists more clustered, which yields better compression as well as shorter query processing time.
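The effect behind this observation can be illustrated with a small hypothetical posting list: once an assignment makes a term's documents consecutive, the d-gaps collapse to 1 and cost almost nothing under a gap code such as Elias gamma. The lists and bit counts below are illustrative, not measurements from the dissertation.

```python
def dgaps(postings):
    """Turn a sorted docID list into d-gaps (first ID kept as-is)."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def gamma_bits(n):
    """Bits needed to Elias-gamma-code a positive integer n."""
    return 2 * n.bit_length() - 1

def list_bits(postings):
    """Total gamma-coded size of a posting list's d-gaps."""
    return sum(gamma_bits(g) for g in dgaps(postings))

scattered = [4, 13, 29, 60, 97]   # hypothetical IDs before reassignment
clustered = [4, 5, 6, 7, 8]       # the same five documents, consecutive IDs
```

Here `list_bits(clustered)` is a fraction of `list_bits(scattered)`, which is the compression gain a clustering DIA buys without any change to the decoder.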
The objective of this topic is to develop a fast algorithm that finds an optimal DIA minimizing the average query processing time of an IRS. In a typical IRS, the distribution of query terms is skewed. Based on this fact, we propose a partition-based DIA (PBDIA) algorithm, which efficiently assigns consecutive document identifiers to documents containing frequently used query terms. The posting lists of frequently used query terms can therefore be compressed better without increasing the complexity of the decoding process, which reduces query processing time.

(4) Inverted file partitioning for parallel IR. The fourth topic proposes an inverted file partitioning approach for parallel IR. In an IRS running on a cluster of workstations, the inverted file is generally partitioned into disjoint sub-files, one per workstation. When processing a query, each workstation consults only its own sub-file, in parallel. The objective of this topic is to develop a partitioning approach that minimizes the average query processing time of parallel query processing. The foundation is the interleaving partitioning scheme, which generates a partitioned inverted file with an interleaved mapping rule and produces near-ideal speedup. The key idea of the proposed approach is to use the PBDIA algorithm to enhance the clustering property of the posting lists of frequently used query terms before applying the interleaving partitioning scheme, which helps the scheme deliver superior query performance.

The results of this dissertation include:
• For inverted file size reduction, the proposed coding method allows a query throughput rate approximately 30% higher than the well-known Golomb coding while still providing superior compression.
• For redundant decoding elimination, the proposed two-level skipped inverted file improves query speed by up to 16% for conjunctive Boolean queries and by up to 44% for ranked queries, compared with a conventional one-level skipped inverted file.
• For inverted file optimization, the PBDIA algorithm takes only a few seconds to generate a DIA for a 1 GB collection and improves query speed by up to 25%.
• For parallel IR, the proposed approach further improves parallel query speed under the interleaving partitioning scheme by 14% to 17%, regardless of the number of workstations in the cluster.
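As a rough sketch of the interleaved mapping rule underlying topic (4), document identifiers can be distributed round-robin across workstations. The mod-k rule below is one plausible interleaving chosen for illustration; the dissertation's exact mapping rule may differ.

```python
def interleave_partition(posting_list, k):
    """Distribute one posting list over k workstations by docID mod k.

    If a clustering DIA (such as PBDIA) has already made the IDs of a
    frequently queried term consecutive, the mod-k rule spreads them
    almost perfectly evenly, so every workstation finishes its share of
    the query work at about the same time.
    """
    subs = [[] for _ in range(k)]
    for doc_id in posting_list:
        subs[doc_id % k].append(doc_id)
    return subs
```

For a fully clustered list the sub-lists differ in length by at most one, which is where the near-ideal parallel speedup comes from.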
author2 Jean Jyh-Jiun Shann
author_facet Jean Jyh-Jiun Shann
Cher-Sheng Cheng
鄭哲聖
author Cher-Sheng Cheng
鄭哲聖
spellingShingle Cher-Sheng Cheng
鄭哲聖
Inverted File Design for Large-Scale Information Retrieval System
author_sort Cher-Sheng Cheng
title Inverted File Design for Large-Scale Information Retrieval System
title_short Inverted File Design for Large-Scale Information Retrieval System
title_full Inverted File Design for Large-Scale Information Retrieval System
title_fullStr Inverted File Design for Large-Scale Information Retrieval System
title_full_unstemmed Inverted File Design for Large-Scale Information Retrieval System
title_sort inverted file design for large-scale information retrieval system
publishDate 2005
url http://ndltd.ncl.edu.tw/handle/35663252020207351257
work_keys_str_mv AT chershengcheng invertedfiledesignforlargescaleinformationretrievalsystem
AT zhèngzhéshèng invertedfiledesignforlargescaleinformationretrievalsystem
AT chershengcheng dàxíngzīxùnjiǎnsuǒxìtǒngzhīzhuǎnzhìdàngànshèjì
AT zhèngzhéshèng dàxíngzīxùnjiǎnsuǒxìtǒngzhīzhuǎnzhìdàngànshèjì
_version_ 1718294929673814016
spelling ndltd-TW-093NCTU53921262016-06-06T04:10:54Z http://ndltd.ncl.edu.tw/handle/35663252020207351257 Inverted File Design for Large-Scale Information Retrieval System 大型資訊檢索系統之轉置檔案設計 Cher-Sheng Cheng 鄭哲聖 Jean Jyh-Jiun Shann 單智君 2005 thesis 135 en_US