A Practical q-Gram Index for Text Retrieval Allowing Errors

We propose an indexing technique for approximate text searching that is practical, powerful, and especially optimized for natural language text. Unlike other indices of this kind, it can retrieve any string that approximately matches the search pattern, not only words. Every text substr...


Bibliographic Details
Main Authors:	Gonzalo Navarro, Ricardo Baeza-Yates
Format:	Article
Language:	English
Published:	Centro Latinoamericano de Estudios en Informática 2018-09-01
Series:	CLEI Electronic Journal
Online Access:	http://clei.org/cleiej-beta/index.php/cleiej/article/view/379

id	doaj-dee15db6434c434e9fbdd377beb9b863
record_format	Article
spelling	doaj-dee15db6434c434e9fbdd377beb9b863 2020-11-25T02:34:21Z eng Centro Latinoamericano de Estudios en Informática CLEI Electronic Journal 0717-5000 2018-09-01 1 2 10.19153/cleiej.1.2.3 A Practical q-Gram Index for Text Retrieval Allowing Errors Gonzalo Navarro 0 Ricardo Baeza-Yates 1 Depto. de Ciencias de la Computaci Depto. de Ciencias de la Computacion, U. de Chile http://clei.org/cleiej-beta/index.php/cleiej/article/view/379
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Gonzalo Navarro Ricardo Baeza-Yates
spellingShingle	Gonzalo Navarro Ricardo Baeza-Yates A Practical q-Gram Index for Text Retrieval Allowing Errors CLEI Electronic Journal
author_facet	Gonzalo Navarro Ricardo Baeza-Yates
author_sort	Gonzalo Navarro
title	A Practical q-Gram Index for Text Retrieval Allowing Errors
title_short	A Practical q-Gram Index for Text Retrieval Allowing Errors
title_full	A Practical q-Gram Index for Text Retrieval Allowing Errors
title_fullStr	A Practical q-Gram Index for Text Retrieval Allowing Errors
title_full_unstemmed	A Practical q-Gram Index for Text Retrieval Allowing Errors
title_sort	practical q-gram index for text retrieval allowing errors
publisher	Centro Latinoamericano de Estudios en Informática
series	CLEI Electronic Journal
issn	0717-5000
publishDate	2018-09-01
description	We propose an indexing technique for approximate text searching that is practical, powerful, and especially optimized for natural language text. Unlike other indices of this kind, it can retrieve any string that approximately matches the search pattern, not only words. Every text substring of a fixed length q is stored in the index, together with pointers to all the text positions where it appears. The search pattern is partitioned into pieces, which are searched in the index, and all their occurrences in the text are verified for a complete match. To reduce space requirements, pointers to blocks can be used instead of exact positions, at the price of higher querying costs. We design an algorithm that optimizes the partition of the pattern into pieces so that the total number of verifications is minimized. This is especially well suited to natural language texts, and makes it possible to know in advance the expected cost of the search and the expected relevance of the query to the user. We show experimentally the building time, space requirements and querying time of our index, finding that it is a practical alternative for text retrieval. Retrieval times are reduced to between 10% and 60% of those of the best on-line algorithm.
url	http://clei.org/cleiej-beta/index.php/cleiej/article/view/379
work_keys_str_mv	AT gonzalonavarro apracticalqgramindexfortextretrievalallowingerrors AT ricardobaezayates apracticalqgramindexfortextretrievalallowingerrors AT gonzalonavarro practicalqgramindexfortextretrievalallowingerrors AT ricardobaezayates practicalqgramindexfortextretrievalallowingerrors
_version_	1724809467135000576
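
The scheme summarized in the description field (index every q-gram with its text positions, split the pattern into pieces, look the pieces up, verify each candidate) can be illustrated with a small sketch. This is not the authors' implementation. It assumes a plain equal-length split into k+1 pieces rather than the optimized partition, exact positions rather than block pointers, and pieces at least q characters long. All function names are hypothetical.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every length-q substring of text to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def split_pattern(pattern, k):
    """Split pattern into k+1 nearly equal pieces; return (offset, piece) pairs.

    Assumption: a simple even split, not the optimized partition of the paper.
    """
    m, n = len(pattern), k + 1
    bounds = [m * j // n for j in range(n + 1)]
    return [(bounds[j], pattern[bounds[j]:bounds[j + 1]]) for j in range(n)]


def match_ends(pattern, window, k):
    """Classic dynamic programming: indices in window where a match with <= k errors ends."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, c in enumerate(window):
        prev_diag = col[0]
        col[0] = 0  # a match may start at any window position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match or substitution
                         old + 1,                            # extra text character
                         col[i - 1] + 1)                     # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def approx_search(text, index, q, pattern, k):
    """Return sorted text positions where an occurrence with <= k errors ends."""
    m = len(pattern)
    found = set()
    for offset, piece in split_pattern(pattern, k):
        if len(piece) < q:
            raise ValueError("this sketch does not handle pieces shorter than q")
        # Any occurrence with <= k errors contains at least one piece unaltered,
        # so the first q characters of that piece appear in the index.
        for pos in index.get(piece[:q], ()):
            lo = max(0, pos - offset - k)
            hi = min(len(text), pos - offset + m + k)
            for e in match_ends(pattern, text[lo:hi], k):
                found.add(lo + e)
    return sorted(found)
```

A call such as approx_search(text, build_qgram_index(text, 2), 2, "qwick", 1) verifies only the text areas around indexed occurrences of a piece prefix, which is where the savings over scanning the whole text would come from.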


Similar Items