Vyhledávání v českých strukturovaných datech pomocí stemmingu (Searching Czech Structured Data using Stemming)

This work describes and implements a component for full-text searching with Czech diacritics restoration and stemming support. Diacritics restoration is based on statistical principles and is context dependent. This work presents five stemmers ready for immediate use (two algorithmic stemmers and three...

Bibliographic Details
Main Author: Tattermusch, Jan
Other Authors: Hlaváčová, Jaroslava
Format: Dissertation
Language: Czech
Published: 2010
Online Access: http://www.nusl.cz/ntk/nusl-298466
id ndltd-nusl.cz-oai-invenio.nusl.cz-298466
record_format oai_dc
spelling ndltd-nusl.cz-oai-invenio.nusl.cz-298466 2017-06-27T04:42:44Z Vyhledávání v českých strukturovaných datech pomocí stemmingu Searching Czech Structured Data using Stemming Tattermusch, Jan Hlaváčová, Jaroslava Kuboň, Vladislav This work describes and implements a component for full-text searching with Czech diacritics restoration and stemming support. Diacritics restoration is based on statistical principles and is context dependent. This work presents five stemmers ready for immediate use (two algorithmic stemmers and three hybrid stemmers) and discusses their properties. The component is implemented using the Apache Lucene library and provides a simple interface for querying and for inserting, deleting and updating indexed documents. Stored documents consist of named fields with predefined data types. Besides regular full-text queries, the component also supports non-trivial queries with additional constraints and allows the computation of the query result score to be customized. The component's performance is sufficient for medium-load applications: approximately 50 queries per second with a repository that contains 2.7 million documents. The contribution of stemming and diacritics restoration to the quality of full-text searching was measured using MAP and is significant. 2010 info:eu-repo/semantics/masterThesis http://www.nusl.cz/ntk/nusl-298466 cze info:eu-repo/semantics/restrictedAccess
collection NDLTD
language Czech
format Dissertation
sources NDLTD
description This work describes and implements a component for full-text searching with Czech diacritics restoration and stemming support. Diacritics restoration is based on statistical principles and is context dependent. This work presents five stemmers ready for immediate use (two algorithmic stemmers and three hybrid stemmers) and discusses their properties. The component is implemented using the Apache Lucene library and provides a simple interface for querying and for inserting, deleting and updating indexed documents. Stored documents consist of named fields with predefined data types. Besides regular full-text queries, the component also supports non-trivial queries with additional constraints and allows the computation of the query result score to be customized. The component's performance is sufficient for medium-load applications: approximately 50 queries per second with a repository that contains 2.7 million documents. The contribution of stemming and diacritics restoration to the quality of full-text searching was measured using MAP and is significant.
author2 Hlaváčová, Jaroslava
author_facet Hlaváčová, Jaroslava
Tattermusch, Jan
author Tattermusch, Jan
publishDate 2010
url http://www.nusl.cz/ntk/nusl-298466