Automatic Supervised Thesauri Construction with Roget’s Thesaurus

Thesauri are important tools for many Natural Language Processing applications. Roget's Thesaurus is particularly useful. It is of high quality and has been in development for over a century and a half. Yet its applications have been limited, largely because the only publicly available edition...


Bibliographic Details
Main Author: Kennedy, Alistair H
Other Authors: Szpakowicz, Stan
Language: en
Published: Université d'Ottawa / University of Ottawa 2012
Subjects:
Online Access: http://hdl.handle.net/10393/23573
http://dx.doi.org/10.20381/ruor-6250
id ndltd-uottawa.ca-oai-ruor.uottawa.ca-10393-23573
record_format oai_dc
spelling ndltd-uottawa.ca-oai-ruor.uottawa.ca-10393-23573 2018-01-05T19:01:27Z 2012-12-07T19:08:03Z 2012 Thesis
collection NDLTD
language en
sources NDLTD
topic Roget's Thesaurus
Natural Language Processing
Distributional Semantics
Thesauri construction
description Thesauri are important tools for many Natural Language Processing applications. Roget's Thesaurus is particularly useful. It is of high quality and has been in development for over a century and a half. Yet its applications have been limited, largely because the only publicly available edition dates from 1911. This thesis proposes and tests methods of automatically updating the vocabulary of the 1911 Roget’s Thesaurus. I use the Thesaurus as a source of training data in order to learn from Roget’s for the purpose of updating Roget’s. The lexicon is updated in two stages. First, I develop a measure of semantic relatedness that enhances existing distributional techniques. I improve existing methods by using known sets of synonyms from Roget’s to train a distributional measure to better identify near synonyms. Second, I use the new measure of semantic relatedness to find where in Roget’s to place a new word. Existing words from Roget’s are used as training data to tune the parameters of three methods of inserting words. Over 5000 new words and word-senses were added using this process. I conduct two kinds of evaluation on the updated Thesaurus. One is on the procedure for updating Roget’s. This is accomplished by removing some words from the Thesaurus and testing my system's ability to reinsert them in the correct location. Human evaluation of the newly added words is also performed. Annotators must determine whether a newly added word is in the correct location. They found that in most cases the new words were almost indistinguishable from those already existing in Roget's Thesaurus.
The second kind of evaluation is to establish the usefulness of the updated Roget’s Thesaurus on actual Natural Language Processing applications. These applications include determining semantic relatedness between word pairs or sentence pairs, identifying the best synonym from a set of candidates, solving SAT-style analogy problems, pseudo-word-sense disambiguation, and sentence ranking for text summarization. The updated Thesaurus consistently performed at least as well as, or better than, the original Thesaurus on all these applications.
author2 Szpakowicz, Stan
author Kennedy, Alistair H
title Automatic Supervised Thesauri Construction with Roget’s Thesaurus
publisher Université d'Ottawa / University of Ottawa
publishDate 2012
url http://hdl.handle.net/10393/23573
http://dx.doi.org/10.20381/ruor-6250