Creating Gazetteers of Artifact Entities by Bootstrapping Method

Bibliographic Details
Main Authors: SHUN-CHIEN CHENG, 鄭順謙
Other Authors: Chuan-Jie Lin
Format: Others
Language: zh-TW
Published: 2007
Online Access: http://ndltd.ncl.edu.tw/handle/88621042775914595807
Description
Summary: Master's thesis === National Taiwan Ocean University === Department of Computer Science and Engineering === 95 === This thesis studies named entity recognition for artifacts. Ten types of artifacts are defined. Because abundant resources exist for MOVIE artifacts, the performance of an NE recognizer is easier to verify on them. Starting from MOVIE name recognition, a method is proposed that creates gazetteers from the Internet with a bootstrapping algorithm; the method can be extended to the other artifact types. With a hand-built gazetteer, a gazetteer lookup that checks every quoted string in the development set achieved an F-measure of 48.2%. The idea of a “companion gazetteer” is proposed: for MOVIE names, the companion gazetteer is a list of names of persons relevant to the movie industry. After the proposed bootstrapping algorithm added new entries to the MOVIE gazetteer and the movie-related PERSON gazetteer, recognizing movie names in the development set with the final version of the MOVIE gazetteer achieved an F-measure of 62.6%. The data set used to create the MOVIE gazetteer was then shifted to the Internet. Two filtering rules were added to the bootstrapping algorithm to select movie names and movie-related person names more accurately. When a candidate string was not already in the gazetteer, it was judged immediately against Internet resources according to the filtering rules. Identifying MOVIE names this way achieved an F-measure of 69.6% on the development set and 55.6% on the test set, the best results in this thesis.
The use of context feature terms was also proposed, but it did not prove to be a good solution. Context feature terms were selected by conditional probability, corpus frequency, or chi-square, with the context window set to either the whole document or a 40-word passage. Two features corresponding to each feature term were then used to train a MOVIE name identifier by machine learning. The best performance was an F-measure of 63.7% on the development set, obtained when the context window was the whole document and chi-square values were used to select the top 100 context feature terms. A bootstrapping method was also proposed to select context feature terms for machine learning. Using the list of movie names in the development set as a seed, the best F-measure was 64.9% when some filtering rules were applied.
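
The summary does not spell out the bootstrapping procedure, so the following Python sketch is only one plausible reading of the companion-gazetteer idea, not the thesis's actual algorithm. It assumes that each document has already been reduced to its quoted strings and its person-name mentions (the `Doc` type is hypothetical), that a quoted string co-occurring with a known movie-related person becomes a MOVIE candidate, and that a person co-occurring with a known movie becomes a PERSON candidate. The `min_support` parameter is a hypothetical stand-in for the filtering rules, which the summary does not describe. The `f_measure` helper is the standard harmonic mean of precision and recall.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical pre-processed document (an assumption of this sketch)."""
    quoted: frozenset[str]   # quoted strings found in the text
    persons: frozenset[str]  # person-name mentions found in the text


def bootstrap(docs, movie_seed, person_seed, min_support=1, max_rounds=10):
    """Grow a MOVIE gazetteer and its companion PERSON gazetteer together.

    min_support is a hypothetical threshold: how many known entries of one
    gazetteer must appear in a document before its co-occurring strings are
    accepted into the other gazetteer.
    """
    movies, persons = set(movie_seed), set(person_seed)
    for _ in range(max_rounds):
        new_movies, new_persons = set(), set()
        for doc in docs:
            # Known movie people in the document: its quoted strings are
            # treated as movie-name candidates.
            if len(doc.persons & persons) >= min_support:
                new_movies |= doc.quoted - movies
            # Known movies in the document: its person mentions are
            # treated as movie-related person candidates.
            if len(doc.quoted & movies) >= min_support:
                new_persons |= doc.persons - persons
        if not new_movies and not new_persons:
            break  # nothing new was found, so the gazetteers have converged
        movies |= new_movies
        persons |= new_persons
    return movies, persons


def f_measure(predicted, gold):
    """Balanced F-measure of a predicted set against a gold-standard set."""
    true_positives = len(set(predicted) & set(gold))
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    docs = [
        Doc(frozenset({"A", "B"}), frozenset({"P1"})),
        Doc(frozenset({"B"}), frozenset({"P2"})),
        Doc(frozenset({"C"}), frozenset({"P2"})),
        Doc(frozenset({"X"}), frozenset({"P9"})),
    ]
    movies, persons = bootstrap(docs, {"A"}, {"P1"})
    print(sorted(movies), sorted(persons))
    print(f_measure(movies, {"A", "B", "D"}))
```

In this toy run the seed movie "A" and seed person "P1" pull in "B", then "P2", then "C", while the unconnected pair "X" and "P9" is never reached. With real web data a co-occurrence rule this loose would admit many non-movie strings, which is presumably why the thesis adds filtering rules.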