Summary: | Master's === National Chengchi University === Department of Computer Science === 104 === Difangzhi are local gazetteers compiled by local governments in China. Their content is rich and extensive, and it includes much information that has not yet been extracted, such as biographical information, geographical information, and records of officials. Because the language of the Difangzhi corpus differs from modern Chinese, current natural language processing tools should not be applied to it directly. To extract biographical information, we build models that recognize named entities, and we use noun lists to support our annotation of the Difangzhi corpus.
In this study, we use supervised learning to build our models. First, we generate the training data. A manually verified list of personal information and a set of noun lists give us reliable information for annotating words in the Difangzhi corpus. These lists still contain some noise, so we preprocess them to clean them. Ambiguity then arises when we annotate the corpus, and we propose three annotation methods that include disambiguation. We use the annotated corpus to generate training data and train conditional random field (CRF) models. In our experiments, the models produced by the three annotation methods predict a label for each character in the Difangzhi test corpus. From the labeled results, we extract person names and address names for evaluation. The precision of person name recognition is over 80%, and the precision of address name recognition is about 86%. Because the training corpus and the test corpus are quite similar, the models perform well. We therefore use the labeled results to find associations between person names and address names: we link them with a simple method and evaluate a sample of the results. The sample shows that person names and address names can be linked correctly for some specific grammatical patterns. To analyze the text more deeply, we attempt to split the Difangzhi corpus into clauses, using a finite state machine model to recognize the beginnings of clauses. Although the results show that we can find some clause beginnings, our method still misses many of them.
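The list-based annotation step and the extraction of names from character labels can be illustrated with a minimal sketch. It is not the implementation used in this study: the abstract does not specify the three disambiguation methods, so the sketch assumes a greedy longest-match rule, B/I/O character labels, and a tiny hypothetical lexicon. The names `annotate`, `extract_entities`, `PER`, and `LOC` are illustrative only.

```python
from typing import Dict, List, Tuple


def annotate(text: str, lexicon: Dict[str, str]) -> List[str]:
    """Label every character with B-/I-/O tags using greedy longest match.

    Assumption: when several lexicon entries start at the same position,
    the longest one wins. This stands in for a disambiguation method.
    """
    labels = ["O"] * len(text)
    max_len = max((len(word) for word in lexicon), default=0)
    i = 0
    while i < len(text):
        matched = False
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            tag = lexicon.get(text[i:i + n])
            if tag is not None:
                labels[i] = "B-" + tag
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + tag
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


def extract_entities(text: str, labels: List[str]) -> List[Tuple[str, str]]:
    """Collect (surface string, tag) pairs from character-level B/I/O labels."""
    entities: List[Tuple[str, str]] = []
    start, tag = None, ""
    # A trailing "O" sentinel closes an entity that ends at the last character.
    for i, label in enumerate(labels + ["O"]):
        if start is not None and not label.startswith("I-"):
            entities.append((text[start:i], tag))
            start = None
        if label.startswith("B-"):
            start, tag = i, label[2:]
    return entities


# Hypothetical lexicon and sentence, for illustration only.
lexicon = {"王守仁": "PER", "王守": "PER", "餘姚": "LOC"}
sentence = "王守仁餘姚人"
labels = annotate(sentence, lexicon)
entities = extract_entities(sentence, labels)
```

In a supervised setup such as the one described, the character labels produced by `annotate` would serve as training targets for a CRF, and `extract_entities` would be applied to the labels the model predicts on the test corpus.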
In future work, we plan to use more information to annotate the Difangzhi corpus and to revise our disambiguation methods to improve recognition. To obtain more information about the people in the corpus, we will try to split paragraphs and sentences more precisely. We will also analyze the grammar of the corpus to find useful patterns that connect person names with other entities, such as address names and office names, so that information about the people who appear in the corpus can be generated automatically.
|