Summary: | Master's thesis === Tamkang University (淡江大學) === Graduate Institute of Information Engineering (資訊工程研究所) === 82 === A Chinese sentence is composed of a string of Chinese
characters. However, the basic unit for parsing and analysis of
a Chinese sentence is not a Chinese character but a Chinese
word, the smallest meaningful unit. Because there are no
explicit blanks to mark words in a sentence, word
identification turns out to be a crucial preprocessing step in
Chinese natural language processing. In this paper, a new
Chinese word identification algorithm based on syntax is
proposed. According to syntax rules, a Chinese sentence is
segmented into meaningful words and the ambiguity of word
identification is resolved properly. The algorithm consists of
three phases. First, the lexicon is used to pre-identify all
possible word segmentations. Second, morphological rules,
including general rules and determinative-measure compound
rules, are applied to improve the correctness of the word
segmentations. Third, a chart parser identifies the word
segmentations that agree with the given syntax. To further
improve the correctness of word segmentation, sentence
structures for predicates are also provided. In the
experimental results, the recall
rate of words is 85.64% and the precision rate of words is
94.31%. Analyzing the low recall rate, we found three major
sources of error in word identification. First, the test data
contain unknown words that are not present in the lexicon.
Second, the category information provided in the lexicon is
incomplete. Third, the syntax rules we obtained are
insufficient.
|
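
As an illustration of the first phase and of the reported measures, the following is a minimal sketch, not the thesis's implementation. It enumerates every lexicon-based segmentation of a sentence and scores one segmentation against a reference by word-level precision and recall. The names `all_segmentations`, `word_spans`, `precision_recall`, the toy lexicon, the `max_len` limit, and the choice to always allow single characters as fallback words are assumptions made for this example; the morphological rules and the chart parser of the later phases are not modelled.

```python
from typing import List, Set, Tuple


def all_segmentations(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[List[str]]:
    """Enumerate every split of `sentence` into candidate words.

    Assumption: a single character is always accepted as a word, so that
    every sentence has at least one segmentation even with a small lexicon.
    """
    results: List[List[str]] = []

    def extend(start: int, words: List[str]) -> None:
        if start == len(sentence):
            results.append(list(words))
            return
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 1, last_end + 1):
            piece = sentence[start:end]
            if end - start == 1 or piece in lexicon:
                words.append(piece)
                extend(end, words)
                words.pop()

    extend(0, [])
    return results


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into (start, end) character spans."""
    spans: Set[Tuple[int, int]] = set()
    pos = 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def precision_recall(predicted: List[str], reference: List[str]) -> Tuple[float, float]:
    """Word-level precision and recall: a word counts as correct only if
    both of its boundaries match the reference segmentation."""
    pred, ref = word_spans(predicted), word_spans(reference)
    correct = len(pred & ref)
    return correct / len(pred), correct / len(ref)


# Hypothetical toy lexicon and an ambiguous sentence.
lexicon = {"研究", "研究生", "生命", "起源"}
candidates = all_segmentations("研究生命起源", lexicon)
for candidate in candidates:
    print(" / ".join(candidate))
```

Both `研究 / 生命 / 起源` and `研究生 / 命 / 起源` appear among the candidates, which is the kind of ambiguity the later phases are meant to resolve. Matching on character spans rather than on word strings keeps repeated words in a sentence from being counted twice.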