A Study on Syntax-Based Word Identification for Mandarin Chinese

Bibliographic Details
Main Authors: Wang Sheng Chung, 王聖中
Other Authors: Horng Wen Bing
Format: Others
Language: zh-TW
Published: 1994
Online Access: http://ndltd.ncl.edu.tw/handle/83823890561636041552
Description
Summary: Master's thesis === Tamkang University === Graduate Institute of Information Engineering === 82 === A Chinese sentence is a string of Chinese characters. However, the basic unit for parsing and analyzing a Chinese sentence is not the character but the word, the smallest meaningful unit. Because no explicit blanks mark word boundaries in a sentence, word identification is a crucial preprocessing step in Chinese natural language processing. In this paper, a new syntax-based algorithm for Chinese word identification is proposed. Using syntax rules, the algorithm segments a Chinese sentence into meaningful words and resolves ambiguity in word identification. The algorithm consists of three phases. First, the lexicon is used to pre-identify all possible word segmentations. Second, morphological rules, including general rules and determinative-measure compound rules, are applied to improve the correctness of the word segmentations. Third, a chart parser identifies the word segmentations that agree with the given syntax. To further improve segmentation correctness, sentence structures for predicates are also provided. In the experiments, the word recall rate is 85.64% and the word precision rate is 94.31%. Analysis of the low recall rate revealed three major sources of error in word identification: first, the test data contain unknown words that are not in the lexicon; second, the category information provided in the lexicon is incomplete; third, the set of syntax rules obtained is not sufficient.
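Illustration: the first phase described in the summary (using a lexicon to pre-identify all possible word segmentations) and the word-level recall and precision figures can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation: the toy lexicon, the example sentence, the function names, the `max_len` limit, and the span-based definition of precision and recall are all hypothetical, since the record gives none of these details. The morphological-rule and chart-parser phases are not shown.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon; the lexicon used in the thesis is not given in this record.
TOY_LEXICON: Set[str] = {"研究", "研究生", "生", "生命", "命", "起源"}


def all_segmentations(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[List[str]]:
    """Enumerate every way to split `sentence` into words found in `lexicon`."""
    n = len(sentence)
    # results[i] holds all segmentations of the suffix sentence[i:].
    results: Dict[int, List[List[str]]] = {n: [[]]}
    for i in range(n - 1, -1, -1):
        found: List[List[str]] = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = sentence[i:j]
            if word in lexicon:
                # Prepend this word to every segmentation of the remainder.
                for rest in results[j]:
                    found.append([word] + rest)
        results[i] = found
    return results[0]


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall(predicted: List[str], reference: List[str]) -> Tuple[float, float]:
    """Word-level precision and recall, assuming a word counts as correct
    only when its exact character span appears in the reference."""
    p, r = word_spans(predicted), word_spans(reference)
    correct = len(p & r)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(r) if r else 0.0
    return precision, recall


candidates = all_segmentations("研究生命起源", TOY_LEXICON)
for seg in candidates:
    print(" / ".join(seg))
```

With this toy lexicon the sentence has several competing segmentations, which is the kind of ambiguity the later phases are described as resolving. Calling `precision_recall(candidate, reference)` on one candidate and a hand-made reference segmentation shows how a single misplaced boundary lowers both scores at once.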