A Study on Syntax-Based Word Identification for Mandarin Chinese
Master's === Tamkang University === Graduate Institute of Information Engineering === 82 === A Chinese sentence is composed of a string of Chinese characters. However, the basic unit for parsing and analysis of a Chinese sentence is not a Chinese character but a Chinese word, the smallest meaningful unit. Becaus...
Main Authors: | Wang Sheng Chung, 王聖中 |
---|---|
Other Authors: | Horng Wen Bing |
Format: | Others |
Language: | zh-TW |
Published: | 1994 |
Online Access: | http://ndltd.ncl.edu.tw/handle/83823890561636041552 |
id |
ndltd-TW-082TKU00392015 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-082TKU00392015 2016-02-08T04:06:32Z http://ndltd.ncl.edu.tw/handle/83823890561636041552 A Study on Syntax-Based Word Identification for Mandarin Chinese 語法式中文斷詞之研究 Wang Sheng Chung 王聖中 Master's Tamkang University Graduate Institute of Information Engineering 82 A Chinese sentence is composed of a string of Chinese characters. However, the basic unit for parsing and analyzing a Chinese sentence is not the character but the word, the smallest meaningful unit. Because no explicit blanks mark word boundaries in a sentence, word identification is a crucial preprocessing step in Chinese natural language processing. In this paper, a new syntax-based algorithm for Chinese word identification is proposed. A Chinese sentence is segmented into meaningful words according to syntax rules, which also serve to resolve segmentation ambiguity. The algorithm consists of three phases. First, the lexicon is used to pre-identify all possible word segmentations. Second, morphological rules, including general rules and determinative-measure compound rules, are applied to improve the correctness of the segmentations. Finally, a chart parser identifies the segmentations that agree with the given syntax. To further improve segmentation correctness, sentence structures for predicates are also provided. In the experiments, word recall is 85.64% and word precision is 94.31%. Analyzing the low recall rate, we found three major causes of word identification errors. First, the test data contain unknown words that are not in the lexicon. Second, the category information provided in the lexicon is incomplete. Third, the syntax rules we obtained are insufficient. Horng Wen Bing 洪文斌 1994 thesis 93 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === Tamkang University === Graduate Institute of Information Engineering === 82 === A Chinese sentence is composed of a string of Chinese
characters. However, the basic unit for parsing and analyzing
a Chinese sentence is not the character but the word, the
smallest meaningful unit. Because no explicit blanks mark word
boundaries in a sentence, word identification is a crucial
preprocessing step in Chinese natural language processing. In
this paper, a new syntax-based algorithm for Chinese word
identification is proposed. A Chinese sentence is segmented
into meaningful words according to syntax rules, which also
serve to resolve segmentation ambiguity. The algorithm consists
of three phases. First, the lexicon is used to pre-identify all
possible word segmentations. Second, morphological rules,
including general rules and determinative-measure compound
rules, are applied to improve the correctness of the
segmentations. Finally, a chart parser identifies the
segmentations that agree with the given syntax. To further
improve segmentation correctness, sentence structures for
predicates are also provided. In the experiments, word recall
is 85.64% and word precision is 94.31%. Analyzing the low
recall rate, we found three major causes of word identification
errors. First, the test data contain unknown words that are not
in the lexicon. Second, the category information provided in
the lexicon is incomplete. Third, the syntax rules we obtained
are insufficient.
|
author2 |
Horng Wen Bing |
author_facet |
Horng Wen Bing Wang Sheng Chung 王聖中 |
author |
Wang Sheng Chung 王聖中 |
title |
A Study on Syntax-Based Word Identification for Mandarin Chinese |
publishDate |
1994 |
url |
http://ndltd.ncl.edu.tw/handle/83823890561636041552 |
_version_ |
1718182611905413120 |
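The abstract's first phase (using the lexicon to pre-identify all possible word segmentations) and its word-level recall and precision figures can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the toy lexicon, the `max_len` cap, the single-character fallback, and the span-based definition of precision and recall are all assumptions made for illustration. The morphological-rule and chart-parser phases are not shown.

```python
from typing import List, Set, Tuple


def all_segmentations(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[List[str]]:
    """Enumerate every split of `sentence` into lexicon words.

    Assumption: a single character is always accepted as a word, so that
    characters missing from the lexicon do not block the enumeration.
    """
    if not sentence:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, min(max_len, len(sentence)) + 1):
        word = sentence[:end]
        if end == 1 or word in lexicon:
            for rest in all_segmentations(sentence[end:], lexicon, max_len):
                results.append([word] + rest)
    return results


def spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into a set of (start, end) character spans."""
    out: Set[Tuple[int, int]] = set()
    pos = 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def precision_recall(predicted: List[str], gold: List[str]) -> Tuple[float, float]:
    """Word precision and recall, assuming a word counts as correct
    only when both of its boundaries match the gold segmentation."""
    p, g = spans(predicted), spans(gold)
    if not p or not g:
        return 0.0, 0.0
    correct = len(p & g)
    return correct / len(p), correct / len(g)


# Hypothetical toy lexicon and an ambiguous sentence.
toy_lexicon = {"研究", "研究生", "生命", "起源"}
candidates = all_segmentations("研究生命起源", toy_lexicon)
```

Here `candidates` contains both 研究 / 生命 / 起源 and 研究生 / 命 / 起源, which is the kind of ambiguity the later phases described in the abstract are meant to resolve.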