A Study on Syntax-Based Word Identification for Mandarin Chinese

Master's thesis === Tamkang University === Graduate Institute of Information Engineering === 82 === A Chinese sentence is a string of Chinese characters. However, the basic unit for parsing and analyzing a Chinese sentence is not the character but the word, the smallest meaningful unit. Becaus...

Bibliographic Details
Main Authors: Wang Sheng Chung, 王聖中
Other Authors: Horng Wen Bing
Format: Others
Language: zh-TW
Published: 1994
Online Access: http://ndltd.ncl.edu.tw/handle/83823890561636041552
id ndltd-TW-082TKU00392015
record_format oai_dc
spelling ndltd-TW-082TKU00392015 2016-02-08T04:06:32Z http://ndltd.ncl.edu.tw/handle/83823890561636041552 A Study on Syntax-Based Word Identification for Mandarin Chinese 語法式中文斷詞之研究 Wang Sheng Chung 王聖中 碩士 淡江大學 資訊工程研究所 82 Horng Wen Bing 洪文斌 1994 學位論文 ; thesis 93 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's thesis === Tamkang University === Graduate Institute of Information Engineering === 82 === A Chinese sentence is a string of Chinese characters. However, the basic unit for parsing and analyzing a Chinese sentence is not the character but the word, the smallest meaningful unit. Because no explicit blanks mark word boundaries in a sentence, word identification is a crucial preprocessing step in Chinese natural language processing. This paper proposes a new syntax-based Chinese word identification algorithm. Guided by syntax rules, it segments a Chinese sentence into meaningful words and resolves word-identification ambiguity. The algorithm consists of three phases. First, the lexicon is used to pre-identify all possible word segmentations. Second, morphological rules, including general rules and determinative-measure compound rules, are applied to improve the correctness of the word segmentations. Finally, a chart parser identifies the word segmentations that agree with the given syntax. To further improve the correctness of word segmentation, sentence structures for predicates are also provided. In the experiments, word recall is 85.64% and word precision is 94.31%. Analysis of the low recall rate revealed three major sources of error in word identification: first, the test data contain unknown words that are not in the lexicon; second, the category information provided in the lexicon is incomplete; third, the set of syntax rules obtained is insufficient.
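As a rough illustration of the first phase described in the abstract (using a lexicon to pre-identify all possible word segmentations), the following Python sketch enumerates every split of a sentence into lexicon words and then scores one segmentation against a reference. It is not the thesis's implementation: the function names, the `max_len` cutoff, the toy lexicon, and the span-matching definition of word precision and recall are all assumptions made for this example. The morphological rules and the chart parser of phases two and three are not shown.

```python
from typing import List, Set, Tuple


def all_segmentations(sentence: str, lexicon: Set[str],
                      max_len: int = 4) -> List[List[str]]:
    """Enumerate every way to split `sentence` into words found in `lexicon`.

    `max_len` (an assumed cutoff) bounds the length of a candidate word.
    """
    results: List[List[str]] = []

    def extend(start: int, words: List[str]) -> None:
        if start == len(sentence):
            results.append(list(words))
            return
        last = min(len(sentence), start + max_len)
        for end in range(start + 1, last + 1):
            candidate = sentence[start:end]
            if candidate in lexicon:
                words.append(candidate)
                extend(end, words)
                words.pop()  # backtrack and try a longer candidate

    extend(0, [])
    return results


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into a set of (start, end) character spans."""
    spans = set()
    pos = 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def precision_recall(predicted: List[str],
                     reference: List[str]) -> Tuple[float, float]:
    """Word-level precision and recall, counting a word as correct when
    its character span matches a span in the reference (an assumed metric).
    """
    pred, ref = word_spans(predicted), word_spans(reference)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall
```

With the hypothetical lexicon `{"研究", "研究生", "生命", "命", "起源"}`, calling `all_segmentations("研究生命起源", lexicon)` yields two candidates, one beginning with 研究 and one beginning with 研究生. This is the kind of ambiguity that the later phases are meant to resolve.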
author2 Horng Wen Bing
author_facet Horng Wen Bing
Wang Sheng Chung
王聖中
author Wang Sheng Chung
王聖中
spellingShingle Wang Sheng Chung
王聖中
A Study on Syntax-Based Word Identification for Mandarin Chinese
author_sort Wang Sheng Chung
title A Study on Syntax-Based Word Identification for Mandarin Chinese
title_short A Study on Syntax-Based Word Identification for Mandarin Chinese
title_full A Study on Syntax-Based Word Identification for Mandarin Chinese
title_fullStr A Study on Syntax-Based Word Identification for Mandarin Chinese
title_full_unstemmed A Study on Syntax-Based Word Identification for Mandarin Chinese
title_sort study on syntax-based word identification for mandarin chinese
publishDate 1994
url http://ndltd.ncl.edu.tw/handle/83823890561636041552
work_keys_str_mv AT wangshengchung astudyonsyntaxbasedwordidentificationformandarinchinese
AT wángshèngzhōng astudyonsyntaxbasedwordidentificationformandarinchinese
AT wangshengchung yǔfǎshìzhōngwénduàncízhīyánjiū
AT wángshèngzhōng yǔfǎshìzhōngwénduàncízhīyánjiū
AT wangshengchung studyonsyntaxbasedwordidentificationformandarinchinese
AT wángshèngzhōng studyonsyntaxbasedwordidentificationformandarinchinese
_version_ 1718182611905413120