A Study on Syntax-Based Word Identification for Mandarin Chinese


Bibliographic Details
Main Authors:	Wang Sheng Chung, 王聖中
Other Authors:	Horng Wen Bing
Format:	Others
Language:	zh-TW
Published:	1994
Online Access:	http://ndltd.ncl.edu.tw/handle/83823890561636041552

id	ndltd-TW-082TKU00392015
record_format	oai_dc
spelling	ndltd-TW-082TKU00392015 2016-02-08T04:06:32Z http://ndltd.ncl.edu.tw/handle/83823890561636041552 A Study on Syntax-Based Word Identification for Mandarin Chinese 語法式中文斷詞之研究 Wang Sheng Chung 王聖中 碩士 淡江大學 資訊工程研究所 82 Horng Wen Bing 洪文斌 1994 學位論文 ; thesis 93 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's thesis === Tamkang University (淡江大學) === Graduate Institute of Information Engineering (資訊工程研究所) === 82 === A Chinese sentence is a string of Chinese characters. However, the basic unit for parsing and analyzing a Chinese sentence is not the character but the word, the smallest meaningful unit. Because no explicit blanks mark word boundaries in a sentence, word identification is a crucial preprocessing step in Chinese natural language processing. This work proposes a new syntax-based Chinese word identification algorithm. Guided by syntax rules, it segments a Chinese sentence into meaningful words and resolves segmentation ambiguity. The algorithm has three phases. First, the lexicon is used to pre-identify all possible word segmentations. Second, morphological rules, including general rules and determinative-measure compound rules, are applied to improve the correctness of the segmentations. Finally, a chart parser identifies the segmentations that agree with the given syntax. To further improve segmentation correctness, sentence structures for predicates are also provided. In the experiments, word recall is 85.64% and word precision is 94.31%. Analysis of the low recall rate reveals three major sources of error: first, the test data contain unknown words that are absent from the lexicon; second, the category information in the lexicon is incomplete; third, the set of syntax rules obtained is insufficient.
author2	Horng Wen Bing
author_facet	Horng Wen Bing Wang Sheng Chung 王聖中
author	Wang Sheng Chung 王聖中
title	A Study on Syntax-Based Word Identification for Mandarin Chinese
publishDate	1994
url	http://ndltd.ncl.edu.tw/handle/83823890561636041552
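
The first phase described in the abstract (using the lexicon to pre-identify all possible word segmentations) and the word-level recall and precision figures can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the toy lexicon, the example sentence, the function names, and the span-matching definition of a "correct" word are all assumptions made for illustration, and the morphological-rule and chart-parser phases are not covered.

```python
from functools import lru_cache

# Hypothetical toy lexicon and sentence; the thesis's actual lexicon is not given here.
TOY_LEXICON = frozenset({"研究", "研究生", "生命", "命", "起源"})
TOY_SENTENCE = "研究生命起源"


def all_segmentations(sentence, lexicon):
    """Enumerate every way to split `sentence` into words found in `lexicon`."""
    max_len = max((len(w) for w in lexicon), default=0)

    @lru_cache(maxsize=None)
    def from_pos(i):
        # One empty segmentation once the end of the sentence is reached.
        if i == len(sentence):
            return ((),)
        results = []
        # Try candidate words of increasing length starting at position i.
        for j in range(i + 1, min(len(sentence), i + max_len) + 1):
            word = sentence[i:j]
            if word in lexicon:
                for rest in from_pos(j):
                    results.append((word,) + rest)
        return tuple(results)

    return [list(seg) for seg in from_pos(0)]


def word_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall(predicted, gold):
    """Word-level precision and recall, assuming a predicted word counts as
    correct only when its character span matches a gold-standard word exactly."""
    p, g = word_spans(predicted), word_spans(gold)
    correct = len(p & g)
    return correct / len(p), correct / len(g)


candidates = all_segmentations(TOY_SENTENCE, TOY_LEXICON)
```

With this toy lexicon the sentence has two candidate segmentations, 研究 / 生命 / 起源 and 研究生 / 命 / 起源. Choosing between such candidates is the kind of ambiguity the later phases (morphological rules and the chart parser) are described as resolving.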


Similar Items