A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning

Master's === National Central University === Institute of Computer Science and Information Engineering === 97 === Chinese word segmentation is one of the major preprocessing steps in Chinese text processing. Because written Chinese has no explicit word boundaries, the main goal of Chinese word segmentation is to identify words. There are two major problems in word seg...

Bibliographic Details
Main Authors:	Chieh-Cheng Yang, 楊傑程
Other Authors:	Chia-hui Chang
Format:	Others
Language:	zh-TW
Published:	2009
Online Access:	http://ndltd.ncl.edu.tw/handle/28083382194737352825

id	ndltd-TW-097NCU05392019
record_format	oai_dc
spelling	ndltd-TW-097NCU05392019 2016-05-02T04:12:04Z http://ndltd.ncl.edu.tw/handle/28083382194737352825 A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning 應用樣式探勘與機器學習方法於中文未知詞擷取之研究 Chieh-Cheng Yang 楊傑程 Master's, National Central University, Institute of Computer Science and Information Engineering, 97 Chinese word segmentation is one of the major preprocessing steps in Chinese text processing. Because written Chinese has no explicit word boundaries, the main goal of Chinese word segmentation is to identify words. There are two major problems in word segmentation: ambiguities and unknown (out-of-vocabulary) words. In this paper, we focus on the Chinese unknown word problem. We use a two-phase approach: the first phase performs unknown word detection and the second performs unknown word extraction. In the detection phase, we apply continuity pattern mining to derive a set of rules from a corpus, based on a more complete set of pattern types. These rules determine whether a Chinese character is a monosyllabic word or part of an unknown word. In the extraction phase, we use machine learning algorithms to determine whether a detected morpheme should be merged with adjacent words to form an unknown word. Our classification models use features based on syntactic, contextual, and statistical information. Three classification models (2-gram, 3-gram, and 4-gram) are constructed, together with rules to resolve overlaps and conflicts. We use the Academia Sinica Balanced Corpus as our experimental data. Without much assistance from hand-crafted rules, our experimental results (F-measure 0.657) are comparable to those of Academia Sinica (F-measure 0.648). Finally, we also demonstrate the importance of detection in our two-phase approach. Chia-hui Chang 張嘉惠 2009 thesis 48 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Central University === Institute of Computer Science and Information Engineering === 97 === Chinese word segmentation is one of the major preprocessing steps in Chinese text processing. Because written Chinese has no explicit word boundaries, the main goal of Chinese word segmentation is to identify words. There are two major problems in word segmentation: ambiguities and unknown (out-of-vocabulary) words. In this paper, we focus on the Chinese unknown word problem. We use a two-phase approach: the first phase performs unknown word detection and the second performs unknown word extraction. In the detection phase, we apply continuity pattern mining to derive a set of rules from a corpus, based on a more complete set of pattern types. These rules determine whether a Chinese character is a monosyllabic word or part of an unknown word. In the extraction phase, we use machine learning algorithms to determine whether a detected morpheme should be merged with adjacent words to form an unknown word. Our classification models use features based on syntactic, contextual, and statistical information. Three classification models (2-gram, 3-gram, and 4-gram) are constructed, together with rules to resolve overlaps and conflicts. We use the Academia Sinica Balanced Corpus as our experimental data. Without much assistance from hand-crafted rules, our experimental results (F-measure 0.657) are comparable to those of Academia Sinica (F-measure 0.648). Finally, we also demonstrate the importance of detection in our two-phase approach.
author2	Chia-hui Chang
author_facet	Chia-hui Chang Chieh-Cheng Yang 楊傑程
author	Chieh-Cheng Yang 楊傑程
spellingShingle	Chieh-Cheng Yang 楊傑程 A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning
author_sort	Chieh-Cheng Yang
title	A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning
publishDate	2009
url	http://ndltd.ncl.edu.tw/handle/28083382194737352825
_version_	1718254603682709504
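
The extraction phase described in the abstract (n-gram candidates around a detected morpheme, plus rules to resolve overlaps and conflicts) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the record does not give the thesis's actual rules, features, or classifiers, so `score_fn` stands in for a trained classifier, the greedy highest-score-first overlap rule is a simple substitute for the thesis's own conflict rules, and all function names are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def candidate_spans(n_tokens: int, flagged: Set[int],
                    sizes: Tuple[int, ...] = (2, 3, 4)) -> List[Span]:
    """Enumerate 2-, 3-, and 4-gram spans that contain a flagged token.

    `flagged` holds indices that a (hypothetical) detection phase marked
    as possibly being part of an unknown word.
    """
    spans = []
    for n in sizes:
        for start in range(n_tokens - n + 1):
            end = start + n
            if any(i in flagged for i in range(start, end)):
                spans.append((start, end))
    return spans


def resolve_overlaps(scored: List[Tuple[Span, float]],
                     threshold: float = 0.5) -> List[Span]:
    """Greedy stand-in for overlap rules: keep the best-scoring spans
    that do not overlap any span already kept."""
    chosen: List[Span] = []
    for span, score in sorted(scored, key=lambda x: (-x[1], x[0])):
        if score < threshold:
            continue
        if all(span[1] <= c[0] or span[0] >= c[1] for c in chosen):
            chosen.append(span)
    return sorted(chosen)


def merge_tokens(tokens: List[str], spans: List[Span]) -> List[str]:
    """Join the tokens inside each chosen span into a single word."""
    ends = {start: end for start, end in spans}
    out, i = [], 0
    while i < len(tokens):
        if i in ends:
            out.append("".join(tokens[i:ends[i]]))
            i = ends[i]
        else:
            out.append(tokens[i])
            i += 1
    return out


def extract(tokens: List[str], flagged: Set[int],
            score_fn: Callable[[List[str]], float],
            threshold: float = 0.5) -> List[str]:
    """Run candidate generation, scoring, overlap resolution, and merging."""
    spans = candidate_spans(len(tokens), flagged)
    scored = [(s, score_fn(tokens[s[0]:s[1]])) for s in spans]
    return merge_tokens(tokens, resolve_overlaps(scored, threshold))
```

Called with an initial segmentation, the indices flagged by detection, and any scoring function, `extract` returns the segmentation with accepted candidates merged. Tokens that no accepted span covers are passed through unchanged.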

Similar Items