A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning
Master's === National Central University === Institute of Computer Science and Information Engineering === 97 === Chinese word segmentation is one of the major preprocessing steps in Chinese text processing. Because original Chinese texts lack word boundaries, the main goal of Chinese word segmentation is to identify words. There are two major problems in word seg...
Main Authors: | Chieh-Cheng Yang, 楊傑程 |
---|---|
Other Authors: | Chia-hui Chang |
Format: | Others |
Language: | zh-TW |
Published: | 2009 |
Online Access: | http://ndltd.ncl.edu.tw/handle/28083382194737352825 |
id |
ndltd-TW-097NCU05392019 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-097NCU05392019 2016-05-02T04:12:04Z http://ndltd.ncl.edu.tw/handle/28083382194737352825 A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning 應用樣式探勘與機器學習方法於中文未知詞擷取之研究 Chieh-Cheng Yang 楊傑程 Master's National Central University Institute of Computer Science and Information Engineering 97 Chia-hui Chang 張嘉惠 2009 Degree thesis ; thesis 48 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === National Central University === Institute of Computer Science and Information Engineering === 97 === Chinese word segmentation is one of the major preprocessing steps in Chinese text processing. Because original Chinese texts lack word boundaries, the main goal of Chinese word segmentation is to identify words. There are two major problems in word segmentation: ambiguities and unknown words (out-of-vocabulary words). In this paper, we focus on the Chinese unknown word problem. We use a two-phase approach to solve it: the first phase performs unknown word detection and the second performs unknown word extraction. In the detection phase, we apply continuity pattern mining to derive a set of rules from a corpus, based on more complete types of patterns. These rules determine whether a Chinese character is a monosyllabic word or part of an unknown word. In the extraction phase, we use machine learning algorithms to determine whether a detected morpheme should be merged with adjacent words to form an unknown word. The classification model uses features based on syntactic, contextual, and statistical information. Three classification models (2-gram, 3-gram, and 4-gram) are constructed, with rules to resolve overlaps and conflicts. We use the Academia Sinica Balanced Corpus as our experimental data. Without much assistance from manually crafted rules, our experimental results (F-measure 0.657) are comparable to those of Academia Sinica (F-measure 0.648). Finally, we also demonstrate the importance of detection in our two-phase approach. |
author2 |
Chia-hui Chang |
author_facet |
Chia-hui Chang Chieh-Cheng Yang 楊傑程 |
author |
Chieh-Cheng Yang 楊傑程 |
spellingShingle |
Chieh-Cheng Yang 楊傑程 A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning |
author_sort |
Chieh-Cheng Yang |
title |
A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning |
publishDate |
2009 |
url |
http://ndltd.ncl.edu.tw/handle/28083382194737352825 |
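
The extraction phase described in the abstract (merging a detected morpheme with adjacent words, using 2-gram, 3-gram, and 4-gram models) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names `merge_candidates` and `f_measure`, the token/flag input format, and the example sentence are all assumptions made for illustration. The sketch only enumerates candidate spans that contain at least one detected token; the classifier that would accept or reject each candidate, and the rules for overlaps and conflicts, are left out. The F-measure helper assumes the standard balanced harmonic mean of precision and recall, which the record does not spell out.

```python
from typing import List, Tuple


def merge_candidates(
    tokens: List[str], detected: List[bool], max_n: int = 4
) -> List[Tuple[int, int, str]]:
    """Enumerate spans of 2..max_n adjacent tokens that contain at least
    one token flagged by the (hypothetical) detection phase.

    Returns (start, end, text) triples, where end is exclusive.
    """
    candidates = []
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            end = start + n
            # Keep the span only if detection flagged something inside it.
            if any(detected[start:end]):
                candidates.append((start, end, "".join(tokens[start:end])))
    return candidates


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative input: an initial segmentation in which the first three
# single-character tokens were flagged as possible parts of an unknown word.
tokens = ["張", "嘉", "惠", "教授"]
detected = [True, True, True, False]

for start, end, text in merge_candidates(tokens, detected):
    print(start, end, text)
```

In a full system, each candidate would be scored by the n-gram classifier of matching length, and a separate rule set would choose among overlapping candidates.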