A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning

Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 97 === Chinese word segmentation is one of the major preprocessing steps in Chinese text processing. Because original Chinese text lacks word boundaries, the main goal of Chinese word segmentation is to identify words. There are two major problems in word seg...

Bibliographic Details
Main Authors: Chieh-Cheng Yang, 楊傑程
Other Authors: Chia-hui Chang
Format: Others
Language: zh-TW
Published: 2009
Online Access: http://ndltd.ncl.edu.tw/handle/28083382194737352825
id ndltd-TW-097NCU05392019
record_format oai_dc
spelling ndltd-TW-097NCU05392019 2016-05-02T04:12:04Z http://ndltd.ncl.edu.tw/handle/28083382194737352825 A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning 應用樣式探勘與機器學習方法於中文未知詞擷取之研究 Chieh-Cheng Yang 楊傑程 Master's National Central University Graduate Institute of Computer Science and Information Engineering 97 Chia-hui Chang 張嘉惠 2009 thesis 48 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Central University === Graduate Institute of Computer Science and Information Engineering === 97 === Chinese word segmentation is one of the major preprocessing steps in Chinese text processing. Because original Chinese text lacks word boundaries, the main goal of Chinese word segmentation is to identify words. There are two major problems in word segmentation: ambiguities and unknown (out-of-vocabulary) words. In this paper, we focus on the Chinese unknown word problem. We use a two-phase approach to solve it: the first phase detects unknown words and the second phase extracts them. In the detection phase, we apply continuity pattern mining to derive a set of rules from a corpus, based on more complete types of patterns. These rules determine whether a Chinese character is a monosyllabic word or part of an unknown word. In the extraction phase, we use machine learning algorithms to determine whether a detected morpheme should be merged with adjacent words to form an unknown word. Our classification models use features based on syntactic, contextual, and statistical information. We construct three classification models (2-gram, 3-gram, and 4-gram), together with rules for resolving overlaps and conflicts. We use the Academia Sinica Balanced Corpus as our experimental data. Without much assistance from hand-crafted rules, our experimental results (F-measure 0.657) are shown to be as good as the results of Academia Sinica (F-measure 0.648). Finally, we also demonstrate the importance of the detection phase in our two-phase approach.
author2 Chia-hui Chang
author_facet Chia-hui Chang
Chieh-Cheng Yang
楊傑程
author Chieh-Cheng Yang
楊傑程
author_sort Chieh-Cheng Yang
title A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning
title_sort two-phase approach to chinese unknown word extraction: application of pattern mining and machine learning
publishDate 2009
url http://ndltd.ncl.edu.tw/handle/28083382194737352825