A Chinese Unknown Words Extraction Model for The Blog Connect
Master's === National Chung Hsing University === Department of Management Information Systems === 102 === Because written Chinese has no spaces to mark word boundaries, the main goal of Chinese word segmentation is to identify words. One of the major problems in word segmentation is the unknown word (an out-of-vocabulary word). Based...
Main Authors: | Jeng Jie Huang (黃政傑) |
---|---|
Other Authors: | Eric Jui-Lin Lu (呂瑞麟) |
Format: | Others |
Language: | zh-TW |
Published: | 2014 |
Online Access: | http://ndltd.ncl.edu.tw/handle/39665245517326867209 |
id |
ndltd-TW-102NCHU5396005 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-102NCHU5396005 2017-10-29T04:34:28Z http://ndltd.ncl.edu.tw/handle/39665245517326867209 A Chinese Unknown Words Extraction Model for The Blog Connect 應用於Blog Connect的中文未知詞擷取模型 Jeng Jie Huang 黃政傑 Master's National Chung Hsing University Department of Management Information Systems 102 Eric Jui-Lin Lu 呂瑞麟 2014 Thesis (學位論文) 36 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === National Chung Hsing University === Department of Management Information Systems === 102 === Because written Chinese has no spaces to mark word boundaries, the main goal of Chinese word segmentation is to identify words. One of the major problems in word segmentation is the unknown word (an out-of-vocabulary word). We observed that the queried keywords collected from Blog Connect are mostly incomplete sentences, whereas existing Chinese unknown word extraction methods are mainly designed for complete sentences. We therefore propose a Chinese unknown word extraction model for Blog Connect. The model takes a two-phase approach: unknown word detection followed by unknown word extraction. In the detection phase, we use frequency and probability features of the queried keywords to establish rules that determine whether a queried keyword contains unknown words. In the extraction phase, we propose a rule-based variant of the bottom-up merging algorithm that extracts unknown words recursively. The experimental results (F-measure of 76.75%) show that our method improves Chinese word segmentation for queried keywords. Of the 988 unknown words in our experimental data, our method extracts 689, whereas CKIP extracts only 573. |
author2 |
Eric Jui-Lin Lu |
author_facet |
Eric Jui-Lin Lu Jeng Jie Huang 黃政傑 |
author |
Jeng Jie Huang 黃政傑 |
spellingShingle |
Jeng Jie Huang 黃政傑 A Chinese Unknown Words Extraction Model for The Blog Connect |
author_sort |
Jeng Jie Huang |
title |
A Chinese Unknown Words Extraction Model for The Blog Connect |
title_sort |
chinese unknown words extraction model for the blog connect |
publishDate |
2014 |
url |
http://ndltd.ncl.edu.tw/handle/39665245517326867209 |
work_keys_str_mv |
AT jengjiehuang achineseunknownwordsextractionmodelfortheblogconnect AT huángzhèngjié achineseunknownwordsextractionmodelfortheblogconnect AT jengjiehuang yīngyòngyúblogconnectdezhōngwénwèizhīcíxiéqǔmóxíng AT huángzhèngjié yīngyòngyúblogconnectdezhōngwénwèizhīcíxiéqǔmóxíng AT jengjiehuang chineseunknownwordsextractionmodelfortheblogconnect AT huángzhèngjié chineseunknownwordsextractionmodelfortheblogconnect |
_version_ |
1718557707554783232 |
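
The extraction counts quoted in the description can be turned into rates with a few lines of arithmetic. The sketch below is only an illustration: it assumes that the reported counts (689 and 573 out of 988) are correctly extracted unknown words, so that each ratio can be read as recall. The function and variable names are hypothetical and do not come from the thesis. Precision is not stated in the record, so the reported F-measure of 76.75% cannot be recomputed here; `f_measure` only shows the standard definition.

```python
def recall(found: int, total: int) -> float:
    """Fraction of the known unknown words that a method extracted."""
    if total <= 0:
        raise ValueError("total must be positive")
    return found / total


def f_measure(precision: float, recall_value: float) -> float:
    """Standard balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall_value == 0:
        return 0.0
    return 2 * precision * recall_value / (precision + recall_value)


# Counts taken from the record's description.
TOTAL_UNKNOWN_WORDS = 988
PROPOSED_FOUND = 689
CKIP_FOUND = 573

# Assumption: every reported word was a correct extraction.
proposed_recall = recall(PROPOSED_FOUND, TOTAL_UNKNOWN_WORDS)
ckip_recall = recall(CKIP_FOUND, TOTAL_UNKNOWN_WORDS)
extra_words = PROPOSED_FOUND - CKIP_FOUND

print(f"Proposed method: {proposed_recall:.2%}")
print(f"CKIP:            {ckip_recall:.2%}")
print(f"Additional unknown words extracted: {extra_words}")
```

Under that assumption, the proposed method recovers roughly 70% of the unknown words against roughly 58% for CKIP, a difference of 116 words.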