A Chinese Unknown Words Extraction Model for The Blog Connect

Master's thesis === National Chung Hsing University === Department of Management Information Systems === 102 === Because original Chinese text contains no blanks to mark word boundaries, the main goal of Chinese word segmentation is to identify words. One of the major problems in word segmentation is unknown words (out-of-vocabulary words). Based...


Bibliographic Details
Main Authors:	Jeng Jie Huang, 黃政傑
Other Authors:	Eric Jui-Lin Lu
Format:	Others
Language:	zh-TW
Published:	2014
Online Access:	http://ndltd.ncl.edu.tw/handle/39665245517326867209

id	ndltd-TW-102NCHU5396005
record_format	oai_dc
spelling	ndltd-TW-102NCHU53960052017-10-29T04:34:28Z http://ndltd.ncl.edu.tw/handle/39665245517326867209 A Chinese Unknown Words Extraction Model for The Blog Connect 應用於Blog Connect的中文未知詞擷取模型 Jeng Jie Huang 黃政傑 Master's thesis, National Chung Hsing University, Department of Management Information Systems, 102 Because original Chinese text contains no blanks to mark word boundaries, the main goal of Chinese word segmentation is to identify words. One of the major problems in word segmentation is unknown words (out-of-vocabulary words). Based on our observation, the queried keywords collected from Blog Connect are mostly incomplete sentences. However, Chinese unknown word extraction methods are mainly designed to process complete sentences. Therefore, we propose a Chinese unknown word extraction model for Blog Connect. We use a two-phase approach to solve the unknown word problem: the first phase detects unknown words and the second phase extracts them. In the detection phase, we use the frequency and probability characteristics of queried keywords to establish rules. These rules determine whether a queried keyword contains unknown words. In the extraction phase, we propose a rule-based variant of the bottom-up merging algorithm that extracts unknown words recursively. The experimental results (F-measure 76.75%) show that our method improves the performance of Chinese word segmentation for queried keywords. Our experimental data contain 988 unknown words; our method extracts 689 of them, whereas CKIP extracts only 573. Eric Jui-Lin Lu 呂瑞麟 2014 學位論文 ; thesis 36 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's thesis === National Chung Hsing University === Department of Management Information Systems === 102 === Because original Chinese text contains no blanks to mark word boundaries, the main goal of Chinese word segmentation is to identify words. One of the major problems in word segmentation is unknown words (out-of-vocabulary words). Based on our observation, the queried keywords collected from Blog Connect are mostly incomplete sentences. However, Chinese unknown word extraction methods are mainly designed to process complete sentences. Therefore, we propose a Chinese unknown word extraction model for Blog Connect. We use a two-phase approach to solve the unknown word problem: the first phase detects unknown words and the second phase extracts them. In the detection phase, we use the frequency and probability characteristics of queried keywords to establish rules. These rules determine whether a queried keyword contains unknown words. In the extraction phase, we propose a rule-based variant of the bottom-up merging algorithm that extracts unknown words recursively. The experimental results (F-measure 76.75%) show that our method improves the performance of Chinese word segmentation for queried keywords. Our experimental data contain 988 unknown words; our method extracts 689 of them, whereas CKIP extracts only 573.
author2	Eric Jui-Lin Lu
author_facet	Eric Jui-Lin Lu Jeng Jie Huang 黃政傑
author	Jeng Jie Huang 黃政傑
spellingShingle	Jeng Jie Huang 黃政傑 A Chinese Unknown Words Extraction Model for The Blog Connect
author_sort	Jeng Jie Huang
title	A Chinese Unknown Words Extraction Model for The Blog Connect
title_short	A Chinese Unknown Words Extraction Model for The Blog Connect
title_full	A Chinese Unknown Words Extraction Model for The Blog Connect
title_fullStr	A Chinese Unknown Words Extraction Model for The Blog Connect
title_full_unstemmed	A Chinese Unknown Words Extraction Model for The Blog Connect
title_sort	chinese unknown words extraction model for the blog connect
publishDate	2014
url	http://ndltd.ncl.edu.tw/handle/39665245517326867209
work_keys_str_mv	AT jengjiehuang achineseunknownwordsextractionmodelfortheblogconnect AT huángzhèngjié achineseunknownwordsextractionmodelfortheblogconnect AT jengjiehuang yīngyòngyúblogconnectdezhōngwénwèizhīcíxiéqǔmóxíng AT huángzhèngjié yīngyòngyúblogconnectdezhōngwénwèizhīcíxiéqǔmóxíng AT jengjiehuang chineseunknownwordsextractionmodelfortheblogconnect AT huángzhèngjié chineseunknownwordsextractionmodelfortheblogconnect
_version_	1718557707554783232
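
The unknown-word counts reported in the description can be turned into recall figures with a short calculation. The sketch below is illustrative only: the function name `recall` and the constant names are assumptions, not taken from the thesis, and it assumes both systems were evaluated against the same 988 unknown words. It does not attempt to reproduce the reported F-measure, which also depends on precision figures that this record does not give.

```python
def recall(found: int, total: int) -> float:
    """Fraction of the reference unknown words that a system extracted."""
    if total <= 0:
        raise ValueError("total must be positive")
    return found / total


# Counts quoted in the record's description.
TOTAL_UNKNOWN = 988    # unknown words in the experimental data
FOUND_PROPOSED = 689   # extracted by the proposed model
FOUND_CKIP = 573       # extracted by CKIP

recall_proposed = recall(FOUND_PROPOSED, TOTAL_UNKNOWN)
recall_ckip = recall(FOUND_CKIP, TOTAL_UNKNOWN)
extra_words = FOUND_PROPOSED - FOUND_CKIP  # additional words found vs. CKIP

print(f"Proposed model recall: {recall_proposed:.2%}")
print(f"CKIP recall:           {recall_ckip:.2%}")
print(f"Additional unknown words extracted: {extra_words}")
```

Under these assumptions the proposed model recovers roughly 70% of the unknown words and CKIP roughly 58%, a difference of 116 words.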
