A Chinese Unknown Words Extraction Model for The Blog Connect

Bibliographic Details
Main Author: Jeng Jie Huang (黃政傑)
Other Authors: Eric Jui-Lin Lu (呂瑞麟)
Format: Others
Language: zh-TW
Published: 2014
Online Access: http://ndltd.ncl.edu.tw/handle/39665245517326867209
Record ID: ndltd-TW-102NCHU5396005
Original Title: 應用於Blog Connect的中文未知詞擷取模型
Type: Thesis
Collection: NDLTD
Description: Master's thesis, National Chung Hsing University (國立中興大學), Department of Management Information Systems (資訊管理學系所), academic year 102.

Since there are no blanks to mark word boundaries in original Chinese texts, the main goal of Chinese word segmentation is the identification of words. One of the major problems in word segmentation is unknown words (occurrences of out-of-vocabulary words). Based on our observation, the queried keywords collected from Blog Connect are mostly incomplete sentences. However, Chinese unknown word extraction methods are mainly designed for processing complete sentences. Therefore, we propose a Chinese unknown word extraction model for Blog Connect. We use a two-phase approach to solve the unknown word problem: the first phase detects unknown words and the second phase extracts them. In the detection phase, we use the frequency and probability of characteristics of queried keywords to establish rules. These rules determine whether a queried keyword contains unknown words. In the extraction phase, we propose a rule-based variant of the bottom-up merging algorithm that extracts unknown words recursively. The experimental results (F-measure 76.75%) show that our method improves the performance of Chinese word segmentation for queried keywords. Of the 988 unknown words in our experimental data, our method extracts 689, whereas CKIP extracts only 573.
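
The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the record does not give the actual detection rules, merging rules, or data. The toy lexicon, the query log, the example keyword, the forward-maximum-matching segmenter, and the "merge a pair if its concatenation appears in at least `min_count` logged queries" rule are all assumptions made for illustration, and every function name is hypothetical.

```python
# Hypothetical toy data; the thesis's real lexicon and Blog Connect log are not given.
LEXICON = {"台北", "美食", "推薦"}
QUERY_LOG = ["台北美食", "鼎泰豐推薦", "鼎泰豐台北", "美食推薦", "鼎泰豐"]


def segment(text, lexicon, max_len=4):
    """Forward maximum matching; text not covered by the lexicon falls out as single characters."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def contains_unknown(tokens, lexicon):
    """Phase 1 (stand-in rule): flag the keyword if any token is missing from the lexicon."""
    return any(tok not in lexicon for tok in tokens)


def merge_bottom_up(tokens, lexicon, log, min_count=2):
    """Phase 2: recursively merge the best-supported adjacent pair involving an out-of-lexicon token."""
    best_i, best_count = None, 0
    for i, (a, b) in enumerate(zip(tokens, tokens[1:])):
        if a in lexicon and b in lexicon:
            continue  # two known words are never merged in this sketch
        # Assumed support measure: number of logged queries containing the concatenation.
        count = sum((a + b) in query for query in log)
        if count >= min_count and count > best_count:
            best_i, best_count = i, count
    if best_i is None:
        return tokens  # no pair has enough support, so the recursion stops
    merged = tokens[:best_i] + [tokens[best_i] + tokens[best_i + 1]] + tokens[best_i + 2:]
    return merge_bottom_up(merged, lexicon, log, min_count)


def extract_unknown_words(keyword, lexicon, log):
    """Run detection, then extraction; return merged multi-character tokens not in the lexicon."""
    tokens = segment(keyword, lexicon)
    if not contains_unknown(tokens, lexicon):
        return []
    merged = merge_bottom_up(tokens, lexicon, log)
    return [tok for tok in merged if tok not in lexicon and len(tok) > 1]


print(extract_unknown_words("鼎泰豐推薦", LEXICON, QUERY_LOG))
print(extract_unknown_words("台北美食", LEXICON, QUERY_LOG))
```

In this toy setting the first keyword is segmented into three single characters plus a known word, and those characters are merged pairwise until no adjacent pair has enough support in the log. The second keyword consists only of lexicon words, so the detection step returns early and nothing is extracted.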