A pattern-based approach to information extraction on Chinese text

Master's === National Chengchi University === Department of Computer Science === 92 === With the explosion of the World Wide Web, information extraction has become a major technical area. The goal of information extraction is to transform unstructured text into structured data on a specific topic. It involves analyzing, filtering and extracting releva...

Bibliographic Details
Main Authors:	Chia-Wei Weng, 翁嘉緯
Other Authors:	Jyi-Shane Liu
Format:	Others
Language:	zh-TW
Published:	2003
Online Access:	http://ndltd.ncl.edu.tw/handle/86605075588095500351

id	ndltd-TW-092NCCU5394001
record_format	oai_dc
spelling	ndltd-TW-092NCCU5394001 2015-10-13T16:23:07Z http://ndltd.ncl.edu.tw/handle/86605075588095500351 A pattern-based approach to information extraction on Chinese text 以型態辨識為主的中文資訊擷取技術研究 Chia-Wei Weng 翁嘉緯 Master's, National Chengchi University, Department of Computer Science, 92 Jyi-Shane Liu 劉吉軒 2003 thesis 144 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Chengchi University === Department of Computer Science === 92 === With the explosion of the World Wide Web, information extraction has become a major technical area. The goal of information extraction is to transform unstructured text into structured data on a specific topic. It involves analyzing, filtering, and extracting the relevant parts of a text and their corresponding meaning. Most information extraction research focuses on English text, while research on Chinese information extraction has received far less attention. Considering that one-fifth of the world's population is Chinese-speaking, Chinese information extraction technology will become increasingly important. Chinese differs from English in many respects. In English, words are separated by spaces, so computers can easily distinguish each word in the input string. In Chinese, there are no spaces between characters to segment them into meaningful words. A common solution is to match the characters of the input string against the words in a dictionary to find the proper word boundaries. Yet characters combine into words with much flexibility and ambiguity, so word segmentation is prone to errors. In this thesis, we propose an approach to Chinese information extraction based on pattern matching and finite state automata that does not rely on word segmentation or part-of-speech tagging. The approach was evaluated with "government personnel directives in official gazettes" as test data, and it achieved 98% precision and 97% recall. Moreover, the approach was extended to other data domains. The results show initial progress toward a multi-domain Chinese information extraction system.
author2	Jyi-Shane Liu
author	Chia-Wei Weng 翁嘉緯
title	A pattern-based approach to information extraction on Chinese text
publishDate	2003
url	http://ndltd.ncl.edu.tw/handle/86605075588095500351
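
The abstract describes extracting fields by pattern matching with finite state automata directly over characters, with no word segmentation step. A minimal sketch of that idea follows. It is not the thesis's actual method: the directive pattern (任命 name 為 position 。), the sample sentence, the state names, and the function `extract_appointments` are all hypothetical and chosen only for illustration.

```python
# Hypothetical pattern: 任命 <name> 為 <position> 。
# States of a small hand-written finite-state automaton.
SEEK, NAME, POSITION = range(3)

TRIGGER = "任命"  # "appoint"; assumed trigger phrase
LINK = "為"       # "as"; assumed separator between name and position
END = "。"        # Chinese full stop closes the pattern


def extract_appointments(text):
    """Scan raw characters and return a list of name/position records."""
    records = []
    state = SEEK
    name, position = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if state == SEEK:
            # Look for the trigger phrase starting at the current character.
            if text.startswith(TRIGGER, i):
                state = NAME
                name, position = [], []
                i += len(TRIGGER)
                continue
        elif state == NAME:
            if ch == LINK:
                state = POSITION
            elif ch == END:
                state = SEEK  # pattern broken; discard the partial match
            else:
                name.append(ch)
        elif state == POSITION:
            if ch == END:
                records.append(
                    {"name": "".join(name), "position": "".join(position)}
                )
                state = SEEK
            else:
                position.append(ch)
        i += 1
    return records


# Made-up sentence in the style of a personnel directive.
sample = "茲任命王小明為教育部科長。此令。"
print(extract_appointments(sample))
```

On the made-up sample, the scan yields one record with name 王小明 and position 教育部科長. The automaton never needs to decide where words begin or end; it only reacts to the trigger, the separator, and the full stop, which is the property the abstract emphasizes. A real system would need many more patterns and would have to handle names or titles that contain the separator character.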
