A Framework for Web NER Model Training Based on Semi-supervised Learning

PhD === National Central University === Department of Computer Science and Information Engineering === Academic Year 107 === Named entity recognition (NER) is an important task in natural language understanding because it extracts the key entities (e.g., person, organization, location, date, and number) and objects (e.g., product, song, movie, and activity names) mentioned in texts. These entities are essential to numerous text applications, such as those used for analyzing public opinion on social networks, and to the interfaces used to conduct interactive conversations and provide intelligent customer service. However, existing natural language processing (NLP) tools (such as the Stanford named entity recognizer) recognize only general named entities or require annotated training examples and feature engineering for supervised model construction. Since not all languages or entities have public NER support, constructing a framework for NER model training is essential for low-resource language or entity information extraction (IE). Building a customized NER model often requires a significant amount of time to prepare, annotate, and evaluate training/testing data, as well as language-dependent feature engineering. Existing studies rely on annotated training data; however, large annotated datasets are expensive to obtain, which limits recognition effectiveness. In this thesis, we examine the problem of developing a framework that prepares a training corpus from the web with known entities for custom NER model training via semi-supervised learning. We consider both the effectiveness and the efficiency of automatic labeling and language-independent feature mining in preparing and annotating the training data.

The major challenge of automatic labeling lies in choosing labeling strategies that avoid the false positive and false negative examples caused by short and long seeds, and that keep labeling time manageable given a large corpus and many seed entities. Distant supervision, which collects training sentences from search snippets containing known entities, is not new; however, the efficiency of automatic labeling becomes critical when dealing with a large number of known entities (e.g., 550k) and sentences (e.g., 2M). Additionally, to address the problem of language-dependent feature mining for supervised learning, we modify tri-training for sequence labeling and derive a proper initialization for large-dataset training to improve entity recognition performance on a large corpus. We conduct experiments on five entity recognition tasks, including Chinese person names, food names, locations, points of interest (POIs), and activity names, to demonstrate the improvements achieved with the proposed web NER model construction framework.
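The automatic labeling step described in the abstract (matching known seed entities against collected sentences to produce tagged training data) could be sketched roughly as follows. This is a minimal illustration, not the thesis's actual labeling strategy; the seed list, sentences, and the single `ENT` type are invented for illustration, and real systems must also handle the short/long-seed false-positive and false-negative issues the abstract mentions.

```python
# Distant-supervision-style automatic labeling: match known seed entities
# against raw sentences to produce BIO-tagged training examples.
# Seed lists and sentences below are invented for illustration.

def auto_label(tokens, seeds):
    """Tag tokens with BIO labels by longest-match against seed entities."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest span starting at i that matches a known entity.
        matched = 0
        for j in range(len(tokens), i, -1):
            if " ".join(tokens[i:j]) in seeds:
                matched = j - i
                break
        if matched:
            labels[i] = "B-ENT"
            for k in range(i + 1, i + matched):
                labels[k] = "I-ENT"
            i += matched
        else:
            i += 1
    return labels

seeds = {"New York", "Taipei 101"}
tokens = "I visited Taipei 101 and New York last year".split()
print(list(zip(tokens, auto_label(tokens, seeds))))
```

Preferring the longest match avoids tagging only a prefix of a multi-token entity, which is one source of the false negatives the abstract attributes to short seeds.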

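The tri-training idea the abstract builds on can be sketched in a highly simplified form: three classifiers are trained on separate labeled sets, and each one is retrained on unlabeled examples that the other two agree on. The toy 1-nearest-neighbor "taggers" and data below are invented stand-ins; the thesis adapts tri-training to sequence labeling with real NER models and a specific initialization, neither of which is shown here.

```python
# A highly simplified sketch of the tri-training loop: each classifier is
# taught the unlabeled examples on which the other two classifiers agree.
# The toy 1-NN "taggers" and integer features are invented for illustration.

def nn_predict(train, x):
    # 1-nearest-neighbor on integer features (toy stand-in for a tagger).
    return min(train, key=lambda ex: abs(ex[0] - x))[1]

def tri_train(labeled_views, unlabeled, rounds=3):
    views = [list(v) for v in labeled_views]  # one training set per classifier
    for _ in range(rounds):
        additions = [[] for _ in views]
        for x in unlabeled:
            preds = [nn_predict(v, x) for v in views]
            for i in range(3):
                j, k = (i + 1) % 3, (i + 2) % 3
                # If the other two classifiers agree, teach classifier i.
                if preds[j] == preds[k]:
                    additions[i].append((x, preds[j]))
        for i in range(3):
            views[i].extend(additions[i])
    return views

labeled = [[(0, "O"), (10, "ENT")], [(1, "O"), (9, "ENT")], [(2, "O"), (8, "ENT")]]
views = tri_train(labeled, unlabeled=[3, 7])
print(nn_predict(views[0], 7))  # "ENT"
```

The appeal for the thesis's setting is that agreement between two models substitutes for hand-crafted, language-dependent features when deciding which automatically labeled sentences are trustworthy enough to train on.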
Bibliographic Details
Main Authors: Chien-Lung Chou, 周建龍
Other Authors: Chia-Hui Chang
Format: Others
Language: en_US
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/7s3a53