A Framework for Web NER Model Training Based on Semi-supervised Learning

PhD === National Central University === Department of Computer Science and Information Engineering === Academic Year 107 === Named entity recognition (NER) is an important task in natural language understanding because it extracts the key entities (e.g., person, organization, location, date, and number) and objects (e.g., product, song, movie, and activity names) mentioned in texts. These entities are essential to numerous text applications, such as those used for analyzing public opinion on social networks, and to the interfaces used to conduct interactive conversations and provide intelligent customer service. However, existing natural language processing (NLP) tools (such as the Stanford named entity recognizer) recognize only general named entities or require annotated training examples and feature engineering for supervised model construction. Since not all languages or entities have public NER support, constructing a framework for NER model training is essential for low-resource language or entity information extraction (IE). Building a customized NER model often requires a significant amount of time to prepare, annotate, and evaluate training/testing data, as well as language-dependent feature engineering. Existing studies rely on annotated training data; however, large annotated datasets are expensive to obtain, which limits recognition effectiveness. In this thesis, we examine the problem of developing a framework that prepares a training corpus from the web with known entities for custom NER model training via semi-supervised learning. We consider both the effectiveness and the efficiency of automatic labeling and language-independent feature mining in preparing and annotating the training data.

The major challenge of automatic labeling lies in choosing labeling strategies that avoid the false positive and false negative examples caused by short and long seeds, and that keep labeling time manageable given a large corpus and many seed entities. Distant supervision, which collects training sentences from search snippets containing known entities, is not new; however, the efficiency of automatic labeling becomes critical when dealing with a large number of known entities (e.g., 550k) and sentences (e.g., 2M). Additionally, to address the problem of language-dependent feature mining for supervised learning, we modify tri-training for sequence labeling and derive a proper initialization for large-dataset training to improve entity recognition performance on a large corpus. We conduct experiments on five entity recognition tasks, including Chinese person names, food names, locations, points of interest (POIs), and activity names, to demonstrate the improvements achieved with the proposed web NER model construction framework.
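The automatic labeling step described in the abstract (matching known seed entities against collected sentences to produce tagged training data) could be sketched roughly as follows. This is a minimal illustration, not the thesis's actual labeling strategy; the seed list, sentences, and the single `ENT` type are invented for illustration, and real systems must also handle the short/long-seed false-positive and false-negative issues the abstract mentions.

```python
# Distant-supervision-style automatic labeling: match known seed entities
# against raw sentences to produce BIO-tagged training examples.
# Seed lists and sentences below are invented for illustration.

def auto_label(tokens, seeds):
    """Tag tokens with BIO labels by longest-match against seed entities."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest span starting at i that matches a known entity.
        matched = 0
        for j in range(len(tokens), i, -1):
            if " ".join(tokens[i:j]) in seeds:
                matched = j - i
                break
        if matched:
            labels[i] = "B-ENT"
            for k in range(i + 1, i + matched):
                labels[k] = "I-ENT"
            i += matched
        else:
            i += 1
    return labels

seeds = {"New York", "Taipei 101"}
tokens = "I visited Taipei 101 and New York last year".split()
print(list(zip(tokens, auto_label(tokens, seeds))))
```

Preferring the longest match avoids tagging only a prefix of a multi-token entity, which is one source of the false negatives the abstract attributes to short seeds.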

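The tri-training idea the abstract builds on can be sketched in a highly simplified form: three classifiers are trained on separate labeled sets, and each one is retrained on unlabeled examples that the other two agree on. The toy 1-nearest-neighbor "taggers" and data below are invented stand-ins; the thesis adapts tri-training to sequence labeling with real NER models and a specific initialization, neither of which is shown here.

```python
# A highly simplified sketch of the tri-training loop: each classifier is
# taught the unlabeled examples on which the other two classifiers agree.
# The toy 1-NN "taggers" and integer features are invented for illustration.

def nn_predict(train, x):
    # 1-nearest-neighbor on integer features (toy stand-in for a tagger).
    return min(train, key=lambda ex: abs(ex[0] - x))[1]

def tri_train(labeled_views, unlabeled, rounds=3):
    views = [list(v) for v in labeled_views]  # one training set per classifier
    for _ in range(rounds):
        additions = [[] for _ in views]
        for x in unlabeled:
            preds = [nn_predict(v, x) for v in views]
            for i in range(3):
                j, k = (i + 1) % 3, (i + 2) % 3
                # If the other two classifiers agree, teach classifier i.
                if preds[j] == preds[k]:
                    additions[i].append((x, preds[j]))
        for i in range(3):
            views[i].extend(additions[i])
    return views

labeled = [[(0, "O"), (10, "ENT")], [(1, "O"), (9, "ENT")], [(2, "O"), (8, "ENT")]]
views = tri_train(labeled, unlabeled=[3, 7])
print(nn_predict(views[0], 7))  # "ENT"
```

The appeal for the thesis's setting is that agreement between two models substitutes for hand-crafted, language-dependent features when deciding which automatically labeled sentences are trustworthy enough to train on.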
Bibliographic Details
Main Authors: Chien-Lung Chou, 周建龍
Other Authors: Chia-Hui Chang
Format: Others
Language: en_US
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/7s3a53