A Framework for Web NER Model Training based on Semi-supervised Learning

Bibliographic Details
Main Authors: Chien-Lung Chou, 周建龍
Other Authors: Chia-Hui Chang
Format: Others
Language: en_US
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/7s3a53
Description
Summary: Doctoral === National Central University === Department of Computer Science and Information Engineering === 107 === Named entity recognition (NER) is an important task in natural language understanding because it extracts the key entities (e.g., person, organization, location, date, and number) and objects (e.g., product, song, movie, and activity name) mentioned in texts. These entities are essential to numerous text applications, such as analyzing public opinion on social networks, and to the interfaces used to conduct interactive conversations and provide intelligent customer services. However, existing natural language processing (NLP) tools (such as the Stanford Named Entity Recognizer) recognize only general named entities or require annotated training examples and feature engineering for supervised model construction. Since not all languages or entity types have public NER support, a framework for NER model training is essential for information extraction (IE) on low-resource languages or entity types. Building a customized NER model often requires a significant amount of time to prepare, annotate, and evaluate the training and testing data and to carry out language-dependent feature engineering. Existing studies rely on annotated training data; however, large annotated datasets are expensive to obtain, which limits recognition effectiveness. In this thesis, we examine the problem of developing a framework that prepares a training corpus from the web, using known entities, for custom NER model training via semi-supervised learning. We consider the effectiveness and efficiency of automatic labeling and language-independent feature mining for preparing and annotating the training data. The major challenges of automatic labeling are choosing labeling strategies that avoid the false positive and false negative examples caused by short and long seeds, and reducing the long labeling time caused by a large corpus and a large number of seed entities.
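As a rough illustration of the automatic labeling problem described in the summary, the following is a minimal sketch of seed-based labeling with greedy longest-match and BIO tags. It is not the thesis's labeling strategy: the function name `label_tokens`, the `min_len` cutoff for dropping short seeds, and the example tokens and seeds are all hypothetical.

```python
def label_tokens(tokens, seeds, min_len=1):
    """Tag tokens with B/I/O by greedy longest-match against known seed entities.

    tokens:  list of tokens (for Chinese text, these could be single characters)
    seeds:   iterable of token tuples, each one a known entity (assumed format)
    min_len: seeds shorter than this are ignored, a simple guard against
             false positives from very short seeds (hypothetical heuristic)
    """
    usable = {tuple(s) for s in seeds if len(s) >= min_len}
    max_len = max((len(s) for s in usable), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate first so a long seed wins over a
        # shorter seed that it contains.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in usable:
                match_len = n
                break
        if match_len:
            tags[i] = "B"
            for j in range(i + 1, i + match_len):
                tags[j] = "I"
            i += match_len
        else:
            i += 1
    return tags


# Hypothetical example data
tokens = ["Ann", "visited", "New", "York", "City", "today"]
seeds = {("Ann",), ("New", "York"), ("New", "York", "City")}
print(label_tokens(tokens, seeds))
print(label_tokens(tokens, seeds, min_len=2))
```

Storing the seeds in a hash set keeps each lookup cheap, but the inner loop still tries up to `max_len` candidates per position; with hundreds of thousands of seeds and millions of sentences, a more careful index would likely be needed.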
Distant supervision, which collects training sentences from search snippets with known entities, is not new; however, the efficiency of automatic labeling becomes critical when dealing with a large number of known entities (e.g., 550k) and sentences (e.g., 2M). Additionally, to address the language-dependent feature mining for supervised learning, we modify tri-training for sequence labeling and derive a proper initialization for large dataset training to improve the entity recognition performance for a large corpus. We conduct experiments on five types of entity recognition tasks, including Chinese person names, food names, locations, points of interest (POIs), and activity names, to demonstrate the improvements with the proposed web NER model construction framework.
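The summary does not say how tri-training is modified for sequence labeling. For orientation only, the sketch below shows the generic tri-training agreement rule: each of three taggers receives pseudo-labeled sentences on which the other two agree. Treating agreement as an exact match of the whole tag sequence is an assumption made here, and the function names and toy taggers are hypothetical.

```python
def agreed_examples(unlabeled, tagger_j, tagger_k):
    """Return (tokens, tags) pairs on which two taggers predict the same tag sequence."""
    selected = []
    for tokens in unlabeled:
        tags_j = tagger_j(tokens)
        tags_k = tagger_k(tokens)
        # Assumption: agreement means the full tag sequences are identical.
        if tags_j == tags_k:
            selected.append((tokens, tags_j))
    return selected


def tri_training_round(taggers, unlabeled):
    """For each of three taggers, collect pseudo-labels agreed on by the other two."""
    additions = []
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        additions.append(agreed_examples(unlabeled, taggers[j], taggers[k]))
    return additions


# Toy stand-ins for trained sequence taggers (hypothetical)
def capital_tagger(tokens):
    return ["B" if t[:1].isupper() else "O" for t in tokens]

def all_o_tagger(tokens):
    return ["O"] * len(tokens)

unlabeled = [["Ann", "sings"], ["it", "rains"]]
additions = tri_training_round([capital_tagger, capital_tagger, all_o_tagger], unlabeled)
for i, extra in enumerate(additions):
    print(i, extra)
```

In a full loop, each tagger would be retrained on its original labeled data plus its additions, and the round repeated until the taggers stop changing; those steps are omitted here.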