Grammatical Error Identification for Learners of Chinese as a Foreign Language
This thesis aims to build a system to tackle the task of diagnosing the grammatical errors in sentences written by learners of Chinese as a foreign language with the help of the CRF model (Conditional Random Field). The goal of this task is threefold: 1) identify if the sentence is correct or not,...
Main Author: | Xiang, Yang |
---|---|
Format: | Others |
Language: | English |
Published: | Uppsala universitet, Institutionen för lingvistik och filologi, 2018 |
Subjects: | Chinese; grammatical error identification; Languages and Literature; Språk och litteratur |
Online Access: | http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-361927 |
id |
ndltd-UPSALLA1-oai-DiVA.org-uu-361927 |
---|---|
record_format |
oai_dc |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
topic |
Chinese; grammatical error identification; Languages and Literature; Språk och litteratur |
description |
This thesis builds a system that diagnoses grammatical errors in sentences written by learners of Chinese as a foreign language, using a Conditional Random Field (CRF) model. The task has three goals: 1) decide whether a sentence is correct, 2) identify the specific error types in the sentence, and 3) locate the identified errors. The thesis approaches Chinese grammatical error diagnosis as a sequence tagging problem. The data and evaluation tool come from the shared tasks on Chinese Grammatical Error Diagnosis in 2016 and 2017. First, we train the model on character and POS-tag features to build the baseline system. We then observe that the data contains overlapping errors. To handle them, we try three approaches: filtering out the problematic data, assigning an encoding to characters with more than one label, and building a separate classifier for each error type. We then increase the amount of training data and add syntactic features. The results show that both filtering out the problematic data and adding syntactic features improve the results. In addition, a difference between the domains of the training data and the test data can hurt performance to a large extent. |
author |
Xiang, Yang |
title |
Grammatical Error Identification for Learners of Chinese as a Foreign Language |
publisher |
Uppsala universitet, Institutionen för lingvistik och filologi |
publishDate |
2018 |
url |
http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-361927 |
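
The description treats error diagnosis as sequence tagging and names three ways of handling overlapping errors. The following is a minimal sketch of that idea, not the thesis's implementation. It assumes error annotations arrive as (start, end, type) spans with 1-based inclusive character positions, uses a BIO labelling scheme, and borrows the type codes S and W purely as examples; the span format, the BIO scheme, the "+"-joined combined label, and every function name here are illustrative assumptions that the record does not confirm.

```python
from typing import Dict, List, Tuple

# Assumed annotation format: (start, end, error_type), 1-based and inclusive.
Span = Tuple[int, int, str]


def spans_to_labels(sentence: str, spans: List[Span]) -> List[List[str]]:
    """Collect, for every character, all BIO labels whose span covers it."""
    labels: List[List[str]] = [[] for _ in sentence]
    for start, end, etype in spans:
        for pos in range(start, end + 1):
            prefix = "B" if pos == start else "I"
            labels[pos - 1].append(f"{prefix}-{etype}")
    return labels


def has_overlap(labels: List[List[str]]) -> bool:
    """Approach 1 (filtering): flag sentences where a character has 2+ labels."""
    return any(len(char_labels) > 1 for char_labels in labels)


def combine_labels(labels: List[List[str]]) -> List[str]:
    """Approach 2 (hypothetical encoding): join multiple labels into one tag."""
    return ["+".join(sorted(cl)) if cl else "O" for cl in labels]


def split_by_type(labels: List[List[str]], types: List[str]) -> Dict[str, List[str]]:
    """Approach 3: one tag sequence per error type, for separate classifiers."""
    per_type: Dict[str, List[str]] = {}
    for etype in types:
        suffix = "-" + etype
        per_type[etype] = [
            next((lab for lab in cl if lab.endswith(suffix)), "O") for cl in labels
        ]
    return per_type


# Placeholder sentence of six characters with two spans that overlap at position 4.
example_labels = spans_to_labels("ABCDEF", [(2, 4, "S"), (4, 5, "W")])
print(has_overlap(example_labels))
print(combine_labels(example_labels))
print(split_by_type(example_labels, ["S", "W"]))
```

In this sketch the filtering approach would simply drop any training sentence for which `has_overlap` returns true, while the other two functions produce tag sequences that a CRF toolkit could consume alongside character and POS-tag features.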