A Pipeline for Automatic Lexical Normalization of Swedish Student Writings
In this thesis, we aim to explore the combination of different lexical normalization methods and provide a practical lexical normalization pipeline for Swedish student writings within the framework of SWEGRAM (Näsman et al., 2017). An important improvement in my implementation is that the pipeline de...
Main Author: | |
---|---|
Format: | Others |
Language: | English |
Published: | Uppsala universitet, Institutionen för lingvistik och filologi, 2018 |
Subjects: | |
Online Access: | http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-352450 |
id |
ndltd-UPSALLA1-oai-DiVA.org-uu-352450 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-UPSALLA1-oai-DiVA.org-uu-352450; 2018-06-12T06:20:30Z; A Pipeline for Automatic Lexical Normalization of Swedish Student Writings; eng; Liu, Yuhan; Uppsala universitet, Institutionen för lingvistik och filologi; 2018; Lexical normalization; Phonetic algorithm for Swedish; Language Technology (Computational Linguistics); Språkteknologi (språkvetenskaplig databehandling); Student thesis; info:eu-repo/semantics/bachelorThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-352450; application/pdf; info:eu-repo/semantics/openAccess |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
topic |
Lexical normalization; Phonetic algorithm for Swedish; Language Technology (Computational Linguistics); Språkteknologi (språkvetenskaplig databehandling) |
description |
In this thesis, we explore combinations of different lexical normalization methods and provide a practical lexical normalization pipeline for Swedish student writings within the framework of SWEGRAM (Näsman et al., 2017). A key improvement in our implementation is that the pipeline design takes into account the distinctive morphological and phonological characteristics of the Swedish language. This kind of localization makes the system more robust for Swedish at the cost of being less applicable to other languages in similar tasks. The core of the localization lies in a phonetic algorithm we designed specifically for Swedish and a compound-processing step that handles Swedish compounding. The proposed pipeline consists of four steps: preprocessing, identification of out-of-vocabulary words, generation of normalization candidates, and candidate selection. For each step, we use different approaches. We perform experiments on the Uppsala Corpus of Student Writings (UCSW) (Megyesi et al., 2016) and evaluate the results in terms of precision, recall, and accuracy. The techniques applied to the raw data and their impact on the final result are presented. Our evaluation shows that the pipeline can be useful in the lexical normalization task and that our phonetic algorithm is effective for Swedish. |
author |
Liu, Yuhan |
title |
A Pipeline for Automatic Lexical Normalization of Swedish Student Writings |
publisher |
Uppsala universitet, Institutionen för lingvistik och filologi |
publishDate |
2018 |
url |
http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-352450 |
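
The four steps named in the description (preprocessing, identification of out-of-vocabulary words, generation of normalization candidates, candidate selection) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the toy lexicon, the function names, and the use of `difflib` string similarity in place of the Swedish phonetic algorithm and compound processing are all assumptions made for illustration only.

```python
import difflib
import re

# Hypothetical toy lexicon (assumption); the record does not describe the real one.
LEXICON = {"jag", "tycker", "om", "skolan", "mycket", "och", "det", "är", "bra"}


def preprocess(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def is_oov(token):
    """Step 2: a token is out-of-vocabulary if the lexicon lacks it."""
    return token not in LEXICON


def generate_candidates(token, n=3):
    """Step 3: propose similar lexicon words.

    Plain string similarity stands in here for the thesis's phonetic
    algorithm, whose details are not given in this record.
    """
    return difflib.get_close_matches(token, sorted(LEXICON), n=n, cutoff=0.6)


def select_candidate(token, candidates):
    """Step 4: keep the top-ranked candidate, or the token if there is none."""
    return candidates[0] if candidates else token


def normalize(text):
    """Run the four steps in order and return the normalized tokens."""
    normalized = []
    for token in preprocess(text):
        if is_oov(token):
            token = select_candidate(token, generate_candidates(token))
        normalized.append(token)
    return normalized


if __name__ == "__main__":
    print(normalize("Jag tyker om skolan"))
```

In this sketch a token with no sufficiently similar lexicon entry is left unchanged, which is one possible design choice; the record does not say how the thesis handles that case.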