A Pipeline for Automatic Lexical Normalization of Swedish Student Writings

In this thesis, we explore the combination of different lexical normalization methods and provide a practical lexical normalization pipeline for Swedish student writings within the framework of SWEGRAM (Näsman et al., 2017). An important improvement in our implementation is that the pipeline de...

Bibliographic Details
Main Author: Liu, Yuhan
Format: Others
Language: English
Published: Uppsala universitet, Institutionen för lingvistik och filologi 2018
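Subjects: Lexical normalization; Phonetic algorithm for Swedish; Language Technology (Computational Linguistics); Språkteknologi (språkvetenskaplig databehandling)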
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-352450
id ndltd-UPSALLA1-oai-DiVA.org-uu-352450
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-uu-352450; 2018-06-12T06:20:30Z; Student thesis; info:eu-repo/semantics/bachelorThesis; text; application/pdf; info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Lexical normalization
Phonetic algorithm for Swedish
Language Technology (Computational Linguistics)
Språkteknologi (språkvetenskaplig databehandling)
description In this thesis, we explore the combination of different lexical normalization methods and provide a practical lexical normalization pipeline for Swedish student writings within the framework of SWEGRAM (Näsman et al., 2017). An important improvement in our implementation is that the pipeline design takes into account the unique morphological and phonological characteristics of the Swedish language. This kind of localization makes the system more robust for Swedish at the cost of being less applicable to other languages in similar tasks. The core of the localization lies in a phonetic algorithm we designed specifically for the Swedish language and a compound processing step for the Swedish compounding phenomenon. The proposed pipeline consists of four steps: preprocessing, identification of out-of-vocabulary words, generation of normalization candidates, and candidate selection. Different approaches are used for each step. We perform experiments on the Uppsala Corpus of Student Writings (UCSW) (Megyesi et al., 2016) and evaluate the results in terms of precision, recall and accuracy. The techniques applied to the raw data and their impact on the final result are presented. Our evaluation shows that the pipeline is useful for the lexical normalization task and that our phonetic algorithm is effective for Swedish.
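The four steps named in the description can be illustrated with a minimal Python sketch. It is not the thesis implementation: the toy lexicon, its frequencies, the grapheme rewrite rules and every function name below are assumptions made up for illustration, and the phonetic key is a simple stand-in rather than the phonetic algorithm designed in the thesis.

```python
import re
from collections import Counter

# Toy lexicon with invented frequencies (assumption); a real system would
# load a large Swedish word list.
LEXICON = Counter({"jag": 50, "tycker": 20, "om": 40,
                   "fot": 6, "boll": 7, "sjuk": 8})

# Illustrative grapheme rewrites only; NOT the thesis's phonetic algorithm.
PHONETIC_RULES = [("stj", "S"), ("skj", "S"), ("sj", "S"),
                  ("ck", "k"), ("c", "k")]


def preprocess(text):
    """Step 1: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def is_oov(token):
    """Step 2: a token is out-of-vocabulary if the lexicon lacks it."""
    return token not in LEXICON


def phonetic_key(word):
    """Map a word to a rough sound-alike key (hypothetical rules)."""
    for src, dst in PHONETIC_RULES:
        word = word.replace(src, dst)
    # Collapse doubled letters so that e.g. "boll" and "bol" share a key.
    return re.sub(r"(.)\1", r"\1", word)


def generate_candidates(token):
    """Step 3: lexicon words with the same key, plus a compound check."""
    key = phonetic_key(token)
    candidates = {w for w in LEXICON if phonetic_key(w) == key}
    # Compound step: if the token splits into two known words, treat the
    # token itself as a plausible compound and keep it as a candidate.
    for i in range(2, len(token) - 1):
        if token[:i] in LEXICON and token[i:] in LEXICON:
            candidates.add(token)
    return candidates


def select_candidate(token, candidates):
    """Step 4: pick the most frequent candidate, or keep the token."""
    if not candidates:
        return token
    return max(sorted(candidates), key=lambda w: LEXICON[w])


def normalize(text):
    """Run the four steps over a piece of text."""
    out = []
    for token in preprocess(text):
        if is_oov(token):
            token = select_candidate(token, generate_candidates(token))
        out.append(token)
    return out


print(normalize("Jag tyker om fotboll"))
```

In this sketch the misspelling "tyker" shares a key with "tycker" and is replaced by it, while "fotboll" is absent from the toy lexicon but is kept because it splits into the known words "fot" and "boll". Selecting by raw frequency is only one possible choice for the last step; the description does not say which selection method the thesis uses.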
author Liu, Yuhan
title A Pipeline for Automatic Lexical Normalization of Swedish Student Writings
publisher Uppsala universitet, Institutionen för lingvistik och filologi
publishDate 2018
url http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-352450