A Pipeline for Automatic Lexical Normalization of Swedish Student Writings

In this thesis, we explore how different lexical normalization methods can be combined and provide a practical lexical normalization pipeline for Swedish student writings within the framework of SWEGRAM (Näsman et al., 2017). An important improvement in our implementation is that the pipeline de...


Bibliographic Details
Main Author:	Liu, Yuhan
Format:	Others
Language:	English
Published:	Uppsala universitet, Institutionen för lingvistik och filologi 2018
Subjects:	Lexical normalization; Phonetic algorithm for Swedish; Language Technology (Computational Linguistics); Språkteknologi (språkvetenskaplig databehandling)
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-352450

id	ndltd-UPSALLA1-oai-DiVA.org-uu-352450
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-uu-352450 | 2018-06-12T06:20:30Z | A Pipeline for Automatic Lexical Normalization of Swedish Student Writings | eng | Liu, Yuhan | Uppsala universitet, Institutionen för lingvistik och filologi | 2018 | Lexical normalization | Phonetic algorithm for Swedish | Language Technology (Computational Linguistics) | Språkteknologi (språkvetenskaplig databehandling) | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-352450 | application/pdf | info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Lexical normalization; Phonetic algorithm for Swedish; Language Technology (Computational Linguistics); Språkteknologi (språkvetenskaplig databehandling)
description	In this thesis, we explore how different lexical normalization methods can be combined and provide a practical lexical normalization pipeline for Swedish student writings within the framework of SWEGRAM (Näsman et al., 2017). An important improvement in our implementation is that the pipeline design takes into account the unique morphological and phonological characteristics of the Swedish language. This localization makes the system more robust for Swedish at the cost of being less applicable to similar tasks in other languages. The core of the localization lies in a phonetic algorithm designed specifically for Swedish and a compound processing step that handles Swedish compounding. The proposed pipeline consists of four steps: preprocessing, identification of out-of-vocabulary words, generation of normalization candidates, and candidate selection. For each step we use different approaches. We perform experiments on the Uppsala Corpus of Student Writings (UCSW) (Megyesi et al., 2016) and evaluate the results in terms of precision, recall, and accuracy. We present the techniques applied to the raw data and their impact on the final result. Our evaluation shows that the pipeline can be useful for the lexical normalization task and that our phonetic algorithm is effective for Swedish.
author	Liu, Yuhan
title	A Pipeline for Automatic Lexical Normalization of Swedish Student Writings
publisher	Uppsala universitet, Institutionen för lingvistik och filologi
publishDate	2018
url	http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-352450
_version_	1718695181337755648
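
The four steps named in the description (preprocessing, out-of-vocabulary identification, candidate generation, candidate selection) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the toy lexicon, the function names, the 0.7 threshold, and the use of `difflib.SequenceMatcher` as a stand-in similarity measure are all assumptions made for illustration. The Swedish-specific phonetic algorithm and the compound processing step are not reproduced here.

```python
import re
from difflib import SequenceMatcher

# Hypothetical toy lexicon; a real system would load a full Swedish word list.
LEXICON = {"jag", "tycker", "om", "skolan", "och", "mycket"}

def preprocess(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-zåäöé]+", text.lower())

def is_oov(token):
    """Step 2: a token is out-of-vocabulary if the lexicon lacks it."""
    return token not in LEXICON

def generate_candidates(token, threshold=0.7):
    """Step 3: propose lexicon words whose string similarity to the
    token meets the (assumed) threshold, as (score, word) pairs."""
    scored = [(SequenceMatcher(None, token, w).ratio(), w) for w in LEXICON]
    return [(score, w) for score, w in scored if score >= threshold]

def select_candidate(token, candidates):
    """Step 4: pick the highest-scoring candidate, or keep the token
    unchanged when no candidate was generated."""
    if not candidates:
        return token
    return max(candidates)[1]

def normalize(text):
    """Run the four steps over a piece of text and return the tokens."""
    result = []
    for token in preprocess(text):
        if is_oov(token):
            token = select_candidate(token, generate_candidates(token))
        result.append(token)
    return result

print(normalize("Jag tyker om skolan"))
```

In this sketch the misspelling "tyker" is the only out-of-vocabulary token, and "tycker" is the only lexicon word similar enough to be offered as a candidate. A token with no sufficiently similar lexicon entry is left unchanged.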
