Named Entity Recognition with Support Vector Machines

This report describes a degree project in Computer Science, the aim of which was to construct a system for Named Entity Recognition in Swedish texts of names of people, locations and organizations, as well as expressions for time. This system was constructed from the part-of-speech tagger Granska an...


Bibliographic Details
Main Author: MICKELIN, JOEL
Format: Others
Language: English
Published: KTH, Skolan för datavetenskap och kommunikation (CSC) 2013
Subjects: Computer Sciences; Datavetenskap (datalogi)
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-138012
id ndltd-UPSALLA1-oai-DiVA.org-kth-138012
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-kth-138012 | 2018-01-12T05:12:31Z | Named Entity Recognition with Support Vector Machines | eng | MICKELIN, JOEL | KTH, Skolan för datavetenskap och kommunikation (CSC) | 2013 | Computer Sciences | Datavetenskap (datalogi) | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-138012 | TRITA-CSC-E, 1653-5715 ; 13:112 | application/pdf | info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Computer Sciences
Datavetenskap (datalogi)
description This report describes a degree project in Computer Science whose aim was to construct a system for Named Entity Recognition in Swedish texts, covering names of people, locations and organizations as well as expressions of time. The system was built on the part-of-speech tagger Granska and the Support Vector Machine system SVMlin. It was trained to recognize Named Entities by analyzing patterns in training corpora consisting of lists of example words belonging to each category. The system was initially trained to recognize patterns based on the individual characters in words, but was later rewritten to recognize other characteristics of individual words, such as the types of characters they contained. In evaluation, no version of the system performed satisfactorily at recognizing Named Entities of the aforementioned categories. A possible reason is that three of the categories (names of people, locations and organizations) have few or no distinguishing features between them, which might warrant more research. The system did prove capable on the related problem of distinguishing email addresses from other Named Entities, indicating that it might be of use in some cases of Named Entity Recognition.
author MICKELIN, JOEL
title Named Entity Recognition with Support Vector Machines
publisher KTH, Skolan för datavetenskap och kommunikation (CSC)
publishDate 2013
url http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-138012
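
The description mentions features based on the types of characters a word contains. As a rough illustration only, the following Python sketch shows what such word-level features might look like. The function name char_type_features and the specific features are assumptions made for this example; they are not taken from the thesis, which built its system on Granska and SVMlin.

```python
from __future__ import annotations

import string


def char_type_features(word: str) -> dict[str, float]:
    """Map a word to simple character-type features.

    Hypothetical feature set for illustration; not the thesis's own.
    """
    n = len(word) or 1  # avoid division by zero for the empty string
    return {
        "starts_upper": float(word[:1].isupper()),
        "frac_upper": sum(c.isupper() for c in word) / n,
        "frac_digit": sum(c.isdigit() for c in word) / n,
        "frac_punct": sum(c in string.punctuation for c in word) / n,
        "has_at": float("@" in word),
        "has_dot": float("." in word),
    }


if __name__ == "__main__":
    # Example inputs are made up for this sketch.
    for w in ["Stockholm", "name@example.com", "2013"]:
        print(w, char_type_features(w))
```

Features like has_at separate email addresses from other tokens almost trivially, whereas a person name and a place name would typically yield near-identical vectors under a feature set like this one, which is consistent with the difficulty the description reports.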