Named Entity Recognition with Support Vector Machines

Bibliographic Details
Main Author: MICKELIN, JOEL
Format: Others
Language: English
Published: KTH, Skolan för datavetenskap och kommunikation (CSC) 2013
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-138012
Description
Summary: This report describes a degree project in Computer Science whose aim was to construct a system for Named Entity Recognition in Swedish text, covering names of people, locations and organizations as well as time expressions. The system was built from the part-of-speech tagger Granska and the Support Vector Machine system SVMlin. It was trained to recognize Named Entities by analyzing patterns in training corpora consisting of lists of example words belonging to each category. The system was initially trained to recognize patterns based on the individual characters in words, but was later rewritten to recognize other characteristics of individual words, such as the types of characters they contain. In evaluation, no version of the system performed satisfactorily at recognizing Named Entities of the aforementioned categories. A possible reason is that three of the categories (names of people, names of locations and names of organizations) have few or no distinguishing features between them, which might warrant more research. The system did perform well on the related problem of distinguishing email addresses from other Named Entities, indicating that it might be of use in some cases of Named Entity Recognition.
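
To make the idea of features based on "the types of characters" in a word concrete, the following is a minimal Python sketch. It is not the thesis's actual feature set, and it does not use Granska or SVMlin; the function name `char_type_features` and the five features are hypothetical choices made only for illustration.

```python
def char_type_features(word: str) -> dict[str, float]:
    """Map a word to a few character-type features (illustrative only)."""
    n = len(word)
    if n == 0:
        raise ValueError("word must be non-empty")
    return {
        # 1.0 if the first character is an uppercase letter, else 0.0
        "starts_upper": float(word[0].isupper()),
        # share of characters that are uppercase letters
        "frac_upper": sum(c.isupper() for c in word) / n,
        # share of characters that are digits
        "frac_digit": sum(c.isdigit() for c in word) / n,
        # share of characters that are neither letters nor digits
        "frac_other": sum(not c.isalnum() for c in word) / n,
        # 1.0 if the word contains an at sign, else 0.0
        "has_at_sign": float("@" in word),
    }


if __name__ == "__main__":
    # Hypothetical examples: a place name, an organization name, an email address.
    for example in ["Stockholm", "Ericsson", "anna@example.com"]:
        print(example, char_type_features(example))
```

Under these assumed features, a place name and an organization name map to almost the same vector, while an email address differs clearly through its at sign and punctuation. That is in line with the explanation the summary offers as a possibility, but it is an illustration and not a reproduction of the reported evaluation.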