Named-entity recognition in Czech historical texts : Using a CNN-BiLSTM neural network model

Bibliographic Details
Main Author: Hubková, Helena
Format: Others
Language: English
Published: Uppsala universitet, Institutionen för lingvistik och filologi 2019
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-385682
Description
Summary: The thesis presents named-entity recognition in Czech historical newspapers from the Modern Access to Historical Sources project. Our goal was to create a dedicated corpus and annotation manual for the project and to evaluate neural network methods for named-entity recognition on this task. We created the corpus from scanned Czech historical newspapers. The scanned pages were converted to digitized text using optical character recognition (OCR), and the data were preprocessed by removing some OCR errors. We also defined named-entity types specific to our task and created an annotation manual with examples for the project. Based on that manual, we annotated the final corpus. To find the most suitable neural network model for our task, we experimented with different architectures, namely long short-term memory (LSTM), bidirectional LSTM (BiLSTM), and CNN-BiLSTM models. We also compared randomly initialized word embeddings, trained together with the model, against pretrained word embeddings for contemporary Czech published as open source by fastText. Our best result, an F1 score of 0.444, was achieved with the CNN-BiLSTM model and the pretrained fastText word embeddings. We found that the spelling of our historical texts does not need to be normalized toward contemporary language when the neural network model is used. We also provide a qualitative analysis of the observed linguistic phenomena. We found that some word forms and word pairs that were infrequent in our training data set were mistagged or not tagged at all. Based on that, we conclude that larger data sets could improve the results.
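For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1 score for named-entity recognition can be computed. It assumes BIO-style tags and exact-match, entity-level scoring; the summary does not state which tagging scheme or matching criterion the thesis used. The function names and the `PER`/`LOC` labels are illustrative only and are not taken from the thesis or its annotation manual.

```python
from typing import List, Optional, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        start, label = None, None
        # A stray "I-" tag is leniently treated as the start of an entity.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return spans


def f1_score(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1: harmonic mean of precision and recall."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of two gold entities is predicted exactly.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(f1_score(gold, pred))
```

Under this strict criterion, a predicted entity counts as correct only if both its boundaries and its type match the gold annotation, so a partially correct span contributes nothing to the score.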