Old Catalan Morphosyntax: Developing an Annotated Corpus

Bibliographic Details
Published in: Journal of Open Humanities Data
Main Authors: Marieke Meelen, Afra Pujol i Campeny
Format: Article
Language: English
Published: Ubiquity Press 2021-12-01
Online Access: https://openhumanitiesdata.metajnl.com/articles/54
Description
Summary: This paper presents a full procedure for the development of a Part-of-Speech (POS) tagged corpus of Old Catalan. As an extremely low-resource language with rich inflection and frequent homographs, Old Catalan poses non-trivial problems in the development of a searchable constituency-based treebank. We demonstrate, however, that a semi-supervised method of incrementally building training data using both neural and memory-based taggers, together with the Pyrrha annotation tool, is highly efficient and yields accurate results. We propose that this simple and effective method could easily be extended to other low-resource historical languages for which no NLP tools exist yet.
ISSN: 2059-481X