Old Catalan Morphosyntax: Developing an Annotated Corpus

Bibliographic Details
Published in:	Journal of Open Humanities Data
Main Authors:	Marieke Meelen, Afra Pujol i Campeny
Format:	Article
Language:	English
Published:	Ubiquity Press 2021-12-01
Subjects:	Old Catalan; POS tagging; historical treebank
Online Access:	https://openhumanitiesdata.metajnl.com/articles/54

Description
Summary:	This paper presents a full procedure for the development of a Part-of-Speech (POS) tagged corpus of Old Catalan. As an extremely low-resource language with rich inflection and frequent homographs, Old Catalan poses non-trivial problems in the development of a searchable constituency-based treebank. We demonstrate, however, that a semi-supervised method of incrementally building training data using both neural and memory-based taggers, together with the Pyrrha annotation tool, is highly efficient and yields accurate results. We propose that this simple and effective method could easily be extended to other low-resource historical languages for which no NLP tools exist yet.
ISSN:	2059-481X

Similar Items