Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language

Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most resources used in biomedical NLP tasks are in English. Little effort has been reported for other languages, including Ro...

Bibliographic Details
Main Authors:	Maria Mitrofan, Verginica Barbu Mititelu, Grigorina Mitrofan
Format:	Article
Language:	English
Published:	MDPI AG 2018-11-01
Series:	Data
Subjects:	corpus; biomedical; Romanian; part-of-speech tags; named entities
Online Access:	https://www.mdpi.com/2306-5729/3/4/53

id	doaj-9f847878789b4fbe8add10ccb15c1d9c
timestamp	2020-11-24T21:23:00Z
doi	10.3390/data3040053
volume	3
issue	4
article_number	53
author_affiliations	Maria Mitrofan: Romanian Academy Research Institute for Artificial Intelligence, 13 Calea 13 Septembrie, Bucharest 050711, Romania; Verginica Barbu Mititelu: Romanian Academy Research Institute for Artificial Intelligence, 13 Calea 13 Septembrie, Bucharest 050711, Romania; Grigorina Mitrofan: National Institute of Diabetes and Metabolic Diseases “N.C. Paulescu”, 5-7 Ion Movilă Street, Bucharest 020475, Romania
collection	DOAJ
sources	DOAJ
issn	2306-5729
description	Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most resources used in biomedical NLP tasks are in English. Little effort has been reported for other languages, including Romanian, and access to such language resources is therefore poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed across three medical subdomains (cardiology, diabetes and endocrinology), extracted from books, journals and blog posts. To annotate the corpus automatically with POS tags, we used a Romanian tag set of 715 labels. For bioNER, four entity types (diseases, anatomy, procedures, and chemicals and drugs) were annotated manually with a Cohen's kappa coefficient of 92.8%, yielding 1,877 medical named entities. The automatic annotation of the corpus has been manually checked. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.
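
The description reports inter-annotator agreement as a Cohen's kappa coefficient of 92.8%. As a reminder of what that statistic measures, the following is a minimal sketch of the standard two-annotator formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name, the label strings (DISO, ANAT, PROC, CHEM, O) and the toy annotations are assumptions made for illustration only; they are not taken from the corpus, and the record does not say how the authors computed their figure.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: for each label, the product of the two annotators'
    # marginal frequencies, summed over all labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical token-level labels (not from MoNERo); "O" marks a non-entity token.
annotator_1 = ["DISO", "ANAT", "O", "CHEM", "PROC", "O", "DISO", "O"]
annotator_2 = ["DISO", "ANAT", "O", "CHEM", "O", "O", "DISO", "O"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))  # 0.826
```

In this toy case the annotators agree on 7 of 8 tokens (p_o = 0.875), but kappa is lower (about 0.826) because some of that agreement would be expected by chance. On the same scale, the reported 92.8% would correspond to a kappa of 0.928.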

Similar Items