AlBERTo: Modeling Italian Social Media Language with BERT

Natural Language Processing tasks have attracted considerable interest and made significant progress following the release of numerous innovative artificial intelligence models in recent years. The increase in available computing power has made it possible to apply machine learning approaches to considerable amounts of textual data, demonstrating that they can obtain very encouraging results on challenging NLP tasks by generalizing the properties of natural language directly from the data. Models such as ELMo, GPT/GPT-2, BERT, ERNIE, and RoBERTa have proved to be extremely useful in NLP tasks such as entailment, sentiment analysis, and question answering. The availability of these resources mainly for the English language motivated the realization of AlBERTo, a natural language model based on BERT and trained on Italian. We decided to train AlBERTo from scratch on social network language, Twitter in particular, because many of the classic tasks of content analysis are oriented to data extracted from the digital sphere of users. The model was distributed to the community through a repository on GitHub and the Transformers library (Wolf et al. 2019) released by the development group huggingface.co. We evaluated the validity of the model on the classification tasks of sentiment polarity, irony, subjectivity, and hate speech. The specifications of the model, the code developed for training and fine-tuning, and the instructions for using it in a research project are freely available.

Bibliographic Details
Published in: IJCoL
Main Authors: Marco Polignano, Valerio Basile, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro
Format: Article
Language: English
Published: Accademia University Press, 2019-12-01
Online Access: https://journals.openedition.org/ijcol/472
id doaj-art-abfc83a8146b4ff6b92a613c3a487921
institution Directory of Open Access Journals
issn 2499-4553
volume 5, issue 2, pages 11-31
doi 10.4000/ijcol.472