LEADER 01503nam a2200205Ia 4500
001 10.1186-s40537-022-00590-7
008 220425s2022 CNT 000 0 und d
022 __ |a 2196-1115 (ISSN)
245 10 |a Pre-trained transformer-based language models for Sundanese
260 _0 |b Springer Science and Business Media Deutschland GmbH |c 2022
856 __ |u https://doi.org/10.1186/s40537-022-00590-7 |z View Fulltext in Publisher
520 3_ |a The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefit from the recent advances in natural language understanding. As with other low-resource languages, the only practical alternative has been to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In subsequent analyses, our models benefited strongly from the size of the Sundanese pre-training corpus and did not exhibit socially biased behavior. We released our models for other researchers and practitioners to use. © 2022, The Author(s).
650 04 |a Low-resource Language
650 04 |a Natural Language Understanding
650 04 |a Sundanese Language
650 04 |a Transformers
700 1_ |a Lucky, H. |e author
700 1_ |a Suhartono, D. |e author
700 1_ |a Wongso, W. |e author
773 __ |t Journal of Big Data
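
For readers who want to pull fields out of a record in this line-per-field display, the sketch below is one way to do it in Python. It is a minimal, illustrative parser written only for this layout, not a general MARC 21 reader; the function name parse_display and the filename record.txt are hypothetical, and the naive splitting on the | delimiter assumes no subfield value contains that character.

def parse_display(lines):
    """Parse the line-per-field MARC display above into {tag: [field, ...]}."""
    fields = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("LEADER"):
            continue
        tag, _, rest = line.partition(" ")
        rest = rest.strip()
        if "|" not in rest:
            # Control fields (001, 008) carry a bare value and no subfields.
            fields.setdefault(tag, []).append(rest)
            continue
        indicators, _, body = rest.partition("|")
        subs = {"ind": indicators.strip()}
        # Naive split: assumes "|" appears only as the subfield delimiter.
        for chunk in body.split("|"):
            code, _, value = chunk.partition(" ")
            subs[code] = value.strip()
        fields.setdefault(tag, []).append(subs)
    return fields

with open("record.txt", encoding="utf-8") as f:  # hypothetical filename
    rec = parse_display(f)

print(rec["245"][0]["a"])  # Pre-trained transformer-based language models for Sundanese
print(rec["856"][0]["u"])  # https://doi.org/10.1186/s40537-022-00590-7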