AMSunda: A novel dataset for Sundanese information retrieval


Bibliographic Details
Published in: Data in Brief
Main Authors: Aries Maesya, Yulyani Arifin, Amalia Zahra, Widodo Budiharto
Format: Article
Language: English
Published: Elsevier 2025-08-01
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340925005232
Description
Summary: Information Retrieval is crucial in many areas, including Search Engines, Information Systems, and Databases. Sundanese, an indigenous language of West Java in Indonesia, suffers from limited data availability, especially for Information Retrieval tasks. Previous efforts to build Sundanese datasets mainly focused on text classification and generation, leaving information retrieval underexplored. To address this gap, we introduce the AMSunda dataset, the first resource designed explicitly for fine-tuning and evaluating embedding models in the Sundanese language. The AMSunda dataset consists of two dataset types: (1) triplet data, in which each record contains a query passage, a positive response, and a negative response, intended for fine-tuning embedding models, and (2) BEIR-compatible data structured for evaluating embedding models on retrieval tasks. The dataset consists of 1499 documents generated using the GPT-4o-mini LLM, resulting in 7492 triplet passages and 7491 BEIR-format queries. This dataset enables further development of Sundanese-focused models in Information Retrieval.
ISSN:2352-3409
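
The summary describes two data layouts: triplets (query, positive, negative) and BEIR-compatible data. The following is a minimal Python sketch of how those two layouts relate, not the authors' construction procedure. The triplet field names (`query`, `positive`, `negative`), the placeholder strings, the ID scheme, and the helper `triplets_to_beir` are all assumptions made for illustration; the record does not state the actual AMSunda field names. The BEIR side follows the usual BEIR convention of a corpus keyed by document ID, queries keyed by query ID, and relevance judgments (qrels) mapping each query to its relevant documents.

```python
import json

# Hypothetical triplet records. Field names and placeholder strings are
# assumptions for illustration, not actual AMSunda content.
triplets = [
    {"query": "query text 1", "positive": "relevant passage 1", "negative": "unrelated passage 1"},
    {"query": "query text 2", "positive": "relevant passage 2", "negative": "unrelated passage 2"},
]


def triplets_to_beir(records):
    """Build BEIR-style corpus, queries, and qrels dictionaries from triplets."""
    corpus = {}    # doc_id -> {"title": ..., "text": ...}
    queries = {}   # query_id -> query text
    qrels = {}     # query_id -> {doc_id: relevance score}
    doc_ids = {}   # passage text -> doc_id, so repeated passages share one ID

    def doc_id_for(text):
        # Assign a new ID the first time a passage is seen.
        if text not in doc_ids:
            doc_ids[text] = f"d{len(doc_ids)}"
            corpus[doc_ids[text]] = {"title": "", "text": text}
        return doc_ids[text]

    for i, rec in enumerate(records):
        qid = f"q{i}"
        queries[qid] = rec["query"]
        pos_id = doc_id_for(rec["positive"])
        doc_id_for(rec["negative"])   # negatives join the corpus as distractors
        qrels[qid] = {pos_id: 1}      # only the positive passage is marked relevant

    return corpus, queries, qrels


corpus, queries, qrels = triplets_to_beir(triplets)

# BEIR stores the corpus and queries as JSON Lines with an "_id" field.
corpus_jsonl = "\n".join(json.dumps({"_id": k, **v}) for k, v in corpus.items())
queries_jsonl = "\n".join(json.dumps({"_id": k, "text": v}) for k, v in queries.items())
```

In this sketch the negative passages are kept in the corpus so that a retrieval model has distractors to rank against; whether AMSunda's BEIR split is built this way is not stated in the record.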