Best way for collecting data for low-resourced languages

Bibliographic Details
Main Author: Karim, Hiva
Format: Others
Language: English
Published: Högskolan Dalarna, Mikrodataanalys 2020
Subjects:
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:du-35945
Description
Summary: Low-resource languages possess a limited number of digitized texts, making it challenging to generate a satisfactory language audio corpus and information retrieval services. Low-resource languages, especially those spoken exclusively in African countries, lack a well-defined and annotated language corpus, which is a major obstacle for experts trying to provide a comprehensive text processing system. In this study, I identified the best practices for producing and collecting data for such zero/low-resource languages by means of crowd-sourcing. For the purpose of this study, a number of research articles (n=260) were extracted from Google Scholar, Microsoft Academic, and ScienceDirect. Of these articles, only 60 met the inclusion criteria and were considered for eligibility review. Full-text versions of these research articles were downloaded and carefully screened to ensure eligibility. As a result of the eligibility assessment of the 60 potentially eligible full-text articles, only 25 were selected and qualified for inclusion in the final review. From the final pool of selected articles concerning data generation and collection practices for low-resource languages, it can be concluded that speech-based audio data is one of the most common and accessible data types. It can be contended that collecting audio data from speech-based resources, such as native speakers of the intended language and available audio recordings, by taking advantage of new technologies is the most practical, cost-effective, and common method for collecting data for low-resource languages.