Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation
Main Authors: | Michael Adjeisah, Guohua Liu, Douglas Omwenga Nyabuga, Richard Nuetey Nortey, Jinling Song |
---|---|
Format: | Article |
Language: | English |
Published: | Hindawi Limited, 2021-01-01 |
Series: | Computational Intelligence and Neuroscience |
ISSN: | 1687-5273 |
Online Access: | http://dx.doi.org/10.1155/2021/6682385 |
Author affiliations: Michael Adjeisah, Guohua Liu, and Douglas Omwenga Nyabuga (School of Computer Science and Technology); Richard Nuetey Nortey (School of Information Science and Technology); Jinling Song (School of Mathematics and Information Technology).

Abstract: Scaling natural language processing (NLP) to low-resource languages to improve machine translation (MT) performance remains challenging. This research contributes to the domain with a low-resource English-Twi translation system based on filtered synthetic-parallel corpora. It is often difficult to determine what a good-quality corpus looks like under low-resource conditions, particularly when the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose expanding the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language, bootstrapped with different parameter settings. Furthermore, we perform unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we apply three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel corpora demonstrate that injecting a pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improve out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach yields substantial gains in BLEU and TER scores.
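The pseudotext-injection step described above can be sketched as follows. This is a minimal illustration only: `translate_to_source` is a hypothetical placeholder for a trained target-to-source (e.g., Twi-to-English) NMT model, which the sketch does not implement.

```python
# Sketch of pseudoparallel corpus injection via back-translation.
# `translate_to_source` is a hypothetical stand-in for a trained
# target-to-source NMT model; any MT backend could be substituted.

def translate_to_source(target_sentence: str) -> str:
    # Placeholder: a real system would invoke a trained model here.
    return target_sentence[::-1]  # dummy behaviour for illustration only

def inject_pseudotext(parallel, monolingual_target):
    """Extend authentic (source, target) pairs with synthetic pairs
    whose source side is machine-translated from target-language text."""
    synthetic = [(translate_to_source(t), t) for t in monolingual_target]
    return parallel + synthetic

# One authentic pair plus two monolingual target sentences
# yields a training corpus of three pairs.
corpus = inject_pseudotext([("hello", "agoo")], ["akwaaba", "medaase"])
```

In the paper's setting, this injection is repeated under different parameter settings (bootstrapping); the sketch shows only a single pass.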
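The squared-Mahalanobis filtering idea can be illustrated as below, under the assumption (not stated in the abstract) that each sentence pair is represented by the difference of its source and target sentence embeddings, and that pairs far from the corpus centre are discarded. The `keep_ratio` cutoff is an illustrative choice, not the paper's.

```python
import numpy as np

def squared_mahalanobis(vectors):
    """Squared Mahalanobis distance of each row from the sample mean,
    using the (regularised) sample covariance of the rows."""
    mu = vectors.mean(axis=0)
    cov = np.cov(vectors, rowvar=False) + 1e-6 * np.eye(vectors.shape[1])
    inv = np.linalg.inv(cov)
    diff = vectors - mu
    # Per-row quadratic form diff @ inv @ diff.T, taken on the diagonal.
    return np.einsum("ij,jk,ik->i", diff, inv, diff)

def filter_pairs(pairs, src_emb, tgt_emb, keep_ratio=0.8):
    """Keep the sentence pairs whose embedding-difference vectors lie
    closest (in squared Mahalanobis distance) to the corpus centre."""
    d2 = squared_mahalanobis(src_emb - tgt_emb)
    cutoff = np.quantile(d2, keep_ratio)
    return [p for p, d in zip(pairs, d2) if d <= cutoff]
```

Because the measurement is unsupervised, no parallelism labels are needed: outlying difference vectors are simply treated as likely non-parallel pairs.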
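The round-trip filtering step can be sketched in the same spirit. The abstract does not name the three sentence-level metrics used, so the sketch substitutes a simple character-level `difflib` ratio as a stand-in; `back` is a hypothetical mapping from each source sentence to its round-trip reconstruction, and the threshold is illustrative.

```python
import difflib

def round_trip_score(original: str, round_trip: str) -> float:
    """Character-level similarity between a sentence and its round-trip
    translation. difflib's ratio is a simple stand-in metric; the
    paper's actual three metrics are not reproduced here."""
    return difflib.SequenceMatcher(None, original, round_trip).ratio()

def filter_by_round_trip(pairs, back, threshold=0.6):
    """Keep (source, target) pairs whose source sentence survives
    round-trip translation with similarity >= threshold. `back` maps
    each source sentence to its round-trip output (hypothetical)."""
    return [p for p in pairs
            if round_trip_score(p[0], back[p[0]]) >= threshold]
```

A pair whose source sentence comes back badly mangled after source-to-target-to-source translation scores low and is dropped, which is the intuition behind round-trip filtering.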