Morphological segmentation method for Turkic language neural machine translation

Dictionaries play an important role in neural machine translation (NMT). However, a large dictionary requires a significant amount of memory, which limits the application of NMT and can cause a memory error. This limitation can be solved by segmenting each word into morphemes in parallel source corp...

Bibliographic Details
Main Authors: U. Tukeyev, A. Karibayeva, Zh. Zhumanov
Format: Article
Language: English
Published: Taylor & Francis Group 2020-01-01
Series: Cogent Engineering
Subjects:
Online Access: http://dx.doi.org/10.1080/23311916.2020.1856500
id doaj-a7fecba605c9481998eec25cd64f0095
record_format Article
spelling doaj-a7fecba605c9481998eec25cd64f0095; 2021-06-21T13:17:40Z; eng; Taylor & Francis Group; Cogent Engineering; ISSN 2331-1916; 2020-01-01; vol. 7, no. 1; DOI 10.1080/23311916.2020.1856500; article 1856500; Morphological segmentation method for Turkic language neural machine translation; U. Tukeyev, A. Karibayeva, Zh. Zhumanov (all Al-Farabi Kazakh National University); http://dx.doi.org/10.1080/23311916.2020.1856500; keywords: neural machine translation, morphological segmentation, turkic languages, kazakh, kyrgyz, uzbek
collection DOAJ
language English
format Article
sources DOAJ
author U. Tukeyev
A. Karibayeva
Zh. Zhumanov
spellingShingle U. Tukeyev
A. Karibayeva
Zh. Zhumanov
Morphological segmentation method for Turkic language neural machine translation
Cogent Engineering
neural machine translation
morphological segmentation
turkic languages
kazakh
kyrgyz
uzbek
author_facet U. Tukeyev
A. Karibayeva
Zh. Zhumanov
author_sort U. Tukeyev
title Morphological segmentation method for Turkic language neural machine translation
publisher Taylor & Francis Group
series Cogent Engineering
issn 2331-1916
publishDate 2020-01-01
description Dictionaries play an important role in neural machine translation (NMT). However, a large dictionary requires a significant amount of memory, which limits the application of NMT and can cause memory errors. This limitation can be addressed by segmenting each word into morphemes in parallel source corpora. This study therefore introduces a new morphological segmentation approach for Turkic languages based on the complete set of endings (CSE), which reduces the vocabulary volume of the source corpora. We demonstrate the proposed CSE-based morphological segmentation method for the Kazakh, Kyrgyz, and Uzbek languages and present the results of computational NMT experiments for the Kazakh language. The NMT experiments show that, compared with byte-pair encoding (BPE)-based segmentation, the proposed CSE-based segmentation increases the bilingual evaluation understudy (BLEU) score by 0.5 and 0.2 points on average for the Kazakh–English and English–Kazakh pairs, respectively. Furthermore, compared with BPE-based segmentation, the proposed CSE-based segmentation reduces the vocabulary size in NMT by more than a factor of two. This property of the proposed approach will be crucial for NMT as the size of the source corpora is increased to improve translation quality.
topic neural machine translation
morphological segmentation
turkic languages
kazakh
kyrgyz
uzbek
url http://dx.doi.org/10.1080/23311916.2020.1856500
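
The description above says that words are split into morphemes using a set of endings. A minimal sketch of that idea follows, assuming a simple longest-match rule against a known ending list. The names `TOY_ENDINGS`, `segment`, and `segment_sentence`, the `@@` marker, and the `min_stem` parameter are hypothetical, and the handful of Kazakh endings is illustrative only; it is not the complete set of endings (CSE) and not necessarily the procedure used in the article.

```python
# Illustrative Kazakh endings only (plural and plural + locative).
# This is NOT the complete set of endings (CSE) described in the article.
TOY_ENDINGS = {"лар", "лер", "тар", "тер", "да", "де", "ларда", "тарда"}


def segment(word, endings=TOY_ENDINGS, min_stem=2):
    """Split a word into (stem, ending) using the longest ending found in `endings`.

    Returns (word, "") when no ending matches or the stem would be
    shorter than `min_stem` characters (an assumed safeguard).
    """
    # Scanning cut points left to right tries the longest candidate ending first.
    for cut in range(min_stem, len(word)):
        ending = word[cut:]
        if ending in endings:
            return word[:cut], ending
    return word, ""


def segment_sentence(sentence, endings=TOY_ENDINGS):
    """Segment whitespace-separated words; endings are prefixed with an assumed '@@' marker."""
    tokens = []
    for word in sentence.split():
        stem, ending = segment(word, endings)
        tokens.append(stem)
        if ending:
            tokens.append("@@" + ending)
    return tokens


if __name__ == "__main__":
    print(segment("кітаптарда"))
    print(segment_sentence("кітаптар балаларда"))
```

With this rule, inflected forms of one stem share the stem token and the ending tokens are reused across stems, which is the mechanism by which ending-based segmentation can shrink a vocabulary on a large corpus.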