Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach

To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual web sites and aligned into datasets for training. Many tools exist for automatic alignment of such datasets. However, the qua...


Bibliographic Details
Main Authors: Arne Defauw, Sara Szoc, Anna Bardadym, Joris Brabers, Frederic Everaert, Roko Mijic, Kim Scholte, Tom Vanallemeersch, Koen Van Winckel, Joachim Van den Bogaert
Format: Article
Language: English
Published: MDPI AG 2019-09-01
Series: Informatics
Online Access: https://www.mdpi.com/2227-9709/6/3/35