Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach

To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual web sites and aligned into datasets for training. Many tools exist for automatic alignment of such datasets. However, the qua...


Bibliographic Details
Main Authors: Arne Defauw, Sara Szoc, Anna Bardadym, Joris Brabers, Frederic Everaert, Roko Mijic, Kim Scholte, Tom Vanallemeersch, Koen Van Winckel, Joachim Van den Bogaert
Format: Article
Language: English
Published: MDPI AG 2019-09-01
Series: Informatics
Subjects: data-curation, web crawling, neural machine translation
Online Access: https://www.mdpi.com/2227-9709/6/3/35
id doaj-2442fc86137a448db0a3d4a8d3d99a98
record_format Article
spelling doaj-2442fc86137a448db0a3d4a8d3d99a98; 2020-11-25T01:24:05Z; eng; MDPI AG; Informatics; ISSN 2227-9709; 2019-09-01; 6(3), 35; doi:10.3390/informatics6030035; informatics6030035
Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach
Arne Defauw (CrossLang NV, 9050 Gentbrugge, Belgium)
Sara Szoc (CrossLang NV, 9050 Gentbrugge, Belgium)
Anna Bardadym (CrossLang NV, 9050 Gentbrugge, Belgium)
Joris Brabers (CrossLang NV, 9050 Gentbrugge, Belgium)
Frederic Everaert (CrossLang NV, 9050 Gentbrugge, Belgium)
Roko Mijic (Independent Data Science Consultant, 9000 Ghent, Belgium)
Kim Scholte (CrossLang NV, 9050 Gentbrugge, Belgium)
Tom Vanallemeersch (CrossLang NV, 9050 Gentbrugge, Belgium)
Koen Van Winckel (CrossLang NV, 9050 Gentbrugge, Belgium)
Joachim Van den Bogaert (CrossLang NV, 9050 Gentbrugge, Belgium)
collection DOAJ
language English
format Article
sources DOAJ
author Arne Defauw
Sara Szoc
Anna Bardadym
Joris Brabers
Frederic Everaert
Roko Mijic
Kim Scholte
Tom Vanallemeersch
Koen Van Winckel
Joachim Van den Bogaert
title Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach
publisher MDPI AG
series Informatics
issn 2227-9709
publishDate 2019-09-01
description To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual web sites and aligned into datasets for training. Many tools exist for automatic alignment of such datasets. However, the quality of the resulting aligned corpus can be disappointing. In this paper, we present a tool for automatic misalignment detection (MAD). We treated the task of determining whether a pair of aligned sentences constitutes a genuine translation as a supervised regression problem. We trained our algorithm on a manually labeled dataset in the FR–NL language pair. Our algorithm used shallow features and features obtained after an initial translation step. We showed that the Levenshtein distance between the target and the translated source and the cosine distance between sentence embeddings of the source and the target were the two most important features for the task of misalignment detection. Using gold standards for alignment, we demonstrated that our model can substantially increase the quality of alignments in a corpus, reaching a precision close to 100%. Finally, we used our tool to investigate the effect of misalignments on NMT performance.
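The two features the description singles out can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the length normalization of the edit distance, and the convention of returning a distance of 1.0 for a zero vector are all assumptions made for illustration, and the embeddings are taken to be plain equal-length lists of floats produced by some sentence encoder that is not shown here.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 - cosine similarity; assumes u and v have the same length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumed convention for an all-zero embedding
    return 1.0 - dot / (norm_u * norm_v)


def pair_features(translated_source: str, target: str,
                  source_embedding: list[float],
                  target_embedding: list[float]) -> dict[str, float]:
    """Hypothetical feature vector for one aligned sentence pair."""
    return {
        "levenshtein": normalized_levenshtein(translated_source, target),
        "cosine": cosine_distance(source_embedding, target_embedding),
    }
```

In a pipeline of the kind the description outlines, such values would be computed for every aligned pair and passed, together with the shallow features, to a regression model trained on the manually labeled pairs; the choice of regressor and of a decision threshold is left open here.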
topic data-curation
web crawling
neural machine translation
url https://www.mdpi.com/2227-9709/6/3/35