Extracting Parallel Paragraphs from Common Crawl

Most current methods for mining parallel texts from the web assume that the pages of a website share the same structure across languages. We believe that a non-negligible amount of parallel data is still spread across sources that do not satisfy this assumption. We propose an approach based...

Bibliographic Details
Main Authors: Kúdela Jakub, Holubová Irena, Bojar Ondřej
Format: Article
Language: English
Published: Sciendo 2017-04-01
Series: Prague Bulletin of Mathematical Linguistics
Online Access: https://doi.org/10.1515/pralin-2017-0003