Cross-Lingual Alignment of Word & Sentence Embeddings


Bibliographic Details
Main Author: Aldarmaki, Hanan
Language: EN
Published: The George Washington University 2019
Subjects: Language | Computer science
Online Access: http://pqdtopen.proquest.com/#viewpdf?dispub=13812118
id ndltd-PROQUEST-oai-pqdtoai.proquest.com-13812118
record_format oai_dc
spelling ndltd-PROQUEST-oai-pqdtoai.proquest.com-138121182019-04-12T03:46:12Z Cross-Lingual Alignment of Word & Sentence Embeddings Aldarmaki, Hanan Language|Computer science The George Washington University 2019-04-11 00:00:00.0 thesis http://pqdtopen.proquest.com/#viewpdf?dispub=13812118 EN
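The description mentions composing representations for phrases and sentences by combining word embeddings with arithmetic operations such as vector averaging. A minimal sketch of that averaging step, using toy three-dimensional vectors and a hypothetical `average_embedding` helper (neither taken from the thesis):

```python
import numpy as np

# Toy word vectors; real embeddings would be learned from a large corpus.
word_vectors = {
    "cross": np.array([0.1, 0.3, -0.2]),
    "lingual": np.array([0.0, 0.5, 0.4]),
    "alignment": np.array([-0.3, 0.2, 0.1]),
}

def average_embedding(tokens, vectors):
    """Average the embeddings of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(next(iter(vectors.values())).shape)
    return np.mean(known, axis=0)

sentence = average_embedding(["cross", "lingual", "alignment"], word_vectors)
```

Averaging is order-insensitive, which is why the description contrasts it with composition parameters estimated from data under task-specific objectives.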
collection NDLTD
language EN
sources NDLTD
topic Language|Computer science
spellingShingle Language|Computer science
Aldarmaki, Hanan
Cross-Lingual Alignment of Word & Sentence Embeddings
description One of the notable developments in current natural language processing is the practical efficacy of probabilistic word representations, where words are embedded in high-dimensional continuous vector spaces that are optimized to reflect their distributional relationships. For sequences of words, such as phrases and sentences, distributional representations can be estimated by combining word embeddings using arithmetic operations like vector averaging, or by estimating composition parameters from data using various objective functions. The quality of these compositional representations is typically assessed by their performance as features in extrinsic supervised classification benchmarks. Word and compositional embeddings for a single language can be induced without supervision using a large training corpus of raw text. To handle multiple languages and dialects, bilingual dictionaries and parallel corpora are often used either to learn cross-lingual embeddings directly or to align pre-trained monolingual embeddings. In this work, we explore and develop various cross-lingual alignment techniques, compare the performance of the resulting cross-lingual embeddings, and study their characteristics. We pay particular attention to the bilingual data requirements of each approach, since lower requirements facilitate wider language expansion. To begin with, we analyze various monolingual general-purpose sentence embedding models to better understand their qualities. By comparing their performance on extrinsic evaluation benchmarks and unsupervised clustering, we infer the characteristics of the most dominant features in their respective vector spaces. We then look into various cross-lingual alignment frameworks with different degrees of supervision.

We begin with unsupervised word alignment, for which we propose an approach for inducing cross-lingual word mappings with no prior bilingual resources. We rely on assumptions about the consistency and structural similarity between the monolingual vector spaces of different languages. Using comparable monolingual news corpora, our approach yields highly accurate word mappings for two language pairs: French to English and Arabic to English. With various refinement heuristics, the performance of the unsupervised alignment methods approaches that of supervised dictionary mapping.

Finally, we develop and evaluate different alignment approaches based on parallel text. We show that incorporating context in the alignment process often leads to significant improvements in performance. At the word level, we explore the alignment of contextualized word embeddings that are dynamically generated for each sentence. At the sentence level, we develop and investigate three alignment frameworks: joint modeling, representation transfer, and sentence mapping, applied to different sentence embedding models. We experiment with a matrix factorization model based on word-sentence co-occurrence statistics and two general-purpose neural sentence embedding models. We report the performance of the various cross-lingual models with different sizes of parallel corpora to assess the minimal degree of supervision required by each alignment framework.
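The supervised dictionary mapping that the unsupervised methods are compared against is commonly formulated as an orthogonal Procrustes problem: given matrices X and Y whose rows are embeddings of dictionary-paired words, find the orthogonal W minimizing the Frobenius norm of XW - Y. A sketch on synthetic data (the random vectors and planted rotation are illustrative assumptions; the thesis's exact formulation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 50
Y = rng.normal(size=(n, d))                        # "target-language" vectors
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # hidden orthogonal rotation
X = Y @ R_true.T                                   # "source" vectors = rotated targets

# Closed-form Procrustes solution: W = U V^T from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

aligned = X @ W  # source vectors mapped into the target space
```

Because a perfect rotation exists in this synthetic setup, the recovered W maps X exactly onto Y; with real embeddings the spaces are only approximately isometric, which is where the structural-similarity assumptions above come in.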
author Aldarmaki, Hanan
author_facet Aldarmaki, Hanan
author_sort Aldarmaki, Hanan
title Cross-Lingual Alignment of Word & Sentence Embeddings
title_short Cross-Lingual Alignment of Word & Sentence Embeddings
title_full Cross-Lingual Alignment of Word & Sentence Embeddings
title_fullStr Cross-Lingual Alignment of Word & Sentence Embeddings
title_full_unstemmed Cross-Lingual Alignment of Word & Sentence Embeddings
title_sort cross-lingual alignment of word & sentence embeddings
publisher The George Washington University
publishDate 2019
url http://pqdtopen.proquest.com/#viewpdf?dispub=13812118
work_keys_str_mv AT aldarmakihanan crosslingualalignmentofwordsentenceembeddings
_version_ 1719017511726350336
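The matrix factorization model based on word-sentence co-occurrence statistics mentioned in the description can be illustrated with a truncated SVD over a toy count matrix (the matrix, dimensionality, and singular-value scaling below are assumptions for illustration, not the thesis's model):

```python
import numpy as np

# Rows = words, columns = sentences; entries = toy co-occurrence counts.
C = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
])

k = 2  # embedding dimensionality
U, s, Vt = np.linalg.svd(C, full_matrices=False)
word_emb = U[:, :k] * s[:k]   # one row per word
sent_emb = Vt[:k, :].T        # one row per sentence
```

Factorizing the same matrix jointly over parallel sentences is one way such a model can be extended cross-lingually, which is the kind of framework the description groups under joint modeling.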