Cross-Lingual Alignment of Word & Sentence Embeddings


Bibliographic Details
Main Author: Aldarmaki, Hanan
Language: EN
Published: The George Washington University 2019
Subjects: Language | Computer science
Online Access: http://pqdtopen.proquest.com/#viewpdf?dispub=13812118
id ndltd-PROQUEST-oai-pqdtoai.proquest.com-13812118
record_format oai_dc
spelling ndltd-PROQUEST-oai-pqdtoai.proquest.com-138121182019-04-12T03:46:12Z Cross-Lingual Alignment of Word & Sentence Embeddings Aldarmaki, Hanan Language|Computer science The George Washington University 2019-04-11 00:00:00.0 thesis http://pqdtopen.proquest.com/#viewpdf?dispub=13812118 EN
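The description mentions composing representations for phrases and sentences by combining word embeddings with arithmetic operations such as vector averaging. A minimal sketch of that averaging step, using toy three-dimensional vectors and a hypothetical `average_embedding` helper (neither taken from the thesis):

```python
import numpy as np

# Toy word vectors; real embeddings would be learned from a large corpus.
word_vectors = {
    "cross": np.array([0.1, 0.3, -0.2]),
    "lingual": np.array([0.0, 0.5, 0.4]),
    "alignment": np.array([-0.3, 0.2, 0.1]),
}

def average_embedding(tokens, vectors):
    """Average the embeddings of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(next(iter(vectors.values())).shape)
    return np.mean(known, axis=0)

sentence = average_embedding(["cross", "lingual", "alignment"], word_vectors)
```

Averaging is order-insensitive, which is why the description contrasts it with composition parameters estimated from data under task-specific objectives.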
collection NDLTD
language EN
sources NDLTD
topic Language|Computer science
spellingShingle Language|Computer science
Aldarmaki, Hanan
Cross-Lingual Alignment of Word & Sentence Embeddings
description One of the notable developments in current natural language processing is the practical efficacy of probabilistic word representations, where words are embedded in high-dimensional continuous vector spaces that are optimized to reflect their distributional relationships. For sequences of words, such as phrases and sentences, distributional representations can be estimated by combining word embeddings using arithmetic operations like vector averaging, or by estimating composition parameters from data using various objective functions. The quality of these compositional representations is typically assessed by their performance as features in extrinsic supervised classification benchmarks. Word and compositional embeddings for a single language can be induced without supervision using a large training corpus of raw text. To handle multiple languages and dialects, bilingual dictionaries and parallel corpora are often used either to learn cross-lingual embeddings directly or to align pre-trained monolingual embeddings. In this work, we explore and develop various cross-lingual alignment techniques, compare the performance of the resulting cross-lingual embeddings, and study their characteristics. We pay particular attention to the bilingual data requirements of each approach, since lower requirements facilitate wider language expansion. To begin with, we analyze various monolingual general-purpose sentence embedding models to better understand their qualities. By comparing their performance on extrinsic evaluation benchmarks and unsupervised clustering, we infer the characteristics of the most dominant features in their respective vector spaces. We then look into various cross-lingual alignment frameworks with different degrees of supervision.

We begin with unsupervised word alignment, for which we propose an approach for inducing cross-lingual word mappings with no prior bilingual resources. We rely on assumptions about the consistency and structural similarity between the monolingual vector spaces of different languages. Using comparable monolingual news corpora, our approach yields highly accurate word mappings for two language pairs: French to English and Arabic to English. With various refinement heuristics, the performance of the unsupervised alignment methods approaches that of supervised dictionary mapping.

Finally, we develop and evaluate different alignment approaches based on parallel text. We show that incorporating context in the alignment process often leads to significant improvements in performance. At the word level, we explore the alignment of contextualized word embeddings that are dynamically generated for each sentence. At the sentence level, we develop and investigate three alignment frameworks: joint modeling, representation transfer, and sentence mapping, applied to different sentence embedding models. We experiment with a matrix factorization model based on word-sentence co-occurrence statistics and two general-purpose neural sentence embedding models. We report the performance of the various cross-lingual models with different sizes of parallel corpora to assess the minimal degree of supervision required by each alignment framework.
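The supervised dictionary mapping that the unsupervised methods are compared against is commonly formulated as an orthogonal Procrustes problem: given matrices X and Y whose rows are embeddings of dictionary-paired words, find the orthogonal W minimizing the Frobenius norm of XW - Y. A sketch on synthetic data (the random vectors and planted rotation are illustrative assumptions; the thesis's exact formulation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 50
Y = rng.normal(size=(n, d))                        # "target-language" vectors
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # hidden orthogonal rotation
X = Y @ R_true.T                                   # "source" vectors = rotated targets

# Closed-form Procrustes solution: W = U V^T from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

aligned = X @ W  # source vectors mapped into the target space
```

Because a perfect rotation exists in this synthetic setup, the recovered W maps X exactly onto Y; with real embeddings the spaces are only approximately isometric, which is where the structural-similarity assumptions above come in.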
author Aldarmaki, Hanan
author_facet Aldarmaki, Hanan
author_sort Aldarmaki, Hanan
title Cross-Lingual Alignment of Word & Sentence Embeddings
title_short Cross-Lingual Alignment of Word & Sentence Embeddings
title_full Cross-Lingual Alignment of Word & Sentence Embeddings
title_fullStr Cross-Lingual Alignment of Word & Sentence Embeddings
title_full_unstemmed Cross-Lingual Alignment of Word & Sentence Embeddings
title_sort cross-lingual alignment of word & sentence embeddings
publisher The George Washington University
publishDate 2019
url http://pqdtopen.proquest.com/#viewpdf?dispub=13812118
work_keys_str_mv AT aldarmakihanan crosslingualalignmentofwordsentenceembeddings
_version_ 1719017511726350336
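The matrix factorization model based on word-sentence co-occurrence statistics mentioned in the description can be illustrated with a truncated SVD over a toy count matrix (the matrix, dimensionality, and singular-value scaling below are assumptions for illustration, not the thesis's model):

```python
import numpy as np

# Rows = words, columns = sentences; entries = toy co-occurrence counts.
C = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
])

k = 2  # embedding dimensionality
U, s, Vt = np.linalg.svd(C, full_matrices=False)
word_emb = U[:, :k] * s[:k]   # one row per word
sent_emb = Vt[:k, :].T        # one row per sentence
```

Factorizing the same matrix jointly over parallel sentences is one way such a model can be extended cross-lingually, which is the kind of framework the description groups under joint modeling.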