Adding More Languages Improves Unsupervised Multilingual Part-of-Speech Tagging: A Bayesian Non-Parametric Approach

Bibliographic Details
Main Authors: Snyder, Benjamin (Contributor), Naseem, Tahira (Contributor), Eisenstein, Jacob (Contributor), Barzilay, Regina (Contributor)
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory (Contributor), Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Contributor)
Format: Article
Language: English
Published: Association for Computational Linguistics, 2010-10-07T13:12:43Z.
Description
Summary: We investigate the problem of unsupervised part-of-speech tagging when raw parallel data is available in a large number of languages. Patterns of ambiguity vary greatly across languages and therefore even unannotated multilingual data can serve as a learning signal. We propose a non-parametric Bayesian model that connects related tagging decisions across languages through the use of multilingual latent variables. Our experiments show that performance improves steadily as the number of languages increases.
National Science Foundation (U.S.) (CAREER grant IIS-0448168)
National Science Foundation (U.S.) (CAREER grant IIS-0835445)