Bilingual distributed word representations from document-aligned comparable data

Vulić, I; Moens, MF

Published version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/294803

Repository DOI

https://doi.org/10.17863/CAM.9718

Files

Published version (769.94 KB)

Type

Article

Authors

Vulić, I

Moens, MF

Abstract

We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs, which relied heavily on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, the article shows that BWEs may be learned solely on the basis of document-aligned comparable data, without any additional lexical resources or syntactic information. We present a comparison of our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction and (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data as well as prior BWE-based models, and achieve the best reported results on both tasks for all three tested language pairs.

Keywords

4605 Data Management and Data Science, 46 Information and Computing Sciences, Basic Behavioral and Social Science, Behavioral and Social Science, Clinical Research

Journal Title

Journal of Artificial Intelligence Research

Journal ISSN

1076-9757

Volume Title

55

Publisher

AI Access Foundation

Publisher DOI

https://doi.org/10.1613/jair.4986

Rights

http://www.rioxx.net/licenses/all-rights-reserved

Sponsorship

European Research Council (648909)

This work was done while Ivan Vulić was a postdoctoral researcher at the Department of Computer Science, KU Leuven, supported by the PDM Kort fellowship (PDMK/14/117). The work was also supported by the SCATE project (IWT-SBO 130041) and the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (648909).

Collections

Cambridge University Research Outputs