Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

Edman, Lukas; Toral Ruiz, Antonio; van Noord, Gertjan

oai:pure.rug.nl:publications/680bb044-42d6-4f27-9a25-236912b32c17

Authors: Lukas Edman
Antonio Toral Ruiz
Gertjan van Noord
Publication date: 1 January 2020
Abstract

Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.

Last time updated on 03/06/2022

This paper was published in ARTS repository - University of Groningen.
