
Why We Need New Evaluation Metrics for NLG

Abstract

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: we investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results suggest that automatic metrics perform reliably at the system level and can support system development by finding cases where a system performs poorly.
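
The core measurement behind the abstract's claim is the correlation between automatic metric scores and human judgements of the same system outputs. The snippet below is a minimal sketch of that setup, not the authors' actual pipeline: it scores hypothetical outputs with smoothed sentence-level BLEU via NLTK and computes a Spearman rank correlation against made-up human ratings. All data, and the choice of NLTK and SciPy, are illustrative assumptions.

# Minimal sketch, assuming NLTK and SciPy are installed; the data below
# is invented and the pipeline is illustrative, not the paper's own.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Hypothetical (reference, system output, human rating) triples in the
# restaurant domain typical of end-to-end NLG datasets.
data = [
    ("the wrestlers is a cheap pub near the station",
     "the wrestlers is a pub near the station", 4.5),
    ("aromi serves italian food in the city centre",
     "aromi is an italian restaurant", 3.0),
    ("the mill is a low cost coffee shop by the river",
     "the mill is a coffee shop", 3.5),
    ("loch fyne is a family friendly seafood restaurant",
     "loch fyne serves seafood and is kid friendly", 4.0),
]

# Smoothing avoids zero BLEU on short sentences with missing n-grams.
smooth = SmoothingFunction().method1
bleu_scores = [
    sentence_bleu([ref.split()], out.split(), smoothing_function=smooth)
    for ref, out, _ in data
]
human_ratings = [rating for _, _, rating in data]

# Segment-level rank correlation between metric and human judgements;
# weak correlations at this level are the kind of result the paper reports.
rho, p = spearmanr(bleu_scores, human_ratings)
print(f"BLEU: {[round(b, 3) for b in bleu_scores]}")
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")

A weak or unstable rho across datasets and systems is exactly the data- and system-specificity the abstract describes; aggregating the same scores per system (rather than per segment) is what recovers the more reliable system-level behaviour.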

This paper was published in Heriot Watt Pure.
