Pathologies of Neural Models Make Interpretations Difficult

Feng, Shi; Wallace, Eric; Grissom II, Alvin; Iyyer, Mohit; Rodriguez, Pedro; Boyd-Graber, Jordan

doi:10.18653/v1/D18-1407

Computer Science > Computation and Language

arXiv:1804.07781 (cs)

[Submitted on 20 Apr 2018 (v1), last revised 28 Aug 2018 (this version, v3)]

Title: Pathologies of Neural Models Make Interpretations Difficult

Authors: Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, Jordan Boyd-Graber


Abstract: One way to interpret neural model predictions is to highlight the most important input features, for example with a heatmap visualization over the words in an input sentence. In existing interpretation methods for NLP, a word's importance is determined either by input perturbation (measuring the decrease in model confidence when that word is removed) or by the gradient with respect to that word. To understand the limitations of these methods, we use input reduction, which iteratively removes the least important word from the input. This exposes pathological behaviors of neural models: the remaining words appear nonsensical to humans and are not the ones determined as important by interpretation methods. As we confirm with human experiments, the reduced examples lack information to support the prediction of any label, but models still make the same predictions with high confidence. To explain these counterintuitive results, we draw connections to adversarial examples and confidence calibration: pathological behaviors reveal difficulties in interpreting neural models trained with maximum likelihood. To mitigate their deficiencies, we fine-tune the models by encouraging high entropy outputs on reduced examples. Fine-tuned models become more interpretable under input reduction without accuracy loss on regular examples.
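The input reduction procedure described in the abstract can be illustrated with a minimal greedy sketch. It is not the authors' implementation: `toy_model`, its word weights, and the function names are hypothetical stand-ins, and the stopping rule (stop as soon as removing the least important word would change the prediction) is one simple reading of the description. Importance is measured by input perturbation, that is, the drop in confidence for the predicted label when a word is removed.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A model maps a token sequence to a list of class probabilities.
Model = Callable[[Sequence[str]], List[float]]


def toy_model(tokens: Sequence[str]) -> List[float]:
    """Hypothetical bag-of-words classifier: [P(negative), P(positive)]."""
    weights = {"great": 2.0, "good": 1.0, "bad": -1.5, "boring": -2.0}
    score = sum(weights.get(t, 0.0) for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_pos, p_pos]


def argmax(probs: Sequence[float]) -> int:
    return max(range(len(probs)), key=lambda k: probs[k])


def input_reduction(model: Model, tokens: Sequence[str]) -> Tuple[List[str], int, float]:
    """Greedily remove the least important word while the prediction is unchanged."""
    tokens = list(tokens)
    label = argmax(model(tokens))  # prediction on the full input
    while len(tokens) > 1:
        base = model(tokens)[label]
        candidates = []
        for i in range(len(tokens)):
            reduced = tokens[:i] + tokens[i + 1:]
            probs = model(reduced)
            # Leave-one-out importance: confidence drop when word i is removed.
            candidates.append((base - probs[label], i, argmax(probs)))
        _, i, new_label = min(candidates)  # least important word
        if new_label != label:
            break  # removing it would flip the prediction, so stop here
        tokens.pop(i)
    return tokens, label, model(tokens)[label]


reduced, label, confidence = input_reduction(
    toy_model, "the movie was great but boring and good".split()
)
print(reduced, label, round(confidence, 3))
```

With a real neural model in place of `toy_model`, the returned `reduced` tokens are the kind of short input the abstract refers to: the prediction stays the same, often with high confidence, even when the remaining words look uninformative to a human reader.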

Comments:	EMNLP 2018 camera ready
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:1804.07781 [cs.CL]
	(or arXiv:1804.07781v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1804.07781
Related DOI:	https://doi.org/10.18653/v1/D18-1407

Submission history

From: Shi Feng
[v1] Fri, 20 Apr 2018 18:18:06 UTC (1,405 KB)
[v2] Tue, 14 Aug 2018 17:01:55 UTC (1,040 KB)
[v3] Tue, 28 Aug 2018 15:53:19 UTC (6,183 KB)
