Towards the necessity for debiasing natural language inference datasets

Mithun Paul Panenghat, Sandeep Suntwal, Faiz Rafique, Rebecca Sharp, Mihai Surdeanu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Modeling natural language inference is a challenging task. With large annotated datasets available, it has become feasible to train complex neural-network-based inference methods that achieve state-of-the-art performance. However, it has been shown that these models also learn from the subtle biases inherent in these datasets (Gururangan et al., 2018). In this work, we explore two techniques for delexicalization that modify the datasets in such a way that we can control the importance that neural-network-based methods place on lexical entities. We demonstrate that the proposed methods not only maintain in-domain performance but also improve performance in some out-of-domain settings. For example, when using the delexicalized version of the FEVER dataset, the in-domain performance of a state-of-the-art neural network method dropped by only 1.12%, while its out-of-domain performance on the FNC dataset improved by 4.63%. We release the delexicalized versions of three common datasets used in natural language inference. These datasets are delexicalized using two methods: one that replaces the lexical entities in an overlap-aware manner, and a second that additionally incorporates semantic lifting of nouns and verbs to their WordNet hypernym synsets.
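
To make the idea of overlap-aware replacement concrete, the following is a minimal sketch and not the authors' released implementation. It assumes that tokens arrive already tagged with entity labels (the tagger itself is out of scope), that each token is masked individually, and that placeholders take the hypothetical form `TAG-c1` / `TAG-e1`. An entity first seen in the claim keeps the same placeholder when it reappears in the evidence, so the overlap signal survives even though the surface word is removed.

```python
from typing import Dict, List, Tuple

# (surface form, entity tag); the tag "O" marks a non-entity token.
# The tag names used here are illustrative assumptions.
Token = Tuple[str, str]


def delexicalize_pair(
    claim: List[Token], evidence: List[Token]
) -> Tuple[List[str], List[str]]:
    """Mask entity tokens; entities shared by claim and evidence share a placeholder."""
    placeholders: Dict[Tuple[str, str], str] = {}  # (word, tag) -> placeholder
    counters: Dict[Tuple[str, str], int] = {}      # (tag, side) -> last index used

    def mask(tokens: List[Token], side: str) -> List[str]:
        out: List[str] = []
        for word, tag in tokens:
            if tag == "O":
                out.append(word)  # ordinary words are kept as-is
                continue
            key = (word.lower(), tag)
            if key not in placeholders:
                # First sighting: allocate a new index for this tag on this side.
                n = counters.get((tag, side), 0) + 1
                counters[(tag, side)] = n
                placeholders[key] = f"{tag}-{side}{n}"
            out.append(placeholders[key])
        return out

    # The claim is processed first, so overlapping entities carry claim-side ids.
    return mask(claim, "c"), mask(evidence, "e")


claim = [("Obama", "PERSON"), ("visited", "O"), ("Paris", "LOCATION")]
evidence = [("Obama", "PERSON"), ("flew", "O"), ("to", "O"), ("Berlin", "LOCATION")]
masked_claim, masked_evidence = delexicalize_pair(claim, evidence)
```

In this toy pair, "Obama" maps to the same placeholder on both sides, while "Paris" and "Berlin" receive distinct claim-side and evidence-side placeholders. The second method named in the abstract, lifting nouns and verbs to WordNet hypernym synsets, would be an additional step and is not shown here.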

Original language: English (US)
Title of host publication: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 6883-6888
Number of pages: 6
ISBN (Electronic): 9791095546344
State: Published - 2020
Event: 12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France
Duration: May 11 2020 → May 16 2020

Publication series

Name: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Conference

Conference: 12th International Conference on Language Resources and Evaluation, LREC 2020
Country/Territory: France
City: Marseille
Period: 5/11/20 → 5/16/20

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Library and Information Sciences
  • Linguistics and Language
