Unsolved Problems in Natural Language Datasets

Garbage in, garbage out. You don’t have to be an ML expert to have heard this phrase. Models uncover patterns in the data, so when the data is broken, they develop broken behavior. This is why researchers allocate significant resources towards curating datasets. However, despite best efforts, it is nearly impossible to collect perfectly clean data, especially at the scale demanded by deep learning.

This article discusses popular natural language datasets that turned out to violate fundamental principles of machine learning and data science, despite being produced by experts in the field. Some of these flaws were exposed and quantified years after the datasets were published and heavily used, which shows how arduous data collection and validation are. Here are some of the main impediments:

Machine learning is data hungry. The sheer volume of data needed for ML (deep learning in particular) calls for automation, i.e., mining the Internet. Datasets end up inheriting undesirable properties from the Internet (e.g., duplication, statistical biases, falsehoods) that are non-trivial to detect and remove.
Desiderata cannot be captured exhaustively. Even in the presence of an oracle that could produce infinite data according to some predefined rules, it would be practically infeasible to enumerate all requirements. Consider the training data for a conversational bot. We can express general desiderata like diverse topics, respectful communication, or balanced exchange between interlocutors. But we don’t have enough imagination to specify all the relevant parameters.
Humans take the path of least resistance. Some data collection efforts are still manageable at human scale. But we ourselves are not flawless and, despite our best efforts, are subconsciously inclined to take shortcuts. If you were tasked with writing a statement that contradicts the premise "The dog is sleeping", what would your answer be? Continue reading to find out whether you’d be part of the problem.

Overlapping training and evaluation sets

ML practitioners split their data three ways: there’s a training set for actual learning, a validation set for hyperparameter tuning, and an evaluation set for measuring the final quality of the model. It is common knowledge that these sets should be mostly disjoint. When evaluating on training data, you are measuring the model’s capacity to memorize rather than its ability to recognize patterns and apply them in new contexts.

This guideline sounds straightforward to apply, yet Lewis et al. [1] show in a 2020 paper that the most popular open-domain question answering datasets (open-QA) have a significant overlap between their training and evaluation sets. Their analysis includes WebQuestions, TriviaQA and Open Natural Questions – datasets created by reputable institutions and heavily used as QA benchmarks.

"We find that 60–70% of test-time answers are also present somewhere in the training sets. We also find that 30% of test-set questions have a near-duplicate paraphrase in their corresponding training sets." (Lewis et al. [1])

Of course, a 0% overlap between training and testing would not be ideal either. We do want some degree of memorization – models should be able to answer questions seen during training and know when to surface previously-seen answers. The real problem is benchmarking a model on a dataset with high training/evaluation overlap and making rushed conclusions about its generalization ability.

Lewis et al. [1] re-evaluate state-of-the-art QA models after partitioning the evaluation sets into three subsets: (a) question overlap – for which identical or paraphrased question-answer pairs occur in the training set, (b) answer overlap only – for which the same answers occur in the training set, but paired with a different question, and (c) no overlap. QA models score vastly differently across these three subsets. For instance, when tested on Open Natural Questions, the state-of-the-art Fusion-in-Decoder model scores ~70% on question overlap, ~50% on answer overlap only, ~35% on no overlap.
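To make the three subsets concrete, here is a minimal sketch of such a partition. It is not the procedure used by Lewis et al. [1]: it assumes that overlap can be approximated by exact match after light normalization, so it will miss paraphrased questions. The function names and the toy question-answer pairs are invented for illustration.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def partition_by_overlap(train, test):
    """Split test (question, answer) pairs into three overlap buckets.

    Assumption: exact match after normalization stands in for overlap,
    so paraphrased questions are NOT detected by this sketch.
    """
    train_questions = {normalize(q) for q, _ in train}
    train_answers = {normalize(a) for _, a in train}
    buckets = {"question_overlap": [], "answer_overlap_only": [], "no_overlap": []}
    for question, answer in test:
        if normalize(question) in train_questions:
            buckets["question_overlap"].append((question, answer))
        elif normalize(answer) in train_answers:
            buckets["answer_overlap_only"].append((question, answer))
        else:
            buckets["no_overlap"].append((question, answer))
    return buckets


# Toy data, invented for illustration only.
train = [
    ("Who wrote Hamlet?", "William Shakespeare"),
    ("What is the capital of France?", "Paris"),
]
test = [
    ("who wrote hamlet", "William Shakespeare"),   # same question as in train
    ("Which city hosts the Louvre?", "Paris"),     # new question, seen answer
    ("What is the largest planet?", "Jupiter"),    # nothing seen before
]

for name, pairs in partition_by_overlap(train, test).items():
    print(name, len(pairs))
```

Reporting accuracy separately for each bucket, rather than as one number, is what exposes the gap between memorization and generalization.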

"It is clear that performance on these datasets cannot be properly understood by overall QA accuracy and suggest that in future, a greater emphasis should be placed on more behaviour-driven evaluation, rather than pursuing single-number overall accuracy figures." (Lewis et al. [1])

Spurious correlations

Just like humans, models take shortcuts and discover the simplest patterns that explain the data. For instance, consider a dog-vs-cat image classifier and a naïve training set in which all dog images are grayscale and all cat images are in full color. The model will most likely latch onto the spurious correlation between presence/absence of color and labels. When tested on a dog in full color, it will probably label it as a cat.

Gururangan et al. [2] showed that similar spurious correlations occur in two of the most popular natural language inference (NLI) datasets, SNLI (Stanford NLI) and MNLI (Multi-genre NLI). Given two statements, a premise and a hypothesis, the natural language inference task is to decide the relationship between them: entailment, contradiction or neutrality. Here is an example from the MNLI dataset:

Solving NLI requires understanding the subtle connection between the premise and the hypothesis. However, Gururangan et al. [2] revealed that, when models are shown the hypothesis alone, they can achieve accuracy as high as 67% on SNLI and 53% on MNLI. This is significantly higher than the most-frequent-class baseline (~35%), surfacing undeniable flaws in the datasets.

How did this happen? SNLI and MNLI were both crowd-sourced; humans were given a premise and asked to produce three hypotheses, one for each label. Which brings us back to the premise "The dog is sleeping". How would you contradict it? "The dog is not sleeping" is a perfectly reasonable candidate. However, if negation is consistently applied as a heuristic, models learn to detect contradiction by simply checking for the occurrence of "not" in the hypothesis, achieving high accuracy without even reading the premise.
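The negation shortcut can be illustrated with a toy hypothesis-only baseline. This is a hand-written rule, not the classifier used by Gururangan et al. [2], and the six examples are invented rather than taken from SNLI or MNLI. The negation word list and the fallback label are assumptions made for the sketch.

```python
from collections import Counter

# Assumed list of negation cues; a real analysis would learn these from data.
NEGATION_WORDS = {"not", "no", "nobody", "never", "nothing"}


def majority_class_accuracy(examples):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(label for _, _, label in examples)
    return counts.most_common(1)[0][1] / len(examples)


def hypothesis_only_predict(hypothesis, fallback_label="entailment"):
    """Predict a label from the hypothesis alone; the premise is never read."""
    tokens = set(hypothesis.lower().replace(".", " ").split())
    return "contradiction" if tokens & NEGATION_WORDS else fallback_label


def hypothesis_only_accuracy(examples, fallback_label="entailment"):
    correct = sum(
        hypothesis_only_predict(hypothesis, fallback_label) == label
        for _, hypothesis, label in examples
    )
    return correct / len(examples)


# (premise, hypothesis, label) triples, invented for illustration.
examples = [
    ("The dog is sleeping.", "The dog is not sleeping.", "contradiction"),
    ("The dog is sleeping.", "An animal is resting.", "entailment"),
    ("The dog is sleeping.", "The dog is dreaming about a bone.", "neutral"),
    ("A woman is reading.", "Nobody is reading.", "contradiction"),
    ("A woman is reading.", "A person is reading.", "entailment"),
    ("A woman is reading.", "A tall woman is reading.", "neutral"),
]

print(f"majority class:  {majority_class_accuracy(examples):.2f}")
print(f"hypothesis only: {hypothesis_only_accuracy(examples):.2f}")
```

On this balanced toy set the rule beats the majority-class baseline without ever looking at a premise, which is the same symptom the paper reports at scale.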

Gururangan et al. [2] reveal several other such annotation artefacts:

Entailment hypotheses were produced by generalizing words found in the premise (dog → animal, 3 → some, woman → person), making entailment recognizable from the hypothesis alone.
Neutral hypotheses were produced by injecting modifiers (tall, first, most) as an easy way to introduce information not entailed by the premise but also not contradictory to it.
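One common way to surface artefacts like these is to measure how strongly each hypothesis word is associated with each label, for example with pointwise mutual information (PMI). The sketch below is illustrative only and is not claimed to reproduce the authors' exact procedure. The add-one smoothing, the tokenizer and the nine hypotheses are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def word_label_pmi(examples, smoothing=1.0):
    """PMI(word, label) = log[p(word, label) / (p(word) * p(label))].

    Counts are the number of hypotheses with a given label that contain the
    word; `smoothing` is added to every (word, label) cell (an assumption).
    """
    labels = sorted({label for _, label in examples})
    joint = Counter()
    for hypothesis, label in examples:
        for word in set(tokenize(hypothesis)):
            joint[(word, label)] += 1
    vocab = sorted({word for word, _ in joint})
    smoothed = {(w, l): joint[(w, l)] + smoothing for w in vocab for l in labels}
    total = sum(smoothed.values())
    word_totals = {w: sum(smoothed[(w, l)] for l in labels) for w in vocab}
    label_totals = {l: sum(smoothed[(w, l)] for w in vocab) for l in labels}
    return {
        (w, l): math.log(smoothed[(w, l)] * total / (word_totals[w] * label_totals[l]))
        for w in vocab
        for l in labels
    }


def top_words(pmi, label, k=3):
    """The k words most associated with `label` (ties broken alphabetically)."""
    scored = [(score, word) for (word, lab), score in pmi.items() if lab == label]
    return [word for score, word in sorted(scored, key=lambda t: (-t[0], t[1]))[:k]]


# (hypothesis, label) pairs, invented for illustration.
hypotheses = [
    ("The dog is not sleeping", "contradiction"),
    ("Nobody is outside", "contradiction"),
    ("The man is not eating", "contradiction"),
    ("An animal is resting", "entailment"),
    ("A person is outside", "entailment"),
    ("Some people are eating", "entailment"),
    ("A tall man is eating", "neutral"),
    ("The first dog is sleeping", "neutral"),
    ("Most people are outside", "neutral"),
]

pmi = word_label_pmi(hypotheses)
for label in ("contradiction", "entailment", "neutral"):
    print(label, top_words(pmi, label))
```

Words with a high PMI for a single label are candidates for shortcuts a model could exploit without reading the premise.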

Despite these discoveries, MNLI remains part of GLUE, one of the most popular benchmarks for natural language processing. Because it is considerably larger than the other GLUE corpora (~400k data instances), MNLI is prominently featured in abstracts and used in ablation studies. While its shortcomings are starting to be recognized more widely, it is unlikely to lose its popularity until we find a better alternative.

Bias and under-representation

In the past few years, bias in machine learning has been exposed across multiple dimensions, including gender and race. In response to biased word embeddings and model behavior, the research community has been directing more and more effort towards bias mitigation, as illustrated by Sun et al. [3] in their comprehensive literature review.

Yann LeCun, co-recipient of the 2018 Turing Award, pointed out that biased data leads to biased model behavior:

His Tweet drew a lot of engagement from the research community, with mixed reactions. On the one hand, people acknowledged almost unanimously that bias does exist in many datasets. On the other hand, some disagreed with the perceived implication that bias stems solely from data, additionally blaming modeling and evaluation choices, and the unconscious bias of those who design and build the models. Yann LeCun later clarified that he does not consider data bias to be the only cause for societal bias in models:

Even though the dataset being discussed was an image corpus used for computer vision, natural language processing suffers no less from biased datasets. A prominent task that has exposed gender bias is coreference resolution, where a referring expression (like a pronoun) must be linked to an entity mentioned in the text. Here is an example from Webster et al. [4]:

In May, Fujisawa joined Mari Motohashi’s rink as the team’s skip, moving back from Karuizawa to Kitami where she had spent her junior days.

The authors point out that fewer than 15% of biographies on Wikipedia are about women, and that these biographies tend to discuss marriage and divorce more prominently than pages about men. Given that many NLP datasets are extracted from Wikipedia, this impacts many downstream tasks. For coreference resolution in particular, the lack of female pronouns or their association with certain stereotypes is problematic. For instance, how would you interpret the sentence "Mary saw her doctor as she entered the room"?
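A rough first check for this kind of under-representation is to count gendered pronouns in a corpus. The sketch below is a crude proxy under stated assumptions: the pronoun lists are limited to English and to two categories, and the two-sentence corpus is made up for the example. It says nothing about how the pronouns are used.

```python
import re
from collections import Counter

# Assumed pronoun lists; English only, and deliberately simplistic.
FEMININE = {"she", "her", "hers", "herself"}
MASCULINE = {"he", "him", "his", "himself"}


def pronoun_counts(texts):
    """Count feminine and masculine pronoun tokens across a list of texts."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in FEMININE:
                counts["feminine"] += 1
            elif token in MASCULINE:
                counts["masculine"] += 1
    return counts


def feminine_share(counts):
    """Fraction of gendered pronouns that are feminine, or None if there are none."""
    total = counts["feminine"] + counts["masculine"]
    return counts["feminine"] / total if total else None


# Toy corpus; the second sentence is invented for illustration.
corpus = [
    "Mary saw her doctor as she entered the room.",
    "He said his team would meet him after he finished.",
]

counts = pronoun_counts(corpus)
print(dict(counts), feminine_share(counts))
```

A skewed share is only a signal to investigate further; stereotyped associations would need an analysis of the surrounding context.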

Eliminating bias from the training data is an unsolved problem. First, we cannot exhaustively enumerate the axes along which bias manifests; in addition to gender and race, there are many other subtle dimensions that can invite bias (age, proper names, profession, etc.). Second, even if we selected a single axis like gender, removing bias would mean either dropping a large portion of the data or applying error-prone heuristics to turn male pronouns into under-represented gender pronouns. Instead, the research community is currently focusing on producing unbiased evaluation datasets, since their smaller scale is more conducive to manual intervention. This at least gives us the ability to measure the performance of our models more truthfully, across a representative sample of the population.
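With a balanced evaluation set in hand, one simple way to measure performance more truthfully is to report accuracy per group instead of a single aggregate. This is a minimal sketch, assuming each evaluation record carries a group tag; the record layout, the group names and the predictions are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, gold_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, gold, predicted in records:
        total[group] += 1
        correct[group] += int(gold == predicted)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best and worst per-group accuracy."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical coreference predictions: (group, gold antecedent, predicted antecedent).
records = [
    ("feminine", "Fujisawa", "Fujisawa"),
    ("feminine", "Mary", "doctor"),
    ("masculine", "John", "John"),
    ("masculine", "Peter", "Peter"),
]

per_group = accuracy_by_group(records)
print(per_group, accuracy_gap(per_group))
```

An aggregate accuracy of 75% on these records would hide the fact that every error falls in one group.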

Building natural language datasets is a never-ending process: we continuously collect data, validate it, acknowledge its shortcomings and work around them. Then we rinse and repeat whenever a new source becomes available. And in the meantime we make progress. All the datasets mentioned above, despite their flaws, have undeniably helped push natural language understanding forward.