Pitfalls in the Evaluation of Sentence Embeddings

Pitfalls

We revisit the common procedure for evaluating sentence embeddings and identify five pitfalls that need to be addressed:

  1. Sentence embeddings of larger size typically yield better results when a logistic regression classifier is trained on top. At equal embedding size, some models have little to no advantage over average word embeddings.
  2. When testing for semantic similarity, cosine similarity and Pearson correlation may give misleading results. For instance, we show that normalization can considerably narrow the gap between the best and worst models (see the first sketch below).
  3. Normalization of embeddings can affect performance on supervised evaluation tasks and may lead to rank changes.
  4. Reporting results only with logistic regression classifiers may not represent a realistic extrinsic evaluation setup (see the second sketch below).
  5. We illustrate that it remains unclear to what extent downstream tasks benefit from the different properties targeted by the many probing tasks.
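
To make pitfalls 2 and 3 concrete, here is a minimal Python sketch (not code from the paper) that contrasts Pearson correlations computed from raw versus per-dimension z-normalized embeddings. The arrays a, b, and gold are random placeholders for sentence-pair embeddings and human similarity scores:

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)

    # Random stand-ins for the embeddings of 200 sentence pairs and their
    # human similarity scores; in practice these would come from an STS
    # dataset and the sentence encoders under comparison.
    a = rng.normal(size=(200, 300))
    b = rng.normal(size=(200, 300))
    gold = rng.uniform(0, 5, size=200)

    def cosine(u, v):
        # Row-wise cosine similarity between two batches of vectors.
        norms = np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
        return np.sum(u * v, axis=1) / norms

    def z_normalize(x):
        # Standardize each embedding dimension to zero mean and unit
        # variance, one common normalization scheme.
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

    # Pearson correlation with the gold scores, with and without
    # normalization; on real embeddings the two settings can diverge
    # enough to change model rankings.
    print("raw:        r = %.3f" % pearsonr(cosine(a, b), gold)[0])
    print("normalized: r = %.3f"
          % pearsonr(cosine(z_normalize(a), z_normalize(b)), gold)[0])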

In our paper, we give several recommendations for avoiding these problems.
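
Pitfalls 1 and 4 both concern the evaluation classifier. The following sketch, again with random placeholder data rather than real embeddings, shows a scikit-learn setup that varies both the embedding size (e.g., 300 vs. 4096 dimensions) and the classifier (logistic regression vs. a small MLP) before drawing conclusions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)

    # Hypothetical pre-computed sentence embeddings of two sizes for the
    # same binary task; real evaluations would load embeddings produced
    # by the models under study.
    X_small = rng.normal(size=(500, 300))    # e.g., averaged word embeddings
    X_large = rng.normal(size=(500, 4096))   # e.g., a large sentence encoder
    y = rng.integers(0, 2, size=500)

    for size_name, X in [("300d", X_small), ("4096d", X_large)]:
        for clf in (LogisticRegression(max_iter=1000),
                    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)):
            # Cross-validated accuracy; comparing across both loops guards
            # against conclusions that hold only for one embedding size or
            # one classifier.
            acc = cross_val_score(clf, X, y, cv=3).mean()
            print(size_name, type(clf).__name__, "%.3f" % acc)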

Abstract

Deep learning models continuously break new records across different NLP tasks. At the same time, their success exposes weaknesses in model evaluation. Here, we compile several key pitfalls in the evaluation of sentence embeddings, a currently very popular NLP paradigm. These pitfalls include the comparison of embeddings of different sizes, the normalization of embeddings, and the low (and diverging) correlations between transfer and probing tasks. Our motivation is to challenge the current evaluation of sentence embeddings and to provide an easy-to-access reference for future research. Based on our insights, we also recommend better practices for future evaluations of sentence embeddings.

Bibtex

@inproceedings{eger-etal-2019-pitfalls,
    title = "Pitfalls in the Evaluation of Sentence Embeddings",
    author = {Eger, Steffen  and
      R{\"u}ckl{\'e}, Andreas  and
      Gurevych, Iryna},
    booktitle = "Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)",
    year = "2019",
    address = "Florence, Italy",
    url = "https://www.aclweb.org/anthology/W19-4308",
    pages = "55--60"
}