Evaluation Pitfalls and Sparsity Limitations in LLM-based Confidence Estimates for Classification

Links

Paper

Authors

Elena Merdjanovska
Omar Zaidan
Andreas Rücklé

Venue

ACL-Findings 2026

Abstract

Confidence estimation is essential when LLMs are used for classification, indicating when predictions can be trusted. However, common approaches such as verbalization produce extremely sparse outputs. For instance, Qwen3-32B verbalizes only eight unique confidence values on SST-2, with over half being exactly 95%—a pattern we observe consistently across four datasets and two LLMs. Besides limiting practical utility, we show that this sparsity critically affects evaluation: the choice of interpolation in area under the accuracy-rejection curve (AUARC) dramatically alters rankings, with consistency sampling dropping from best to worst under stepwise versus linear interpolation. We advocate for standardizing stepwise interpolation for a fairer comparison. Under such a fair evaluation, we find that weighting verbalized digits by token probabilities—a method we term verbalization logprobs—addresses sparsity and achieves the best AUARC (+2.3 points over vanilla verbalization) without incurring additional inference cost.
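To illustrate why the interpolation choice matters when confidence values are sparse, here is a minimal Python sketch. It is not the authors' implementation. The function names `arc_points` and `auarc`, the toy data, and the convention of holding the last accuracy constant up to a rejection rate of 1 are all assumptions made for illustration. With heavily tied confidences, only a few rejection rates are actually achievable. Linear interpolation credits a method for accuracies at rejection rates it cannot reach, while stepwise interpolation holds accuracy flat until the next achievable point.

```python
import numpy as np

def arc_points(confidence, correct):
    """Return the achievable (rejection rate, accuracy) points.

    A point exists only at each unique confidence value, because all
    samples sharing a confidence value must be rejected together.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    rates, accs = [], []
    for threshold in np.unique(confidence):  # ascending order
        kept = confidence >= threshold
        rates.append(1.0 - kept.mean())
        accs.append(correct[kept].mean())
    # Assumption: extend the curve to rejection rate 1 with the last accuracy.
    rates.append(1.0)
    accs.append(accs[-1])
    return np.array(rates), np.array(accs)

def auarc(confidence, correct, interpolation="stepwise"):
    """Area under the accuracy-rejection curve for the given interpolation."""
    rates, accs = arc_points(confidence, correct)
    widths = np.diff(rates)
    if interpolation == "stepwise":
        # Hold each accuracy constant until the next achievable point.
        return float(np.sum(accs[:-1] * widths))
    if interpolation == "linear":
        # Trapezoids between consecutive achievable points.
        return float(np.sum((accs[:-1] + accs[1:]) / 2.0 * widths))
    raise ValueError(f"unknown interpolation: {interpolation}")

# Toy data: six predictions but only two distinct confidence values.
sparse_conf = [0.95, 0.95, 0.95, 0.95, 0.60, 0.60]
is_correct = [1, 1, 1, 0, 0, 0]

stepwise_area = auarc(sparse_conf, is_correct, "stepwise")
linear_area = auarc(sparse_conf, is_correct, "linear")
```

On this toy input the linear area is larger than the stepwise area, because the straight line between the two achievable points assigns accuracies to rejection rates that no threshold can produce. When every confidence value is distinct, the steps become small and the two variants give nearly the same area.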

Bibtex

@inproceedings{merdjanovska-etal-2026-evaluation,
    title = "Evaluation Pitfalls and Sparsity Limitations in {LLM}-based Confidence Estimates for Classification",
    author = {Merdjanovska, Elena  and
      Zaidan, Omar  and
      R{\"u}ckl{\'e}, Andreas},
    booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {ACL} 2026",
    month = jul,
    year = "2026",
    address = "San Diego, California, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.findings-acl.1671/",
    pages = "33424--33435",
}