
A refusal-eval rubric for grounded document QA

Four buckets, one table, and why "correct" is the least interesting score in the set

“Refusal is a first-class capability” and “the rubric takes longer than you expect” are the kind of lines that sound rigorous and are also unfalsifiable unless someone actually writes the rubric down. So this post is the rubric itself: the four-bucket scoring model, the specific edge cases that trip it up, and the protocol I run before shipping a pipeline change.

The bucket problem

Most RAG evals collapse answers into “right” and “wrong” and then report accuracy or an F1 on exact-match. That was fine for SQuAD-shaped benchmarks with a single gold span per question. It is a disaster for production document QA, because the label space is wrong. The label space you actually need is:

  1. Correct. The answer is factually right, the citation exists in the corpus, and the citation supports the answer.
  2. Wrong. The answer is factually incorrect. The model committed to a claim the corpus does not support, or worse, one the corpus contradicts.
  3. Unsupported. The answer may be correct in the outside world, but the retrieval set does not contain evidence for it. This is the category that punishes “model trying to be helpful.”
  4. Refused. The system explicitly declined to answer and explained why: either retrieval returned nothing, or retrieval returned something irrelevant, or the question was outside the corpus’s scope. The user got an “I don’t have evidence for that, here’s what I do have” message and a next step.

The interesting property is that “correct” and “refused” are both acceptable outcomes, and “wrong” and “unsupported” are both unacceptable ones. A production system that is 60% correct / 35% refused / 3% wrong / 2% unsupported is better for regulated work than one that is 85% correct / 10% wrong / 5% unsupported. The aggregate “accuracy” of the first is 60%, the second is 85%, and the second will get your users fired.

If your eval doesn’t separate “wrong” from “unsupported,” it cannot distinguish “the model made something up” from “the retriever missed and the model knew enough not to guess.” Those are different engineering bugs with different fixes, and the rubric has to surface them separately or you will regress one while fixing the other and have no idea it happened.
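To make the label space concrete, here is a minimal sketch of the bucket assignment. The three grading signals (`refused`, `factually_right`, `evidence_in_retrieval`) are my naming for the judgments a grader produces, not part of any library:

```python
from enum import Enum

class Bucket(Enum):
    CORRECT = "correct"
    WRONG = "wrong"
    UNSUPPORTED = "unsupported"
    REFUSED = "refused"

def assign_bucket(refused: bool, factually_right: bool,
                  evidence_in_retrieval: bool) -> Bucket:
    # Refusal is checked first: a refusal is graded as REFUSED regardless
    # of whether the corpus could in principle have answered the question.
    if refused:
        return Bucket.REFUSED
    # A committed answer that is factually incorrect is WRONG.
    if not factually_right:
        return Bucket.WRONG
    # Factually right, but without evidence in the retrieval set it is
    # UNSUPPORTED: a retriever miss plus a model that guessed anyway.
    return Bucket.CORRECT if evidence_in_retrieval else Bucket.UNSUPPORTED
```

The ordering encodes the point above: WRONG and UNSUPPORTED are reached by different paths, so fixing one can silently regress the other unless both are tracked.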

The rubric in table form

Score every answer on five axes. One is categorical, two are binary (yes/no), one is a bounded integer, and one is an unbounded count. None of them require a human grader beyond the first pass; once the gold-standard pass is built, the scoring is automatable with a strong grader model, and you compare grader-model outputs to the gold pass for calibration.

| Axis | Type | What it asks |
| --- | --- | --- |
| factuality | one of {correct, wrong, unsupported, refused} | The headline bucket, above. |
| citation_exists | yes / no | Every citation the answer produces maps to a real document/page/passage in the retrieval set. |
| citation_supports | yes / no | Every cited passage, read in isolation, supports the claim it is attached to. |
| refusal_quality | 0-3 | If the answer refused: did it explain why, offer a next step, and do both without false urgency? Score 0 if the refusal was a flat “I don’t know,” 3 if it was “I don’t have evidence for X; the closest passages I found are Y and Z, which address a related but different question; you might want to check the contracts archive for W.” |
| extra_claim_count | int | Count of claims in the answer that were not cited. A “claim” is any sentence that makes a factual assertion. Zero is the target. Non-zero is a latent citation-drift bug waiting to graduate to a “wrong” score. |
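A per-answer score record can be a small dataclass; the field names mirror the table, and the `acceptable` helper is my shorthand for the correct/refused-vs-wrong/unsupported split, not anything canonical:

```python
from dataclasses import dataclass
from typing import Literal

Factuality = Literal["correct", "wrong", "unsupported", "refused"]

@dataclass
class AnswerScore:
    factuality: Factuality
    citation_exists: bool      # every cited source ID is real
    citation_supports: bool    # every cited passage backs its claim
    refusal_quality: int       # 0-3; only meaningful for refusals
    extra_claim_count: int     # uncited factual assertions; target is 0

    def acceptable(self) -> bool:
        # "correct" and "refused" are both acceptable outcomes;
        # "wrong" and "unsupported" are both unacceptable ones.
        return self.factuality in ("correct", "refused")
```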

The reason citation_exists and citation_supports are separate is that the two failure modes are completely different engineering problems. citation_exists: no means the generator hallucinated a source ID that was never in the retrieval set, which is a prompt/generation-layer bug and usually tractable. citation_exists: yes, citation_supports: no means the generator picked the right source ID but grounded the wrong claim against it, which is a reranker/passage-selection problem and usually much harder.
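The two checks can be sketched as one function. The citation shape and the `supports_claim` callable stand in for the grader call (model or human); both are hypothetical shapes for illustration, not a real API:

```python
def grade_citations(citations, retrieval_ids, supports_claim):
    """Return (citation_exists, citation_supports) for one answer.

    `citations` is a list of {"source_id", "claim", "passage"} dicts;
    `supports_claim(passage, claim)` is the grader judgment.
    """
    if not all(c["source_id"] in retrieval_ids for c in citations):
        # Hallucinated source ID: a generation-layer bug. The supports
        # axis is moot when the citation itself does not exist.
        return False, False
    # Every ID is real; now check each passage against its claim in isolation.
    return True, all(supports_claim(c["passage"], c["claim"]) for c in citations)
```

Short-circuiting on existence is the design choice: a `supports` verdict against a fabricated source ID is meaningless, so the harder reranker/passage-selection signal is only computed when the easier check passes.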

The protocol

For each pipeline change, the eval runs this way:

  1. A held-out question set of 100 grounded questions, covering the four factuality buckets in approximately 40/20/20/20 proportions. The skew away from “correct” is deliberate: a 100-question set with 95 correct and 5 challenging questions has almost no signal on the thing that actually matters.
  2. Ten unanswerable questions. Questions whose answer literally is not in the corpus, phrased plausibly. This is the single most useful signal for “is the model trying to be helpful?” and the single one most eval harnesses skip.
  3. Ten adversarial questions that look like they’re in the corpus but aren’t (questions about adjacent entities, wrong dates, superficially similar topics).
  4. Ten retrieval-null questions where we force the retriever to return nothing. These exercise the empty-retrieval path directly: what the system does when it has no passages to ground on.
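The composition above fits in a config literal; the counts are the post's own, the dict layout is illustrative:

```python
EVAL_SET = {
    "grounded": {"correct": 40, "wrong": 20, "unsupported": 20, "refused": 20},
    "unanswerable": 10,    # answer genuinely absent from the corpus
    "adversarial": 10,     # looks in-corpus, isn't
    "retrieval_null": 10,  # retriever forced to return nothing
}

grounded = sum(EVAL_SET["grounded"].values())
total = grounded + sum(v for k, v in EVAL_SET.items() if k != "grounded")
assert grounded == 100 and total == 130
```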

The total question set is 130. A pipeline change ships only if:

  • factuality score moves in the right direction, broken down bucket by bucket.
  • wrong count does not increase for any reason. This is a hard gate. A change that adds 3 corrects and 1 wrong is a change that gets rejected.
  • unsupported count does not increase.
  • refusal_quality mean does not regress.
  • extra_claim_count sum does not increase.

Ship decisions that look like “the overall accuracy went up” get rejected at the bucket level all the time. That is the rubric doing its job.
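The hard gates are mechanical enough to write down. This is a sketch under my own naming (`RunSummary`, `may_ship`); the bucket-by-bucket factuality review from the first bullet is a human judgment and is not modeled here:

```python
from dataclasses import dataclass

@dataclass
class RunSummary:
    wrong: int
    unsupported: int
    refusal_quality_mean: float
    extra_claim_sum: int

def may_ship(baseline: RunSummary, candidate: RunSummary) -> bool:
    # Every gate is a "no regression" check; any single failure rejects
    # the change, no matter how much the other numbers improve.
    return (candidate.wrong <= baseline.wrong
            and candidate.unsupported <= baseline.unsupported
            and candidate.refusal_quality_mean >= baseline.refusal_quality_mean
            and candidate.extra_claim_sum <= baseline.extra_claim_sum)
```

Note that a change adding three corrects and one wrong fails the first conjunct, which is exactly the "+3 correct, +1 wrong gets rejected" rule above.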

Model-comparison shape

Specific numbers are corpus-dependent, so this is the shape of the comparison rather than a leaderboard. Across four broad classes of model, the pattern I’ve seen is consistent:

  • Strongest frontier models tend to have the best citation_supports scores and the best refusal_quality when the prompt explicitly licenses refusal. They are also the ones most sensitive to whether the prompt licenses refusal at all: given a prompt that does not explicitly say “refuse cleanly when you don’t have evidence,” a strong model will be more helpful than is safe for regulated work. The hosted Claude family is the one I’ve put the most deployment hours on.
  • Strong open-weight models at the large end (70B-plus) get close on factuality but consistently score lower on citation_supports and noticeably lower on refusal_quality. A good prompt closes some of the gap, but never all of it.
  • Mid-size open-weight models (13-30B) are competitive on easy questions and noticeably behind on adversarial and unanswerable ones. They are the models most likely to produce a confident wrong answer on the null-retrieval bucket.
  • Small models (under 10B) are useful for the sub-tasks (classifier, router, metadata tagger) but not for the final generation step in this class of system. They tend to treat the refusal-licensing language in the prompt as a suggestion rather than an instruction.

The single most important finding is mundane: prompt engineering on the refusal license moves the buckets more than swapping one strong model for another. If you only have budget to iterate on one thing, iterate on the prompt’s refusal license.

What I’d like to build next

A public micro-harness on a public corpus (ContractNLI, GovReport, or a redacted SEC EDGAR subset) that reproduces the four-bucket scoring on any pipeline in about ten minutes, with a calibration note on using a grader model against a human-labeled gold pass. Grader-model-as-judge is only as good as its calibration, and the calibration step is the thing most people skip.
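The calibration step itself is small. A minimal version is exact-label agreement between the grader model and the human gold pass, overall and per bucket (deliberately simpler than a chance-corrected statistic like Cohen's kappa); the function name and return shape are mine:

```python
def grader_agreement(gold, grader):
    """Agreement between human gold labels and a grader model's labels.

    Returns (overall, per_bucket), where per_bucket keys are the gold
    buckets. Per-bucket numbers matter: a grader can look well calibrated
    overall while systematically mislabeling one small bucket.
    """
    assert len(gold) == len(grader) and gold
    per_bucket = {}
    for g, m in zip(gold, grader):
        hit, n = per_bucket.get(g, (0, 0))
        per_bucket[g] = (hit + (g == m), n + 1)
    overall = sum(g == m for g, m in zip(gold, grader)) / len(gold)
    return overall, {b: hit / n for b, (hit, n) in per_bucket.items()}
```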