🎯 场景:开放式输出(长文 / 解释 / advice)的 LLM-as-judge 打分——固定 4 维度(factuality / instruction-following / coherence / completeness)+ 整体 verdict。无 gold 答案场景的主力打分卡。
Quick Use
Use when: You want a structured quality assessment of a single AI output across factuality / instruction-following / coherence / completeness. Fill in: {{task_description}} = what the AI was asked to do; {{model_output}} = the AI's response; {{reference_answer}} = optional gold reference (pass empty string if none). You'll get: Per-dimension scores 1-5, a holistic verdict, a one-sentence basis, and concrete issues. Output is JSON.
Purpose
Score a single open-ended model output against a rubric covering factuality, instruction-following, coherence, and completeness. Used for offline eval runs where exact-match metrics do not apply (long-form generation, advice, explanations). Output is a per-criterion score plus a holistic decision so you can both rank models and inspect specific weak dimensions.
Prompt
You are an evaluation judge. Score the model output on each rubric dimension
from 1 (poor) to 5 (excellent), then give a holistic verdict.
Rubric:
- factuality: Are factual claims correct? Treat unverifiable opinions
neutrally; only penalize false claims.
- instruction_following: Did the output do what the task description asked,
in the requested format and scope?
- coherence: Is the output well-structured, internally consistent, and
readable?
- completeness: Does the output cover the points needed to fully respond,
without obvious gaps?
Task description:
{{task_description}}
Model output:
{{model_output}}
Reference answer (may be empty if no reference exists):
{{reference_answer}}
Return ONLY this JSON object:
{
"scores": {
"factuality": 1-5,
"instruction_following": 1-5,
"coherence": 1-5,
"completeness": 1-5
},
"verdict": "excellent" | "good" | "borderline" | "poor",
"decision_basis": "<one sentence, <=40 words, no quotes from output>",
"issues": ["<concrete issue 1>", "<concrete issue 2>"]
}
If reference_answer is empty, judge factuality based on common knowledge and
internal consistency only; do not penalize for not matching a missing reference.Example
Input:
task_description: "Explain why the sky appears blue, in 2-3 sentences accessible to a curious 12-year-old."
model_output: "The sky looks blue because sunlight hits air molecules and scatters. Blue light scatters more than red because it has a shorter wavelength, so we see more blue from every direction in the sky."
reference_answer: ""Expected output:
{
"scores": {
"factuality": 5,
"instruction_following": 5,
"coherence": 5,
"completeness": 4
},
"verdict": "excellent",
"decision_basis": "Accurate Rayleigh scattering explanation in two sentences at the requested register; could note sunset reddening for completeness.",
"issues": ["Does not address why sunsets appear red, which would round out the explanation."]
}Failure Modes
- Self-preference bias — when judging output from the same model family, score is inflated. Mitigation: cross-judge with a different model family on a holdout slice; report the gap.
- Reference hugging — when reference_answer is provided, the judge punishes any output that diverges in style even if substantively correct. Mitigation: explicit instruction in the prompt that style differences are not penalized; show this in few-shots if drift persists.
- Verbosity reward — long outputs score higher on completeness regardless of substance. Mitigation: length-controlled eval slice; if completeness correlates >0.7 with length, retrain raters or add a "do not reward verbosity" line.
- Score compression — judge gives 4 to almost everything. Mitigation: inspect distribution; if entropy is low, increase temperature slightly or add few-shots with explicit 1, 3, and 5 examples.
Tuning Notes
- 模型差异:判官模型应当强于(或至少不弱于)被判模型;用 7B 判 70B 通常 得到压缩分布。常用配置:GPT-4 / Claude Sonnet 作为 judge。
- 温度:
0.0,结果稳定性优先;如需多次采样取均值,再略提温度(0.3)。 - agreement:上线前用 100 个人工 gold 样本算 quadratic-weighted kappa; 低于 0.5 不建议作为单一指标,可作为多指标之一。
- pair 与 pointwise 的关系:本卡是 pointwise;如需排序模型,pairwise (见
rlhf/pairwise-preference-labeler)通常一致性更高。
Changelog
0.1.0— Initial card.