LLM-as-Judge Rubric for Open-Ended Outputs

🎯 场景：开放式输出（长文 / 解释 / advice）的 LLM-as-judge 打分——固定 4 维度（factuality / instruction-following / coherence / completeness）+ 整体 verdict。无 gold 答案场景的主力打分卡。

Quick Use

Use when: You want a structured quality assessment of a single AI output across factuality / instruction-following / coherence / completeness. Fill in: {{task_description}} = what the AI was asked to do; {{model_output}} = the AI's response; {{reference_answer}} = optional gold reference (pass empty string if none). You'll get: Per-dimension scores 1-5, a holistic verdict, a one-sentence basis, and concrete issues. Output is JSON.

Purpose

Score a single open-ended model output against a rubric covering factuality, instruction-following, coherence, and completeness. Used for offline eval runs where exact-match metrics do not apply (long-form generation, advice, explanations). Output is a per-criterion score plus a holistic decision so you can both rank models and inspect specific weak dimensions.

Prompt

text

You are an evaluation judge. Score the model output on each rubric dimension
from 1 (poor) to 5 (excellent), then give a holistic verdict.

Rubric:
- factuality: Are factual claims correct? Treat unverifiable opinions
  neutrally; only penalize false claims.
- instruction_following: Did the output do what the task description asked,
  in the requested format and scope?
- coherence: Is the output well-structured, internally consistent, and
  readable?
- completeness: Does the output cover the points needed to fully respond,
  without obvious gaps?

Task description:
{{task_description}}

Model output:
{{model_output}}

Reference answer (may be empty if no reference exists):
{{reference_answer}}

Return ONLY this JSON object:
{
  "scores": {
    "factuality": 1-5,
    "instruction_following": 1-5,
    "coherence": 1-5,
    "completeness": 1-5
  },
  "verdict": "excellent" | "good" | "borderline" | "poor",
  "decision_basis": "<one sentence, <=40 words, no quotes from output>",
  "issues": ["<concrete issue 1>", "<concrete issue 2>"]
}

If reference_answer is empty, judge factuality based on common knowledge and
internal consistency only; do not penalize for not matching a missing reference.

Example

Input:

text

task_description: "Explain why the sky appears blue, in 2-3 sentences accessible to a curious 12-year-old."
model_output: "The sky looks blue because sunlight hits air molecules and scatters. Blue light scatters more than red because it has a shorter wavelength, so we see more blue from every direction in the sky."
reference_answer: ""

Expected output:

json

{
  "scores": {
    "factuality": 5,
    "instruction_following": 5,
    "coherence": 5,
    "completeness": 4
  },
  "verdict": "excellent",
  "decision_basis": "Accurate Rayleigh scattering explanation in two sentences at the requested register; could note sunset reddening for completeness.",
  "issues": ["Does not address why sunsets appear red, which would round out the explanation."]
}

Failure Modes

Self-preference bias — when judging output from the same model family, score is inflated. Mitigation: cross-judge with a different model family on a holdout slice; report the gap.
Reference hugging — when reference_answer is provided, the judge punishes any output that diverges in style even if substantively correct. Mitigation: explicit instruction in the prompt that style differences are not penalized; show this in few-shots if drift persists.
Verbosity reward — long outputs score higher on completeness regardless of substance. Mitigation: length-controlled eval slice; if completeness correlates >0.7 with length, retrain raters or add a "do not reward verbosity" line.
Score compression — judge gives 4 to almost everything. Mitigation: inspect distribution; if entropy is low, increase temperature slightly or add few-shots with explicit 1, 3, and 5 examples.

Tuning Notes

模型差异：判官模型应当强于（或至少不弱于）被判模型；用 7B 判 70B 通常得到压缩分布。常用配置：GPT-4 / Claude Sonnet 作为 judge。
温度：0.0，结果稳定性优先；如需多次采样取均值，再略提温度（0.3）。
agreement：上线前用 100 个人工 gold 样本算 quadratic-weighted kappa；低于 0.5 不建议作为单一指标，可作为多指标之一。
pair 与 pointwise 的关系：本卡是 pointwise；如需排序模型，pairwise （见 rlhf/pairwise-preference-labeler）通常一致性更高。

Changelog

0.1.0 — Initial card.

Quick Use ​

Purpose ​

Prompt ​

Example ​

Failure Modes ​

Tuning Notes ​

Changelog ​