Pairwise Judge with Position-Bias Probe

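🎯 Scenario: high-quality pairwise judging with a two-call protocol: run A/B once, run B/A once, then merge the results according to position_bias_protocol. This can reduce a 10-25% position bias to 5-10%. Suited to critical RLHF data construction and benchmark evals.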

Quick Use

Use when: You're running pairwise LLM-as-judge evaluation and want to detect / control for the well-known position bias (judge prefers whichever response is shown first). Fill in: {{task_description}} = what the task was; {{response_a}} and {{response_b}} = the two responses to compare. You'll get: A first-pass verdict, an instruction to re-run with positions swapped, and an interpretation rule for combining both runs into a robust verdict. Output is JSON.
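As a minimal sketch of the fill-in step, the snippet below renders both orderings of a prompt from one template. The `TEMPLATE` string is a shortened, hypothetical stand-in for the card's prompt, and `render` and `render_both_orderings` are illustrative helper names, not part of the card.

```python
import re

# Hypothetical, shortened stand-in for the card's prompt; only the
# {{placeholder}} names are taken from the card itself.
TEMPLATE = (
    "Task description:\n{{task_description}}\n\n"
    "Response A:\n{{response_a}}\n\n"
    "Response B:\n{{response_b}}\n"
)


def render(template: str, values: dict) -> str:
    """Fill {{name}} placeholders in a single pass.

    A single pass means text inside a response that happens to look like
    a placeholder is never substituted a second time.
    """
    def substitute(match):
        return values[match.group(1)]  # KeyError if a placeholder is unfilled

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


def render_both_orderings(template: str, task: str, first: str, second: str):
    """Return the (A, B) prompt and the position-swapped (B, A) prompt."""
    forward = render(template, {
        "task_description": task, "response_a": first, "response_b": second,
    })
    swapped = render(template, {
        "task_description": task, "response_a": second, "response_b": first,
    })
    return forward, swapped
```

Rendering both prompts up front keeps the swap mechanical, so the second call cannot be forgotten or built differently from the first.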

Purpose

Run a pairwise comparison while explicitly instrumenting for position bias — the well-documented failure mode where an LLM judge prefers whichever response was shown first (or sometimes last) regardless of actual quality. This card is half of a two-call protocol: the first call produces a verdict on (A, B); the caller then re-invokes with swapped positions and uses this card's position_bias_protocol to combine the two verdicts into a robust judgment. Output is structured so per-task position-bias rates can be tracked across a benchmark.

Prompt

text

You are a pairwise judge that is aware of, and instrumented for,
position bias. Score this single (response_a, response_b) ordering;
the caller will run a second pass with positions swapped and combine
the verdicts.

Task description:
{{task_description}}

Response A:
{{response_a}}

Response B:
{{response_b}}

Return ONLY this JSON object:
{
  "verdict": "a" | "b" | "tie",
  "confidence": "high" | "medium" | "low",
  "decision_basis": "<one sentence, <=30 words, no internal CoT>",
  "swapped_call_instruction": "Re-invoke this card with response_a and response_b swapped. Then apply position_bias_protocol below to combine verdicts.",
  "position_bias_protocol": {
    "if_both_pick_same_letter": "Strong consensus on the WINNER. Treat the result as the verdict.",
    "if_picks_swap_with_positions": "Position bias detected — the judge picked first-position both times. Verdict is 'tie' regardless of either single call.",
    "if_one_run_is_tie": "Soft preference. Use the non-tie verdict but mark confidence as <=medium.",
    "if_picks_disagree_orthogonally": "Genuine disagreement: judge unstable on this pair. Mark as 'tie' or escalate to human."
  }
}

Calibration:
- Be willing to call ties when the responses are genuinely close.
  Forced binary decisions inflate position bias.
- Confidence "high" requires that you would expect the same verdict
  on a position-swap. If you suspect position influences your call,
  downgrade to "medium" or "low".

Example

Input:

text

task_description: "Explain in 2 sentences why the sky appears blue."
response_a: "The sky appears blue because of Rayleigh scattering: shorter blue wavelengths scatter more strongly off air molecules than longer red wavelengths, so we see more blue from every direction."
response_b: "Sky is blue because of the air. Blue color comes from the atmosphere reflecting sunlight."

Expected output:

json

{
  "verdict": "a",
  "confidence": "high",
  "decision_basis": "A names Rayleigh scattering and the wavelength mechanism; B says 'reflecting' which is technically wrong.",
  "swapped_call_instruction": "Re-invoke this card with response_a and response_b swapped. Then apply position_bias_protocol below to combine verdicts.",
  "position_bias_protocol": {
    "if_both_pick_same_letter": "Strong consensus on the WINNER. Treat the result as the verdict.",
    "if_picks_swap_with_positions": "Position bias detected — the judge picked first-position both times. Verdict is 'tie' regardless of either single call.",
    "if_one_run_is_tie": "Soft preference. Use the non-tie verdict but mark confidence as <=medium.",
    "if_picks_disagree_orthogonally": "Genuine disagreement: judge unstable on this pair. Mark as 'tie' or escalate to human."
  }
}

(In the second call, with A and B swapped, the judge should still pick the response that names Rayleigh scattering. That response now sits in slot B, so the raw verdict is "b". Mapped back to the original labels, both calls pick the same response, by content rather than by position. That is the if_both_pick_same_letter outcome: strong consensus, robust verdict.)
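The two-call combination can be sketched in code. This is one possible reading of position_bias_protocol, not a definitive implementation. It assumes the protocol's letters refer to the original labels, so the swapped call's raw letter is mapped back first. It also assumes that "disagree orthogonally" means the second slot won both times, and that two ties combine to a tie. `combine`, `judge_pair`, and the `judge` callable are hypothetical names.

```python
from typing import Callable

# Assumption: protocol letters refer to the ORIGINAL labels, so the raw
# letter returned by the swapped call is mapped back before comparing.
SWAP_BACK = {"a": "b", "b": "a", "tie": "tie"}


def combine(first_raw: str, second_raw: str) -> dict:
    """Merge the (A, B) verdict with the raw verdict of the swapped call."""
    first = first_raw
    second = SWAP_BACK[second_raw]
    if first == "tie" and second == "tie":
        # Not covered by the protocol table; treated as a tie (assumption).
        return {"verdict": "tie", "case": "both_runs_tie"}
    if first == second:
        return {"verdict": first, "case": "if_both_pick_same_letter"}
    if "tie" in (first, second):
        winner = second if first == "tie" else first
        return {"verdict": winner, "case": "if_one_run_is_tie",
                "max_confidence": "medium"}
    if first == "a":
        # Original A won when shown first, original B won when shown first.
        return {"verdict": "tie", "case": "if_picks_swap_with_positions"}
    # Original B won when shown second, original A won when shown second
    # (assumed reading of "disagree orthogonally").
    return {"verdict": "tie", "case": "if_picks_disagree_orthogonally",
            "escalate": True}


def judge_pair(judge: Callable[[str, str, str], str],
               task: str, resp_a: str, resp_b: str) -> dict:
    """Always run both orderings, then merge in code rather than in the LLM.

    `judge(task, first, second)` is a hypothetical wrapper that sends the
    card's prompt and returns only the "verdict" field.
    """
    first_raw = judge(task, resp_a, resp_b)
    second_raw = judge(task, resp_b, resp_a)
    return combine(first_raw, second_raw)
```

A judge that always answers "a" yields a combined "tie" under this sketch, which is how the protocol surfaces a pure first-slot preference.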

Failure Modes

Caller skips the second call — turning a position-bias-aware protocol into a regular pairwise judge. Mitigation: make the caller's loop unconditionally do both passes; do not let users short-circuit when the first call is "high" confidence.
Position-bias confusion — caller misreads "picks swap with positions" as the success case. The protocol's wording is precise: if call 1 picks first-position AND call 2 picks first-position (different letters because positions swapped), THAT'S position bias. Verify by tracking pairs where both calls picked the same positional slot.
Tie inflation — judge defaults to "tie" to avoid commitment. Track tie rate against a benchmark of known-decisive pairs; if it exceeds 30% on clearly-different responses, the rubric is too forgiving.
Confidence/verdict mismatch — high confidence but the second call disagrees orthogonally. Should be rare; if frequent, the judge is overconfident.
Format / length leakage into verdict — judge prefers the longer or more-formatted response regardless of substance. The position bias protocol catches POSITION confounds but not LENGTH/FORMAT confounds — those need separate length-controlled evals.
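Several of the failure modes above come down to counting outcomes across a benchmark. The sketch below shows one way to do that from raw verdict letters, meaning each call's own "a"/"b"/"tie" before any mapping back to original labels. `bias_report` and its output keys are hypothetical names chosen for illustration.

```python
from collections import Counter


def bias_report(runs: list) -> dict:
    """Summarize two-pass results.

    `runs` is a list of (first_raw, second_raw) pairs of raw verdict
    letters. The same raw letter in both calls means the same SLOT won
    twice, since the responses traded slots between the calls.
    """
    if not runs:
        raise ValueError("need at least one pair")
    counts = Counter()
    tie_calls = 0
    for first_raw, second_raw in runs:
        tie_calls += (first_raw == "tie") + (second_raw == "tie")
        if "tie" in (first_raw, second_raw):
            counts["pairs_with_tie"] += 1
        elif first_raw == second_raw == "a":
            counts["first_slot_both"] += 1
        elif first_raw == second_raw == "b":
            counts["second_slot_both"] += 1
        else:
            # Different raw letters: the same response won in both calls.
            counts["content_consistent"] += 1
    n = len(runs)
    report = {key: counts[key] / n for key in (
        "content_consistent", "first_slot_both",
        "second_slot_both", "pairs_with_tie")}
    # Per-call tie rate, for comparison against known-decisive pairs.
    report["tie_rate"] = tie_calls / (2 * n)
    return report
```

Here `first_slot_both` is the fraction of pairs where the first slot won both times, and `tie_rate` is the number to watch for tie inflation. Length and format confounds are not visible in these counts.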

Tuning Notes

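Model differences: strong judges (frontier-closed) typically show position bias on the order of 10-25%; mid-tier judges show 30-50%. This card's two-pass protocol brings that error down to 5-10% (backstopped by the tie that if_picks_swap_with_positions triggers).
Temperature: 0.0. Judging must be reproducible; this is a precondition for measuring position bias at all.
Call economics: this card requires 2x the number of calls. That is worth it in the critical stages of RLHF data construction (reward model training data). For quick dashboards or leaderboards a single pass is acceptable, but the results should be reported as not controlled for position bias.
Relationship to rlhf/pairwise-preference-labeler: that card is the workhorse for RLHF data construction (three HHH dimensions, single-pass calls); this card is the robust pairwise judge on the eval side (two-pass calls plus bias detection). The two are complementary: rlhf-pairwise optimizes for scale, this card for the trustworthiness of each comparison.
Relationship to rlhf/best-of-n-selector: use best-of-N when N>2; use this card when N=2 and trustworthiness matters; use rlhf-pairwise when N=2 and speed matters.
Implementation detail: the caller loop should be deterministic (run A/B first, then B/A) and should merge the two results according to the protocol table, rather than letting the LLM decide how to merge them.
Before launching a high-sensitivity benchmark: run 100-200 human-labeled gold samples to measure the raw position-bias rate (the disagreement rate between single-pass verdicts and human labels), then measure the disagreement rate with this card's control applied. The difference is the card's actual benefit.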
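The pre-launch check on gold samples reduces to two disagreement rates and their difference. A minimal sketch follows, assuming all verdicts (single-pass, combined, and human gold) are already expressed in the original labels "a", "b", or "tie". `disagreement_rate` and `protocol_benefit` are hypothetical helper names.

```python
def disagreement_rate(predicted: list, gold: list) -> float:
    """Fraction of samples where the judge verdict differs from the gold label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal length")
    return sum(p != g for p, g in zip(predicted, gold)) / len(gold)


def protocol_benefit(single_pass: list, combined: list, gold: list) -> dict:
    """Compare the single-pass judge with the two-pass protocol on one gold set."""
    raw = disagreement_rate(single_pass, gold)
    controlled = disagreement_rate(combined, gold)
    return {"raw": raw, "controlled": controlled, "benefit": raw - controlled}
```

With only 100-200 samples the difference carries real sampling noise, so a small positive benefit is weak evidence on its own.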

Changelog

0.1.0 — Initial card.
