Long-Context Pairwise Preference Labeler

🎯 场景：长输入 + 长输出场景的 pairwise 偏好——比短答 pairwise 难得多，judge 容易被长度 / 开头印象 / 局部细节带偏。本卡用结构化 multi-section 比较 + 局部 vs 全局加权 + 长输出特有的 failure mode 检测。

Quick Use

Use when: You're labeling preferences on long-form pairs (long context input + long-form responses, e.g. research summaries, long-form Q&A, multi-turn dialogue) where short-answer pairwise heuristics break down. Fill in: {{long_input}} = long context / document / conversation; {{response_a}} and {{response_b}} = the long-form responses to compare. You'll get: Per-section judgment, overall verdict, and detected long-context-specific failure modes (mid-content drift, hallucinated continuation). Output is JSON.

Purpose

Label pairwise preference for long-form responses to long inputs. Short-answer pairwise judging breaks down when responses are multi-section (the judge's overall verdict can be dominated by the opening or one prominent section). This card structures the comparison into per-section judgments + a weighted aggregation, and explicitly checks for long-form failure modes (lost-in-the-middle drift, hallucinated continuation, structural completeness). Used for RLHF data on long-form tasks: research summaries, document Q&A, code explanation of large codebases, multi-turn dialogue endings.

Prompt

text

You judge a pairwise preference between two long responses to a
long input. Be careful: long-form judging has specific failure
modes you need to actively avoid.

Long input:
{{long_input}}

Response A:
{{response_a}}

Response B:
{{response_b}}

Steps:
1. Identify the structural sections of each response (intro,
   sections / paragraphs, conclusion). For multi-section responses,
   you'll judge per-section. For unsectioned long prose, treat as
   first-half / second-half.

2. Per section / half, score 1-5 on:
   - groundedness    : Is this section accurately based on the
                        long_input?
   - relevance       : Does it advance answering the implicit goal of
                        the input?
   - quality         : Is the writing clear, non-redundant?

3. Identify long-form failure modes (mark which response, if any):
   - "mid_content_drift"     : Response loses thread / contradicts
                                earlier in itself.
   - "lost_in_middle"        : Response under-uses content from the
                                middle of long_input.
   - "hallucinated_continuation": Later sections invent content not
                                  in long_input.
   - "false_completeness"    : Response feels complete but actually
                                missed major topics.
   - "format_collapse"       : Loses structure mid-response (bullets
                                start as paragraphs, headings disappear).

4. Aggregate:
   - Per-response score = weighted average across sections, weighting
     each section by its evident importance to the input.
   - Pick winner / tie based on aggregate, BUT downgrade the winner's
     score if it has a major failure mode (lost-in-middle on a key
     topic = severe).

Return ONLY this JSON object:
{
  "section_judgments": [
    {
      "section_label": "<short>",
      "response_a": {"groundedness": 1-5, "relevance": 1-5, "quality": 1-5},
      "response_b": {"groundedness": 1-5, "relevance": 1-5, "quality": 1-5}
    }
  ],
  "long_form_failures": {
    "response_a": ["<failure mode>"],
    "response_b": ["<failure mode>"]
  },
  "aggregate_scores": {
    "response_a": <float>,
    "response_b": <float>
  },
  "verdict": "a" | "b" | "tie",
  "decision_basis": "<one or two sentences>"
}

Example

Input:

text

long_input: <a 5000-word research paper about Voyager mission instruments>
response_a: <800-word summary covering instruments, mission timeline, and scientific findings>
response_b: <800-word summary that opens strong but spends 60% on instrument minutiae and never reaches the scientific findings section>

Expected output:

json

{
  "section_judgments": [
    {
      "section_label": "Mission overview",
      "response_a": {"groundedness": 5, "relevance": 5, "quality": 4},
      "response_b": {"groundedness": 5, "relevance": 5, "quality": 5}
    },
    {
      "section_label": "Instrument details",
      "response_a": {"groundedness": 5, "relevance": 4, "quality": 4},
      "response_b": {"groundedness": 5, "relevance": 3, "quality": 4}
    },
    {
      "section_label": "Scientific findings",
      "response_a": {"groundedness": 5, "relevance": 5, "quality": 4},
      "response_b": {"groundedness": 1, "relevance": 1, "quality": 1}
    }
  ],
  "long_form_failures": {
    "response_a": [],
    "response_b": ["lost_in_middle", "false_completeness"]
  },
  "aggregate_scores": {"response_a": 4.6, "response_b": 3.0},
  "verdict": "a",
  "decision_basis": "Response A covers all three implicit-goal sections evenly; Response B opens strong but skips scientific findings entirely — a major lost-in-middle failure that drops it well below A."
}

Failure Modes

Opening-paragraph dominance — judge weights the first paragraph too heavily. Mitigation: per-section scoring + explicit aggregate.
Length bias on long-form — judge prefers the longer response even when it's worse. The structured per-section approach helps but doesn't eliminate; track length-vs-winner correlation on benchmarks.
Failure mode under-detection — judge marks no failures even on responses that obviously skip half the input's topics. Sample low- scoring responses and verify failure mode list is non-empty.
Section labeling drift — judge invents section labels not in the responses. Constrain to actual visible structure or first / second halves.

Tuning Notes

模型差异：必须 frontier 模型 with strong long-context handling (Claude / Gemini Pro 1.5 / GPT-4 Turbo+). 中档模型在 long-form judging 上经常 collapse 到 length-bias 或 first-paragraph bias.
温度：0.0。
输入长度：long_input + 两个 response 加起来通常 >10K tokens. 选 model 时确认 context window. 超出时按章节切分跑多次再融合（每次一段 long_input + 对应的两个 response 段）.
与 rlhf/pairwise-preference-labeler 的关系：pairwise 是一般 case, 本卡是 long-form 专用. 短答用 pairwise; long-form (>500 tokens per response) 用本卡.
与 eval/multi-turn-dialogue-judge 的关系：那张卡判 turn-by-turn 对话; 本卡判 single long response. 多轮长对话两者结合用.
在 RLHF 数据建设中：long-form preference data 比 short-form 贵 10-50x（标注成本和判官成本都高）. 本卡是降本的方法之一: structured judging 让判官更可靠, 减少 noise 让训练效率更高.

Changelog

0.1.0 — Initial card.

Quick Use ​

Purpose ​

Prompt ​

Example ​

Failure Modes ​

Tuning Notes ​

Changelog ​