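🎯 Scenario: given two passages, synthesize a multi-hop evaluation question that can only be answered by combining both. Built specifically for constructing RAG eval sets, and meant to avoid pseudo multi-hop questions that can be guessed correctly from a single passage.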
Quick Use
Use when: You want to generate a multi-hop QA evaluation question from two related passages. Fill in: {{passage_a}} = first source passage; {{passage_b}} = second source passage covering a related but different fact. You'll get: A question that requires both passages to answer, plus the gold answer and supporting spans from each. Output is JSON.
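Filling the two placeholders can be sketched as follows. This is a minimal illustration, not part of the card itself: `TEMPLATE` is an abbreviated stand-in for the full prompt text, and `fill_prompt` is a hypothetical helper name.

```python
import re

# Abbreviated stand-in for the full prompt; only the two placeholders matter here.
TEMPLATE = "Passage A:\n{{passage_a}}\nPassage B:\n{{passage_b}}\n"

_PLACEHOLDER = re.compile(r"\{\{(passage_a|passage_b)\}\}")


def fill_prompt(template: str, passage_a: str, passage_b: str) -> str:
    """Substitute both placeholders in a single pass over the template."""
    values = {"passage_a": passage_a.strip(), "passage_b": passage_b.strip()}
    # A function replacement is applied to the template only, so inserted
    # passage text is never re-scanned for placeholders.
    return _PLACEHOLDER.sub(lambda m: values[m.group(1)], template)
```

A single-pass substitution is used instead of two chained `str.replace` calls so that a passage which happens to contain the literal text `{{passage_b}}` is not rewritten by accident.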
Purpose
Generate a multi-hop evaluation question that requires combining information from two different passages — neither passage alone is sufficient to answer. Used to build RAG eval sets that stress-test retrievers and graders beyond single-passage lookup. Output is a JSON record with the question, gold answer, and the specific spans from each passage that must be combined.
Prompt
You design multi-hop evaluation questions. Read the two passages and create
ONE question that:
1. Requires a fact from passage A AND a fact from passage B to answer.
2. Cannot be answered correctly using only one passage.
3. Has a single, unambiguous answer in 1–10 words.
Passage A:
{{passage_a}}
Passage B:
{{passage_b}}
Return ONLY a JSON object:
{
"question": "<the multi-hop question>",
"answer": "<short gold answer>",
"supporting_span_a": "<verbatim span from passage A that contributes>",
"supporting_span_b": "<verbatim span from passage B that contributes>",
"hop_relation": "<one short phrase describing how A and B are bridged, e.g. 'shared organization', 'temporal sequence'>"
}
Do not invent facts not present in the passages. If the passages cannot
support a true multi-hop question, return:
{ "question": null, "answer": null, "supporting_span_a": "", "supporting_span_b": "", "hop_relation": "insufficient" }
Example
Input:
passage_a: "Marie Curie won the Nobel Prize in Physics in 1903 alongside her husband Pierre and Henri Becquerel."
passage_b: "Henri Becquerel discovered spontaneous radioactivity in 1896 while studying uranium salts."
Expected output:
{
"question": "How many years after discovering spontaneous radioactivity did Henri Becquerel share the Nobel Prize in Physics?",
"answer": "7 years",
"supporting_span_a": "Nobel Prize in Physics in 1903 alongside her husband Pierre and Henri Becquerel",
"supporting_span_b": "Henri Becquerel discovered spontaneous radioactivity in 1896",
"hop_relation": "shared person, temporal sequence"
}
Failure Modes
- Pseudo multi-hop: the question can actually be answered from one passage alone and the other is decorative. Detect by re-running an answerer with each passage individually and checking whether either is sufficient.
- Hallucinated bridge: the model invents a connection not actually supported by the passages. The supporting_span_a / supporting_span_b requirement catches most of these, because the spans for an invented bridge tend to come back empty or paraphrased rather than verbatim.
- Degenerate questions: "What did X discover?" using only passage B. Mitigation: filter outputs where hop_relation == "insufficient" or where the spans look like single-passage answers.
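The checks above can be sketched as a small post-filter. This is an illustrative sketch under stated assumptions: the function names are hypothetical, `answerer` is any callable you supply that maps (question, context) to a predicted answer, and exact match after normalization is a deliberately strict stand-in for whatever answer comparison you actually use.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"question", "answer", "supporting_span_a",
                 "supporting_span_b", "hop_relation"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient comparison."""
    return " ".join(text.lower().split())


def structural_problems(raw: str, passage_a: str, passage_b: str) -> list[str]:
    """Return reasons to reject a raw model output; an empty list means it passes."""
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(rec, dict) or not REQUIRED_KEYS.issubset(rec):
        return ["missing required keys"]
    if rec["hop_relation"] == "insufficient" or rec["question"] is None:
        return ["model reported insufficient passages"]
    problems = []
    # Hallucinated bridges tend to show up as empty or paraphrased spans.
    for key, passage in (("supporting_span_a", passage_a),
                         ("supporting_span_b", passage_b)):
        span = rec[key]
        if not isinstance(span, str) or not span.strip() or span not in passage:
            problems.append(f"{key} is empty or not verbatim")
    answer = rec["answer"]
    if not isinstance(answer, str) or not 1 <= len(answer.split()) <= 10:
        problems.append("answer is not 1-10 words")
    return problems


# Assumed interface: (question, context) -> predicted answer string.
Answerer = Callable[[str, str], str]


def is_pseudo_multihop(rec: dict, passage_a: str, passage_b: str,
                       answerer: Answerer) -> bool:
    """True if either passage alone lets the answerer reproduce the gold answer."""
    gold = normalize(rec["answer"])
    return any(normalize(answerer(rec["question"], p)) == gold
               for p in (passage_a, passage_b))
```

Running `structural_problems` first is cheap and needs no model call; `is_pseudo_multihop` costs two answerer calls per record, so it makes sense to apply it only to records that already passed the structural check.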
Tuning Notes
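- Model differences: a strong model is required (GPT-4 class or Claude Sonnet and above). More than 90% of the multi-hop questions generated by 7B models are pseudo multi-hop.
- Temperature: 0.7 increases question diversity. After batch generation, run the single-passage check to weed out pseudo multi-hop questions.
- Data preparation: passage A and passage B should ideally share a light bridge, such as a common entity, time, or organization. Completely unrelated passage pairs force the model to fabricate a bridge.
- Output advice: generate 3–5 questions per passage pair and then filter. The retention rate is typically 30–50%.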
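The data preparation and retention notes above can be approximated with two small helpers. This is a rough sketch, not a prescribed method: `bridge_terms` is a hypothetical heuristic that treats shared capitalized words and four-digit years as candidate bridges, and it will miss lowercase entities and match sentence-initial words such as "The".

```python
import re

_YEAR = re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")
_CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def bridge_terms(passage_a: str, passage_b: str) -> set[str]:
    """Capitalized words and four-digit years that appear in both passages."""
    def terms(text: str) -> set[str]:
        return set(_CAPITALIZED.findall(text)) | set(_YEAR.findall(text))
    return terms(passage_a) & terms(passage_b)


def retention_rate(kept: int, generated: int) -> float:
    """Share of generated questions that survive filtering (0.0 if none generated)."""
    return kept / generated if generated else 0.0
```

A pair whose `bridge_terms` result is empty is a candidate for skipping before any generation call, since unrelated pairs push the model toward fabricating a bridge. Tracking `retention_rate` per batch gives a quick signal when it drifts well below the 30–50% range noted above.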
Changelog
0.1.0: Initial card.