Multi-Benchmark Leaderboard Builder

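🎯 Scenario: given a set of models' scores across multiple benchmarks, build a leaderboard with caveats: weighted ranking, per-benchmark rankings, identification of which benchmarks are differentiators, and a strength / weakness label on each entry. More reliable than a simple average.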

Quick Use

Use when: You have model results across multiple benchmarks and want a leaderboard with weighted overall ranking, per-benchmark rankings, and analysis of where models are strong / weak.
Fill in: {{model_results}} = JSON array of {model_name, benchmarks: {name: score}}; {{weights}} = optional benchmark weights.
You'll get: Overall ranking with weighted scores, per-benchmark ranking, model strength/weakness profile, and notes on differentiators. Output is JSON.

Purpose

Aggregate model scores across multiple benchmarks into a usable leaderboard. It goes beyond simple averaging: weighted aggregation, per-benchmark sub-rankings, identification of differentiators (the benchmarks that separate strong models from weak ones), and a per-model strength/weakness profile. Use it to summarize internal eval runs or to prepare external benchmark comparisons honestly.

Prompt

You build a leaderboard from multi-benchmark model results.

Model results:
{{model_results}}

Weights (may be empty for equal weights):
{{weights}}

Steps:
1. Normalize all scores to [0, 1]. If some benchmarks are reported
   on 0-100 and others on 0-1, divide the 0-100 scores by 100.

2. Apply weights — normalize weights to sum to 1. Compute
   weighted_score per model = sum(score_b × normalized_weight_b)
   for each benchmark b.

3. Rank models by weighted_score for the headline ranking. Compute
   per-benchmark sub-rankings.

4. For each model, identify "strengths" (benchmarks where it
   ranks in top 1-2) and "weaknesses" (benchmarks where it ranks
   in bottom 1-2). If only 2 models, skip this.

5. Identify "differentiator_benchmarks" — benchmarks where the
   score range across models is widest. These are the ones that
   actually distinguish; benchmarks where everyone scores ~similar
   carry less ranking signal.

6. Note any caveats — score ranges that are very narrow (saturated
   benchmark), missing scores, etc.

Return ONLY this JSON object:
{
  "leaderboard": [
    {
      "rank": <integer>,
      "model_name": "<name>",
      "weighted_score": <float>,
      "per_benchmark_rank": {"<benchmark>": <integer>},
      "strengths": ["<benchmark name>"],
      "weaknesses": ["<benchmark name>"]
    }
  ],
  "differentiator_benchmarks": [
    {"benchmark": "<name>", "score_range": <float>, "explanation": "<short>"}
  ],
  "saturated_benchmarks": ["<benchmarks where scores cluster too tightly to rank>"],
  "caveats": ["<short>"],
  "summary": "<one or two sentences on the headline finding>",
  "decision_basis": "<one sentence, <=30 words, no internal CoT>"
}
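
Because steps 1, 2, 3 and 5 of the prompt are plain arithmetic, they can be recomputed offline and compared against what the model returns. The following is a minimal sketch, not part of the card itself. The function names are hypothetical, the 0-100 detection is a heuristic assumption (any score above 1 marks the benchmark as percent-scaled), and it assumes every model has a score on every benchmark.

```python
from typing import Dict, List


def to_unit_scale(results: List[dict]) -> List[dict]:
    """Step 1. Heuristic (an assumption): a benchmark with any score above 1
    is treated as 0-100 and divided by 100."""
    percent = {b for r in results for b, s in r["benchmarks"].items() if s > 1}
    return [
        {"model_name": r["model_name"],
         "benchmarks": {b: s / 100 if b in percent else s
                        for b, s in r["benchmarks"].items()}}
        for r in results
    ]


def normalize_weights(benchmarks: List[str], weights: Dict[str, float]) -> Dict[str, float]:
    """Step 2. Empty weights mean equal weights; otherwise every benchmark
    needs an explicit weight (a KeyError flags a missing one)."""
    raw = {b: weights[b] if weights else 1.0 for b in benchmarks}
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {b: w / total for b, w in raw.items()}


def aggregate(results: List[dict], weights: Dict[str, float]) -> dict:
    """Steps 2, 3 and 5. Assumes every model has a score on every benchmark."""
    results = to_unit_scale(results)
    benchmarks = sorted(results[0]["benchmarks"])
    w = normalize_weights(benchmarks, weights)
    weighted = {r["model_name"]: sum(r["benchmarks"][b] * w[b] for b in benchmarks)
                for r in results}
    # Rank 1 is the highest score; ties keep input order because sorted() is stable.
    ranks = {r["model_name"]: {} for r in results}
    ranges = {}
    for b in benchmarks:
        ordered = sorted(results, key=lambda r: r["benchmarks"][b], reverse=True)
        for position, r in enumerate(ordered, start=1):
            ranks[r["model_name"]][b] = position
        ranges[b] = ordered[0]["benchmarks"][b] - ordered[-1]["benchmarks"][b]
    headline = sorted(weighted, key=weighted.get, reverse=True)
    return {"weighted": weighted, "headline": headline,
            "per_benchmark_rank": ranks, "score_range": ranges}
```

Calling `aggregate(model_results, weights)` with the same values substituted into the prompt gives reference numbers for `weighted_score`, `per_benchmark_rank` and `score_range`. Strength/weakness labels and saturation calls are left to the model here, since the prompt does not fix numeric thresholds for them.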

Example

Input:

model_results: [
  {"model_name": "ModelA", "benchmarks": {"factuality": 0.85, "reasoning": 0.72, "code": 0.90}},
  {"model_name": "ModelB", "benchmarks": {"factuality": 0.83, "reasoning": 0.85, "code": 0.65}},
  {"model_name": "ModelC", "benchmarks": {"factuality": 0.84, "reasoning": 0.70, "code": 0.55}}
]
weights: {"factuality": 1.0, "reasoning": 1.5, "code": 1.0}

Expected output:

{
  "leaderboard": [
    {"rank": 1, "model_name": "ModelA", "weighted_score": 0.809, "per_benchmark_rank": {"factuality": 1, "reasoning": 2, "code": 1}, "strengths": ["factuality", "code"], "weaknesses": []},
    {"rank": 2, "model_name": "ModelB", "weighted_score": 0.787, "per_benchmark_rank": {"factuality": 3, "reasoning": 1, "code": 2}, "strengths": ["reasoning"], "weaknesses": ["code"]},
    {"rank": 3, "model_name": "ModelC", "weighted_score": 0.697, "per_benchmark_rank": {"factuality": 2, "reasoning": 3, "code": 3}, "strengths": [], "weaknesses": ["code", "reasoning"]}
  ],
  "differentiator_benchmarks": [
    {"benchmark": "code", "score_range": 0.35, "explanation": "Code scores span 0.55-0.90; biggest discriminator across models."},
    {"benchmark": "reasoning", "score_range": 0.15, "explanation": "Moderate spread on reasoning."}
  ],
  "saturated_benchmarks": ["factuality"],
  "caveats": ["Factuality scores all cluster in 0.83-0.85 (range 0.02); close to saturation, may not distinguish models meaningfully.", "The 1.5x reasoning weight narrows ModelA's lead over ModelB (0.047 under equal weights, 0.021 weighted) but does not change the order."],
  "summary": "ModelA takes #1 by leading two of three benchmarks; the extra reasoning weight brings ModelB close but not ahead, and code is the main differentiator across the trio.",
  "decision_basis": "Weights normalized to sum to 1; rank order follows the weighted score; factuality is saturated and provides little signal."
}

Failure Modes

Saturation blindness — all models cluster on benchmark X but ranking pretends X distinguishes them. The saturated_benchmarks field is the safety net.
Weight misnormalization — weights summed wrong, weighted scores invalid. Validate at parse time.
Strength/weakness inflation — every model gets 5 strengths. Cap to top 1-2; not every dimension is a strength.
Missing-score handling — some models lack scores for some benchmarks. Should appear in caveats; verify.
Differentiator misranking — narrow-range benchmark labeled as differentiator. Verify score_range matches the actual range.
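
Several of these failure modes can be caught mechanically once the response is parsed. The sketch below is one possible audit, assuming the response follows the JSON schema from the prompt. The function name `audit_leaderboard`, the tolerance `tol`, and the cap `max_labels` are hypothetical choices, not values the card prescribes.

```python
from typing import Dict, List


def audit_leaderboard(output: dict, results: List[dict], weights: Dict[str, float],
                      tol: float = 1e-3, max_labels: int = 2) -> List[str]:
    """Return a list of problems; an empty list means no check fired."""
    problems = []
    by_name = {r["model_name"]: r["benchmarks"] for r in results}
    benchmarks = sorted({b for scores in by_name.values() for b in scores})

    # Weight misnormalization: recompute each weighted_score independently.
    raw = {b: weights.get(b, 0.0) if weights else 1.0 for b in benchmarks}
    total = sum(raw.values())
    if total <= 0:
        return ["weights do not sum to a positive number"]
    for entry in output["leaderboard"]:
        name = entry["model_name"]
        # Strength/weakness inflation: cap both lists.
        for field in ("strengths", "weaknesses"):
            if len(entry[field]) > max_labels:
                problems.append(f"{name}: too many {field}")
        scores = by_name[name]
        gaps = [b for b in benchmarks if b not in scores]
        if gaps:
            # Missing-score handling: the gap must at least be surfaced in caveats.
            if not output.get("caveats"):
                problems.append(f"{name}: missing {gaps}, no caveat")
            continue  # no recomputation without a stated imputation policy
        expected = sum(scores[b] * raw[b] / total for b in benchmarks)
        if abs(expected - entry["weighted_score"]) > tol:
            problems.append(f"{name}: weighted_score {entry['weighted_score']} "
                            f"!= {expected:.3f}")

    # Headline order must follow weighted_score, highest first.
    ordered = sorted(output["leaderboard"], key=lambda e: e["rank"])
    listed = [e["weighted_score"] for e in ordered]
    if listed != sorted(listed, reverse=True):
        problems.append("rank order does not follow weighted_score")

    # Differentiator misranking and saturation blindness.
    saturated = set(output.get("saturated_benchmarks", []))
    for d in output.get("differentiator_benchmarks", []):
        values = [s[d["benchmark"]] for s in by_name.values() if d["benchmark"] in s]
        if not values:
            problems.append(f"{d['benchmark']}: not present in the input")
            continue
        if abs((max(values) - min(values)) - d["score_range"]) > tol:
            problems.append(f"{d['benchmark']}: score_range does not match the data")
        if d["benchmark"] in saturated:
            problems.append(f"{d['benchmark']}: both differentiator and saturated")
    return problems
```

The audit deliberately does not decide which benchmarks are saturated, because that needs a threshold the card leaves open; it only flags a benchmark that is reported as both saturated and a differentiator.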

Tuning Notes

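Model differences: this card is mathematical aggregation plus a text summary; a mid-tier model is sufficient.
Temperature: 0.0; statistical aggregation must be reproducible.
Data scale: 3-20 models × 3-15 benchmarks is the sweet spot. Beyond that, split by dimension into several leaderboards.
Relation to eval/regression-detector: that card compares two versions; this card compares N models. The former is longitudinal (the same model across versions), the latter is cross-sectional.
Relation to eval/calibration-checker: that card audits confidence; this card aggregates accuracy. They are complementary: a model can rank first on the leaderboard and still be poorly calibrated.
Choosing weights: equal weights are the honest default baseline. Non-equal weights should have a business reason ("we care more about reasoning because the product scenario requires it"), not be picked so that a particular model wins.
Handling saturated_benchmarks: when reporting, state which benchmarks are saturated; otherwise the leaderboard misleads (score gaps that look real are actually noise).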
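
One way to keep weight choices honest is to report the equal-weight order next to the weighted one and say whether the weights changed it. This is a small sketch under the same input format as the prompt; `rank_order` and `weight_sensitivity` are hypothetical helper names, and scores are assumed to be complete and already on a common scale.

```python
from typing import Dict, List


def rank_order(results: List[dict], weights: Dict[str, float]) -> List[str]:
    """Model names sorted by weighted score; empty weights mean equal weights."""
    benchmarks = sorted(results[0]["benchmarks"])
    raw = {b: weights[b] if weights else 1.0 for b in benchmarks}
    total = sum(raw.values())
    score = {r["model_name"]: sum(r["benchmarks"][b] * raw[b] for b in benchmarks) / total
             for r in results}
    return sorted(score, key=score.get, reverse=True)


def weight_sensitivity(results: List[dict], weights: Dict[str, float]) -> dict:
    """Compare the weighted order against the equal-weight baseline."""
    baseline = rank_order(results, {})
    weighted = rank_order(results, weights)
    return {"equal_weight_order": baseline,
            "weighted_order": weighted,
            "order_changed": baseline != weighted}
```

If `order_changed` is true, the headline ranking depends on the weights, and the business reason for them belongs in the caveats.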

Changelog

0.1.0 — Initial card.
