Persona-Controlled Response Generator

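🎯 Scenario: Given an instruction and a persona definition, generate a response that fits that persona. Adds a layer of persona constraints on top of sft/response-generator; a standard building block for SFT data targeting product-grade chat models, multi-persona systems, and brand voice.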

Quick Use

Use when: You want responses that match a specific persona, brand voice, or character, for chat-product training data, multi-persona systems, or branded assistants.
Fill in: {{instruction}} = user instruction; {{persona_definition}} = paragraph defining the persona; {{persona_strictness}} = strict / balanced / loose.
You'll get: A response in the persona, a check that the persona was actually applied, and a flag if the persona conflicted with the content. Output is JSON.

Purpose

Generate a response that addresses an instruction WHILE staying in a defined persona. Used for training data where chat models need to adhere to a brand voice or character, and for multi-persona systems that switch personas dynamically. Distinct from sft/response-generator: that card optimizes for SFT-target quality without persona; this card adds persona constraints on top, with strictness controls for how to handle persona-vs-content conflict.

Prompt

You generate a response that BOTH addresses the instruction AND
stays in the defined persona.

Instruction:
{{instruction}}

Persona definition:
{{persona_definition}}

Persona strictness:
{{persona_strictness}}

Strictness meanings:
- "strict"   : Every dimension of the persona is enforced. If a
                persona constraint and the user's request conflict,
                the persona wins (e.g. persona says "never gives
                medical advice" — refuse even if asked).
- "balanced" : Persona shapes the response (voice, tone, formatting)
                but content correctness takes priority. Persona is
                bent slightly when needed for accuracy.
- "loose"    : Persona is a stylistic suggestion. If the persona
                pushes against good content, content wins.

Rules:
1. The response must address the instruction.
2. The response must apply the persona's voice / tone / format /
   prohibited_things specified in persona_definition.
3. If a conflict arises, follow the strictness setting:
   - strict: persona wins; may refuse the instruction.
   - balanced: produce the answer in adapted form; flag the bend.
   - loose: produce the best answer; flag any persona violations.
4. Verify your response actually applies the persona — if you'd
   produce the same text without the persona, you didn't use it.

Return ONLY this JSON object:
{
  "response": "<the persona-applied response>",
  "persona_applied_dimensions": ["<list of dimensions actively shaped, e.g. 'tone (warm)', 'format (no bullets)'>"],
  "persona_content_conflict": {
    "detected": true | false,
    "conflict_description": "<if detected: short description; else empty>",
    "resolution": "persona_won" | "content_won" | "compromised" | "none"
  },
  "would_refuse": true | false,
  "refusal_reason": "<if would_refuse=true under strict mode: which persona constraint triggered; else empty>",
  "decision_basis": "<one sentence, <=30 words, no internal CoT>"
}
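
The template and the JSON contract above can be wired together with a small helper. The following is a minimal sketch, not the card's official tooling: `fill_template` and `validate_output` are hypothetical names, and the checks cover only what the prompt states (the three placeholders, the required keys, the `resolution` enum, and the 30-word limit on `decision_basis`).

```python
import json
import re

ALLOWED_STRICTNESS = {"strict", "balanced", "loose"}
ALLOWED_RESOLUTIONS = {"persona_won", "content_won", "compromised", "none"}
REQUIRED_KEYS = {
    "response", "persona_applied_dimensions", "persona_content_conflict",
    "would_refuse", "refusal_reason", "decision_basis",
}


def fill_template(template: str, instruction: str,
                  persona_definition: str, persona_strictness: str) -> str:
    """Substitute the three {{...}} placeholders used by the prompt."""
    if persona_strictness not in ALLOWED_STRICTNESS:
        raise ValueError(f"unknown strictness: {persona_strictness!r}")
    values = {
        "instruction": instruction,
        "persona_definition": persona_definition,
        "persona_strictness": persona_strictness,
    }
    # Single pass: "{{...}}" text inside a value is not expanded again.
    # An unknown placeholder name raises KeyError.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)


def validate_output(raw: str) -> list[str]:
    """Return a list of contract violations; an empty list means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]

    errors = []
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        errors.append("missing keys: " + ", ".join(sorted(missing)))

    conflict = data.get("persona_content_conflict")
    if not isinstance(conflict, dict):
        errors.append("persona_content_conflict must be an object")
    else:
        if not isinstance(conflict.get("detected"), bool):
            errors.append("detected must be true or false")
        if conflict.get("resolution") not in ALLOWED_RESOLUTIONS:
            errors.append("unknown resolution value")

    if not isinstance(data.get("would_refuse"), bool):
        errors.append("would_refuse must be true or false")
    if len(str(data.get("decision_basis", "")).split()) > 30:
        errors.append("decision_basis exceeds 30 words")
    return errors
```

Rejecting or retrying any output for which `validate_output` returns a non-empty list keeps malformed records out of the dataset before the heavier persona checks run.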

Example

Input:

instruction: "What's a good restaurant for date night?"
persona_definition: "Voice: dry and direct, no enthusiasm. Speaks like a worldly bartender. Never recommends a specific restaurant by name (compliance constraint — we can't endorse). Always asks one clarifying question if location is unspecified. Avoids exclamation marks."
persona_strictness: "strict"

Expected output:

{
  "response": "Depends on the city and what 'good' means to you — atmosphere over food, or vice versa. I won't recommend specific places (we don't endorse), but I'll tell you what to look for: small enough that the kitchen knows what it's doing, dim enough that you have to lean in, and a wine list that isn't trying to impress. Where are you?",
  "persona_applied_dimensions": ["voice (dry, no enthusiasm)", "format (no exclamation marks)", "prohibited_things (no specific restaurant names)", "format (asks clarifying question on location)"],
  "persona_content_conflict": {
    "detected": true,
    "conflict_description": "User wants a restaurant recommendation; persona forbids naming specific places.",
    "resolution": "persona_won"
  },
  "would_refuse": false,
  "refusal_reason": "",
  "decision_basis": "Persona's anti-endorsement constraint applied (no restaurant names), but provided actionable criteria; clarifying question for location."
}
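
Two of the persona constraints in this example (no exclamation marks, one clarifying question) are mechanical enough to check with plain string tests. The sketch below is illustrative only: `check_mechanical_constraints` is a hypothetical helper, and the voice and the no-named-restaurant constraint still need a judge model or human review.

```python
def check_mechanical_constraints(response: str) -> dict[str, bool]:
    """Check the string-level persona constraints from the example persona."""
    text = response.strip()
    return {
        # "Avoids exclamation marks."
        "no_exclamation_marks": "!" not in text,
        # "Always asks one clarifying question if location is unspecified."
        # Approximated here as: the response contains exactly one question mark.
        "asks_one_question": text.count("?") == 1,
    }


# Abridged from the example response above.
sample = "I won't recommend specific places, but I'll tell you what to look for. Where are you?"
print(check_mechanical_constraints(sample))
```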

Failure Modes

Persona theater — model uses persona-flavored wording without shaping actual content (just adds "buddy" / drops emojis but doesn't change information selection). Detect by sampling outputs where persona_applied_dimensions is short or generic.
Persona ignored under loose — model treats loose as "ignore persona". Loose still expects best-effort persona application; track persona-applied dimension counts under each strictness.
Persona refuses helpful tasks — under strict, model refuses reasonable tasks because of overly-broad persona constraints. Audit would_refuse cases against the persona's actual prohibited_things.
Conflict detection blindness — persona forbids specifics, user asks for specifics, model produces specifics anyway. The persona_content_conflict.detected field is the safety net; verify cases where persona has hard constraints but detected=false.
Bend-then-pretend — model produces a softened persona application but reports resolution: persona_won. Audit resolution claims against actual response.
Tone drift mid-response — first sentence is in-persona, last paragraph drifts. Spot-check long responses; tone should hold through the whole response.
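
Several of the audits above can be pre-filtered from the JSON fields alone, before anyone reads the responses. This is a rough sketch under stated assumptions: `audit_record`, `mean_dimensions_by_strictness`, the flag names, and the `min_dimensions` threshold of 2 are all hypothetical, and a flag only marks a record for review rather than proving a failure.

```python
from collections import defaultdict


def audit_record(record: dict, persona_has_hard_constraints: bool,
                 min_dimensions: int = 2) -> list[str]:
    """Return review flags for one generated record."""
    flags = []
    dims = record.get("persona_applied_dimensions", [])
    conflict = record.get("persona_content_conflict", {})
    detected = conflict.get("detected", False)
    resolution = conflict.get("resolution", "none")

    # Persona theater: a short dimension list suggests surface-level styling.
    if len(dims) < min_dimensions:
        flags.append("possible_persona_theater")
    # Conflict detection blindness: hard constraints exist, none reported.
    if persona_has_hard_constraints and not detected:
        flags.append("check_conflict_detection")
    # Persona refuses helpful tasks: every refusal gets a second look.
    if record.get("would_refuse"):
        flags.append("review_refusal")
    # The two conflict fields should agree with each other.
    if detected == (resolution == "none"):
        flags.append("inconsistent_conflict_fields")
    # Bend-then-pretend: persona_won claims are compared with the response.
    if resolution == "persona_won":
        flags.append("verify_persona_won_claim")
    return flags


def mean_dimensions_by_strictness(rows: list[tuple[str, dict]]) -> dict[str, float]:
    """Average persona_applied_dimensions count per strictness setting."""
    counts = defaultdict(list)
    for strictness, record in rows:
        counts[strictness].append(len(record.get("persona_applied_dimensions", [])))
    return {name: sum(vals) / len(vals) for name, vals in counts.items()}
```

A loose-mode average near zero in `mean_dimensions_by_strictness` would point at the "persona ignored under loose" failure. Tone drift is not covered here, since it requires reading the full response.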

Tuning Notes

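Model differences: a frontier model is required. Mid-tier models often degrade to superficial persona application (swapping the opening line) while leaving content selection unchanged.
Temperature: 0.4–0.7; persona application needs room for writing flexibility.
Rules of thumb for choosing strictness:
- strict: compliance-sensitive settings (medical / legal / financial advice / anything the brand does not allow)
- balanced: customer-service / education product chat (the persona sets the backdrop; content correctness comes first)
- loose: internal tools / debugging / experimental chat (don't let the persona hurt efficiency)
Relationship to rlhf/persona-consistency-judge: this card generates the persona response; that card judges whether the generated response actually fits the persona. Data-building loop: generate → judge → keep the high scorers.
Relationship to sft/response-generator: use that card when the task has no persona constraint (it is more general); use this card when there is an explicit persona.
Relationship to rlhf/iterative-dpo-pair-generator: this card can be run across N personas, each producing its own response, to serve as DPO training pairs for different persona styles.
High-sensitivity personas (compliance personas, safety personas): never use loose mode; under loose the compliance constraint loses its meaning.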
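
The generate → judge → keep loop and the rule against loose mode for high-sensitivity personas can be sketched as follows. Everything here is an assumption for illustration: `generate` and `judge` stand in for calls to this card and to rlhf/persona-consistency-judge, the task dictionary keys are hypothetical, and the 0.8 threshold assumes a judge score in the range 0 to 1, which the cards do not specify.

```python
from typing import Callable


def build_dataset(tasks: list[dict],
                  generate: Callable[[dict], dict],
                  judge: Callable[[dict, dict], float],
                  min_score: float = 0.8) -> list[dict]:
    """Generate one record per task and keep those the judge scores highly."""
    kept = []
    for task in tasks:
        # High-sensitivity personas must not run under loose strictness.
        if task.get("high_sensitivity") and task["persona_strictness"] == "loose":
            raise ValueError("high-sensitivity persona cannot use loose mode")
        record = generate(task)          # this card
        score = judge(task, record)      # the persona-consistency judge card
        if score >= min_score:
            kept.append({"task": task, "record": record, "judge_score": score})
    return kept
```

Passing the two model calls in as functions keeps the loop testable with stubs and leaves the choice of model client open.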

Changelog

0.1.0 — Initial card.
