prompt-atlas Index
Auto-generated by `scripts/build_index.py`. Do not edit by hand.
Total cards: 100
Cards by direction
RAG (rag)
| Card | Use when | Status | Tags |
|---|---|---|---|
| Answer Grounding Checker (hallucination detector) | You want to detect hallucinations in a RAG answer by checking each claim against the retrieved context | stable | grounding, factuality, scoring, structured-output, eval-set |
| Chunk Summarizer for Retrieval | You're building a RAG index and want to store a search-friendly summary alongside (or instead of) the raw chunk text | stable | retrieval, generation, structured-output, synthesis |
| Citation Faithfulness Scorer | You want to audit whether a citation actually supports the claim it was attached to | stable | citation, factuality, scoring, grounding, structured-output |
| Retrieved Context Compression | Your retriever returns more text than fits comfortably in the LLM's context window, or you want to focus the model's attention on the spans actually relevant to the question | stable | retrieval, generation, synthesis, structured-output |
| Conversational Query Resolver (rewrite follow-ups as standalone) | You're running RAG in a multi-turn conversation and the latest user turn references earlier context (pronouns, "that one", "what about...") — direct retrieval would fail | stable | query-rewriting, retrieval, structured-output, generation |
| HyDE — Hypothetical Answer Generator for Retrieval | You want to generate a hypothetical answer to embed and use as a search vector (HyDE technique) | experimental | retrieval, query-rewriting, generation, synthesis |
| Multi-Source Answer Aggregator (with conflict surfacing) | You have multiple retrieved sources and need to compose an answer that handles conflicts, complements, and redundancies between them — instead of cherry-picking one | stable | synthesis, retrieval, citation, structured-output |
| Multi-hop RAG Eval Question Synthesizer | You want to generate a multi-hop QA evaluation question from two related passages | experimental | multi-hop, synthesis, eval-set, generation |
| Query Fusion (combine sub-query results into ranked set) | You decomposed a query into sub-queries, retrieved separately, and now need to fuse the per-sub-query result sets into one deduplicated, ranked set | stable | retrieval, synthesis, structured-output, ranking |
| Query Rewriting and Decomposition for Retrieval | You want to split a single complex query into focused sub-queries before retrieval | stable | query-rewriting, retrieval, decomposition, structured-output |
| Retrieval Relevance Evaluator | You want to score whether a retrieved passage is relevant to a search query | stable | retrieval, scoring, eval-set, grounding |
| Structured RAG Output Builder (table / list / schema from evidence) | The user's question implies a STRUCTURED answer (comparison table, list, fielded record) and you want the answer in that exact shape with citations, not free prose | stable | synthesis, retrieval, structured-output, extraction |
| Time-Aware Retrieval Query Rewriter | Your RAG handles queries with time-relative phrases ("latest", "last quarter", "this year") and you need to resolve them into concrete time bounds before retrieval | stable | query-rewriting, retrieval, structured-output, generation |
Agent (agent)
| Card | Use when | Status | Tags |
|---|---|---|---|
| API Result to User-Readable Translator | Your agent called an API and got back structured data; you need to translate it into a user-readable answer that addresses the original question | stable | generation, structured-output, tool-use |
| API Spec to Tool Catalog Converter | You have an OpenAPI / Swagger / JSON Schema spec and want a tool catalog ready to paste into an agent loop, without hand-writing each tool | stable | tool-use, extraction, generation, structured-output |
| Budget-Aware Agent Planner | You're running an agent under explicit token or dollar budget and need a plan that completes the goal within budget — not a maximalist plan that overshoots | experimental | planning, decomposition, structured-output |
| Clarification Question Asker | Your agent receives an ambiguous goal and you need it to decide whether to proceed, ask one good clarifying question, or refuse | stable | planning, classification, structured-output |
| Error Recovery Strategy (retry / abort / escalate) | An agent operation just failed (tool call, API call, database query) and you need to decide whether to retry, abort the goal, or escalate to a human | stable | reflection, planning, structured-output |
| Long-Context Trajectory Memory Summarizer | Your agent's trajectory is approaching the context window limit and you need to compress earlier history into a structured memory record | stable | memory, generation, structured-output, decomposition |
| Multi-Agent Conflict Resolver | You're orchestrating multiple agents (sub-task delegation, parallel reasoning) and they produced conflicting outputs that need reconciliation before continuing | experimental | planning, classification, structured-output, decomposition |
| Plan-and-Execute Upfront Planner | Your agent's goal is predictable enough to plan upfront instead of step-by-step (linear dependencies, known sub-problems) | stable | planning, decomposition, structured-output |
| ReAct Planner with Strict Tool Call Schema | You want an agent to emit one strict-JSON tool call per step in a ReAct loop, with a visible reasoning summary | stable | planning, tool-use, react, structured-output, decomposition |
| Self-Critique Reflection Step for Agents | Your agent has taken several steps and you want a meta-level "are we on track" check before continuing | stable | reflection, self-check, planning, structured-output |
| Sub-Task Delegator (multi-agent prep) | A user task is too complex for one agent and you want to split it across specialized workers / agents (multi-agent foundation) | experimental | planning, decomposition, structured-output |
| Tool-Call Repair from Validation Error | A tool call failed schema validation and you want to repair it before escalating to a strategy reflection | stable | tool-use, structured-output, extraction |
| Tool Output Summarizer (compress before context) | A tool returned verbose output (JSON blob, file listing, search results, long fetch response) and you want to compress it down to what the agent actually needs for the next step | stable | generation, memory, structured-output |
RLHF (rlhf)
| Card | Use when | Status | Tags |
|---|---|---|---|
| Best-of-N Response Selector | You have N candidate responses to the same prompt and need to pick the best (for inference-time best-of-N or for RLHF preference data) | stable | ranking, scoring, helpfulness, harmlessness, structured-output |
| Constitutional Critique-and-Revise | You want Constitutional AI training data (or to clean up a response at inference) by critiquing it against principles and revising | stable | harmlessness, helpfulness, honesty, generation, structured-output, self-check |
| Helpfulness vs Harmlessness Tradeoff Scorer | You suspect a model's response sacrificed helpfulness for caution OR was helpful but unsafe — the inherent HHH tradeoff axis. Critical for diagnosing over-aligned / under-aligne... | experimental | helpfulness, harmlessness, scoring, structured-output |
| Iterative DPO Pair Generator | You're doing iterative DPO and need to generate (chosen, rejected) pairs targeting a specific behavioral principle, using the current model's response as the rejected baseline | experimental | preference-labeling, pairwise, generation, helpfulness, structured-output |
| Long-Context Pairwise Preference Labeler | You're labeling preferences on long-form pairs (long context input + long-form responses, e.g. research summaries, long-form Q&A, multi-turn dialogue) where short-answer pairwis... | experimental | preference-labeling, pairwise, structured-output |
| Pairwise Preference Labeler (HHH dimensions) | You have two AI responses to the same prompt and want a preference label across helpful / harmless / honest dimensions | stable | preference-labeling, pairwise, helpfulness, harmlessness, honesty, scoring |
| Persona Consistency Judge | You're training or evaluating a model that should adhere to a defined persona / character / brand voice, and need to detect drift from that persona | stable | llm-judge, scoring, structured-output, classification |
| Pointwise Reward Scorer (single response → reward signal) | You only have a single response (no comparison pair) and need a scalar reward signal for reward model training data | experimental | reward-modeling, scoring, helpfulness, harmlessness, honesty, structured-output |
| Preference Rationalization Judge | You're auditing the quality of preference labels in your RLHF dataset and want to detect rationales that don't actually justify the labeler's pick — sign of noisy or rushed labe... | experimental | llm-judge, scoring, classification, structured-output |
| Red-Team Prompt Generator (defensive safety probes) | You're building a safety eval set or RLHF refusal-training dataset and need to probe a model's refusal behavior on a specific harm category. Defensive use only. | experimental | safety, harmlessness, generation, structured-output |
| Refusal Calibration Probe | You're evaluating whether a model refuses appropriately — neither over-refusing benign requests nor under-refusing genuinely unsafe ones. Critical for RLHF refusal-training data... | experimental | harmlessness, scoring, classification, structured-output |
| Reward Hacking Detector | You suspect a post-RLHF model is gaming the reward signal — looking good to the reward model but providing low actual value to users. Critical pre-launch diagnostic | experimental | reward-modeling, scoring, classification, structured-output |
SFT (sft)
| Card | Use when | Status | Tags |
|---|---|---|---|
| Code SFT Pair Generator | You're building code SFT training data and need (instruction, response) pairs at controlled difficulty in a target language | stable | instruction-tuning, generation, data-augmentation, structured-output |
| Multi-Turn Conversation SFT Data Generator | You're training a chat model and need multi-turn conversation SFT data, not just single-turn QA pairs | experimental | instruction-tuning, generation, structured-output, data-augmentation |
| SFT Data Coverage Analyzer | You have an SFT dataset (or a sample of it) and want to know whether it covers the topics / skills it's supposed to, or whether it's lopsided | experimental | classification, scoring, structured-output, instruction-tuning |
| SFT Data Quality Filter | You have candidate (instruction, response) SFT pairs and want to filter them by quality before training | stable | instruction-tuning, scoring, classification, structured-output, safety |
| Few-Shot Example Selector (pick best K demonstrations from a pool) | You have a pool of (instruction, response) demonstrations and want to pick the best K for few-shot prompting a specific target query | stable | instruction-tuning, classification, ranking, structured-output |
| SFT Instruction Deduplicator | You have an SFT instruction dataset and want to find near-duplicates at the SEMANTIC level (paraphrases, synonymous tasks) — not just exact-string duplicates | stable | classification, instruction-tuning, structured-output |
| Instruction Difficulty Classifier | You're building curriculum training data, stratifying a benchmark, or selecting active-learning candidates and need a per-instruction difficulty label calibrated to a target mod... | stable | classification, scoring, instruction-tuning, structured-output |
| Instruction Variant Expander (seed → diverse rewrites) | You want to rewrite ONE instruction into N variants that preserve the underlying task but vary surface form, register, or style | stable | instruction-tuning, seed-expansion, data-augmentation, generation |
| Persona-Controlled Response Generator | You want responses that match a specific persona / brand voice / character — for chat-product training data, multi-persona systems, or branded assistants | stable | instruction-tuning, generation, structured-output |
| SFT Response Generator (instruction → high-quality response) | You need to produce the response half of an SFT pair given an instruction | stable | instruction-tuning, generation, structured-output |
| Self-Instruct — Generate New Instructions from a Seed Bank | You have a small bank of seed instructions and want to generate NEW instructions in the same task family (Self-Instruct technique) | stable | instruction-tuning, seed-expansion, generation, data-augmentation, structured-output |
| Style Transfer (rewrite text in target style) | You want to rewrite text into a specific style (formal/casual, terse/elaborate, persona-flavored) while controlling whether semantic meaning must be preserved exactly | stable | generation, data-augmentation, structured-output |
Multimodal (multimodal)
| Card | Use when | Status | Tags |
|---|---|---|---|
| Chart and Table Extractor | You have an image of a chart, plot, or table (e.g. from a paper, dashboard screenshot, slide deck) and want the data as a structured object | stable | vision, extraction, structured-output, vlm-eval |
| Diagram to Structured Data | You have an image of a diagram (flowchart, architecture, ER, sequence, etc.) and want to extract its structure as nodes + edges, not free text | stable | vision, extraction, structured-output |
| Document Layout Analyzer | You want to understand a document page's STRUCTURE (where headers, body, tables, images live; what reading order is) — not extract specific fields | stable | vision, extraction, structured-output, vlm-eval |
| Handwriting Transcriber with Per-Word Confidence | You have an image of handwritten text (notes, forms, whiteboard, captured letters) and want a transcription with per-word confidence so downstream code can flag words for review | experimental | vision, ocr, extraction, structured-output |
| Custom-Category Image Classification | You want to classify images into your own custom categories (not a fixed pretrained label set) — content moderation, product catalog tagging, support-ticket image routing | stable | vision, classification, structured-output |
| Image Pair Comparison Explainer | You have two images and want a structured explanation of their similarities and/or differences (UI A/B comparison, product photo comparison, design variant analysis) | stable | vision, comparative, structured-output |
| Image Edit Instruction Generator (before/after to instruction) | You have a before/after image pair and want to generate the natural-language edit instruction that would produce the change — for image-edit-model training data, design diffs, o... | experimental | vision, generation, structured-output |
| OCR + Structured Extraction from Document Images | You want to extract a fixed set of typed fields from a document image (receipt, invoice, form, ID page) | stable | vision, ocr, extraction, structured-output |
| UI Screenshot to Component Spec | You have a UI screenshot (web/mobile/wireframe) and want a structured component spec — component tree, layout, interactions — instead of free-text description or raw code | experimental | vision, extraction, structured-output |
| Structured Image Caption Generator | You want a structured caption for an image — discrete fields like scene, subject, objects, action — instead of free-form text | stable | vision, image-description, generation, structured-output, extraction |
| VLM Image Description Verifier | You have a candidate image caption and want to audit which claims actually match the image | experimental | vision, image-description, vlm-eval, factuality, scoring |
| Visual Question Answering with Grounding and Confidence | You want to answer a question about an image AND know whether the image actually supports the answer (with grounding region and confidence) | stable | vision, vlm-eval, structured-output, factuality, scoring |
Chain-of-Thought (cot)
| Card | Use when | Status | Tags |
|---|---|---|---|
| Citation-Grounded Reasoning (every claim must cite) | You need reasoning where EVERY factual claim cites a source — for academic, legal, medical, or compliance contexts where unsourced claims are unacceptable | stable | structured-reasoning, citation, grounding, structured-output |
| Contrastive Self-Consistency (compare against intentionally-wrong) | A question has plausible wrong reasoning paths (common confusions, popular misconceptions) and you want the model to actively contrast its answer against the wrong one for stron... | experimental | self-check, structured-reasoning, scoring, structured-output |
| Least-to-Most Decomposition | A complex problem can be solved by breaking it into a chain of strictly easier sub-problems where each can use earlier answers | stable | decomposition-cot, structured-reasoning, structured-output |
| Meta-Prompt Generator (generate prompts for a class of tasks) | You're starting a new prompt-engineering task and want a meta-prompt template generated from a description + examples, instead of writing it from scratch | experimental | generation, structured-output, instruction-tuning |
| Plan Critique and Revise | You've generated a plan (least-to-most decomposition, agent plan, multi-step reasoning) and want to critique-and-revise before execution to catch issues cheaply | stable | self-check, structured-reasoning, structured-output, decomposition-cot |
| Self-Consistency Aggregator (majority vote over reasoning paths) | You've sampled N candidate answers to the same question (with temperature) and want to take a majority vote | stable | structured-reasoning, self-check, rationale-summary, structured-output |
| Self-Correction Protocol (accept / correct / reject) | You have a candidate answer and external criticism (from another model, a human reviewer, or a rule check) and need to decide whether to accept, correct, or reject the candidate | stable | self-check, structured-reasoning, structured-output, classification |
| Step-Back Prompting (abstract first, then solve) | A question's surface details might mislead direct reasoning, and reasoning from a more general principle would be more reliable | stable | structured-reasoning, decomposition-cot, structured-output |
| Structured Reasoning with Rationale Summary | You want the model to decompose its reasoning into named sub-steps and emit a summary rationale (not hidden chain-of-thought) | stable | structured-reasoning, rationale-summary, decomposition-cot, structured-output |
| Tree-of-Thoughts (branch + evaluate + prune) | A problem has multiple plausible reasoning paths and a single linear chain might miss the right one — combinatorial planning, search, design problems with trade-offs | experimental | decomposition-cot, self-check, structured-reasoning, structured-output |
| Reasoning with Explicit Uncertainty Quantification | You need not just an answer but a calibrated sense of which parts of the reasoning are solid vs guessed — for high-stakes decisions, scientific Q&A, or claims with downstream co... | experimental | structured-reasoning, self-check, scoring, structured-output |
| Verify-Then-Finalize (self-check before commit) | A task is error-prone (math, units, edge cases) and you want a draft + explicit verification before committing to the final answer | stable | self-check, structured-reasoning, factuality, structured-output |
Evaluation (eval)
| Card | Use when | Status | Tags |
|---|---|---|---|
| Calibration Checker (predicted confidence vs actual accuracy) | You have a batch of model outputs with both predicted confidence AND actual correctness labels, and you want to check whether the confidence is calibrated (high-confidence outpu... | stable | llm-judge, scoring, comparative, structured-output |
| Human Eval Study Bootstrap | You're standing up a human eval study for a task and want a structured study design (rubric, annotator instructions, sample size guidance, analysis plan) instead of figuring it... | experimental | llm-judge, rubric, structured-output |
| LLM Judge Bias Probe (length / position / format / verbosity) | You're using an LLM as a judge in production and want to verify it doesn't have systemic bias on length / position / format dimensions before trusting its scores | experimental | llm-judge, scoring, classification, structured-output |
| Multi-Benchmark Leaderboard Builder | You have model results across multiple benchmarks and want a leaderboard with weighted overall ranking, per-benchmark rankings, and analysis of where models are strong / weak | stable | comparative, scoring, structured-output |
| LLM-as-Judge Rubric for Open-Ended Outputs | You want a structured quality assessment of a single AI output across factuality / instruction-following / coherence / completeness | stable | llm-judge, rubric, holistic, scoring, factuality, coherence |
| Multi-Turn Dialogue Judge | You're evaluating a chat model and need to judge a multi-turn conversation, not just a single response | stable | llm-judge, rubric, scoring, holistic, structured-output, coherence |
| Pairwise Judge with Position-Bias Probe | You're running pairwise LLM-as-judge evaluation and want to detect / control for the well-known position bias (judge prefers whichever response is shown first) | stable | llm-judge, pairwise, comparative, scoring, structured-output |
| Per-claim Factuality Judge (atomic decomposition) | You want fine-grained factuality labels (true / false / unverifiable) for every atomic factual claim in an AI output | stable | llm-judge, factuality, scoring, structured-output, extraction |
| Pointwise Quality Scorer with Confidence | You want to score a single AI output on YOUR custom dimensions (not a fixed rubric) with self-reported confidence | stable | llm-judge, scoring, holistic, structured-output, coherence |
| Reference-based Judge (output vs gold) | You're scoring closed-form outputs (short-answer QA, math, structured extraction) against a known gold answer | stable | llm-judge, scoring, factuality, comparative, structured-output |
| Output-Level Regression Detector | You're testing a candidate model / prompt change against a baseline and need to detect quality regressions on specific dimensions, not just an overall vibe check | stable | llm-judge, scoring, comparative, structured-output |
| Domain-Specific Rubric Generator | You're starting a new evaluation task and want a structured rubric (with concrete level descriptions per dimension) instead of writing it by hand | experimental | rubric, generation, structured-output |
| Safety Output Classifier (defensive) | You want to classify whether an AI output should be allowed, reviewed, or blocked along an explicit harm taxonomy (defensive use only) | stable | safety, harmlessness, classification, llm-judge, structured-output |
Code (code)
| Card | Use when | Status | Tags |
|---|---|---|---|
| API Design Reviewer (REST / GraphQL / gRPC) | You're reviewing an API design (not implementation) and want structured findings on consistency, ergonomics, evolvability, security, and performance — calibrated to the API style | stable | code-review, scoring, structured-output |
| Code Evaluation Judge | You're evaluating AI-generated or contributor-submitted code against a task description (and optionally a reference solution + test cases) | stable | llm-judge, scoring, factuality, structured-output |
| Code Explanation Generator (audience-aware) | You want to explain a piece of code to a specific audience (new hire / PM / domain expert) at the right level — not too basic, not too jargon-heavy | stable | documentation, generation, structured-output |
| Code Review Checklist (structured findings) | You want a structured code review with per-dimension findings instead of a free-text "this looks good" reply | stable | code-review, scoring, structured-output, classification |
| Code Diff Summary for Pull Request | You have a git diff and want a structured PR description (summary, change list, risks, test suggestions) instead of free-form prose | stable | documentation, generation, structured-output |
| Code Translation Across Languages | You want to translate code from one language to another, with explicit control over how aggressively to adopt the target language's idioms | stable | generation, structured-output, extraction |
| Conventional Commit Message Generator | You're writing a commit message and want it generated from the diff in a specific style — conventional commits, simple imperative, or verbose with body | stable | documentation, generation, structured-output |
| Dependency Impact Analyzer | You're planning to change a function signature / API contract / shared type and want to know what breaks before you start | experimental | extraction, classification, structured-output |
| Error Message / Stack Trace Explainer | You have a confusing error message / stack trace and want a structured explanation calibrated to a specific audience (junior dev / senior / PM) | stable | documentation, generation, structured-output |
| Code Migration Plan Generator | You're planning a major version migration (framework upgrade, runtime upgrade, API spec migration) and want a phased plan based on your actual code rather than generic upgrade-g... | stable | code-review, generation, structured-output, decomposition |
| Refactor Suggestion (with rationale and diff hint) | You have working code that's not optimal on a specific axis (readability, performance, testability, modularity, type safety) and want concrete refactor suggestions with rationale | stable | code-review, generation, structured-output |
| Code Security Review (focused) | You want a focused security review (not generic code review) — looking specifically for vulnerabilities given a threat model | stable | code-review, scoring, classification, structured-output |
| Test Case Generator | You want to generate test cases for a function or class with explicit coverage of happy path, edge cases, and error handling | stable | test-generation, generation, structured-output |
Cards by tag
- citation — Citation-Grounded Reasoning (every claim must cite), Citation Faithfulness Scorer, Multi-Source Answer Aggregator (with conflict surfacing)
- classification — Clarification Question Asker, Multi-Agent Conflict Resolver, Code Review Checklist (structured findings), Dependency Impact Analyzer, Code Security Review (focused), Self-Correction Protocol (accept / correct / reject), LLM Judge Bias Probe (length / position / format / verbosity), Safety Output Classifier (defensive), Custom-Category Image Classification, Persona Consistency Judge, Preference Rationalization Judge, Refusal Calibration Probe, Reward Hacking Detector, SFT Data Coverage Analyzer, SFT Data Quality Filter, Few-Shot Example Selector (pick best K demonstrations from a pool), SFT Instruction Deduplicator, Instruction Difficulty Classifier
- code-review — API Design Reviewer (REST / GraphQL / gRPC), Code Review Checklist (structured findings), Code Migration Plan Generator, Refactor Suggestion (with rationale and diff hint), Code Security Review (focused)
- coherence — LLM-as-Judge Rubric for Open-Ended Outputs, Multi-Turn Dialogue Judge, Pointwise Quality Scorer with Confidence
- comparative — Calibration Checker (predicted confidence vs actual accuracy), Multi-Benchmark Leaderboard Builder, Pairwise Judge with Position-Bias Probe, Reference-based Judge (output vs gold), Output-Level Regression Detector, Image Pair Comparison Explainer
- data-augmentation — Code SFT Pair Generator, Multi-Turn Conversation SFT Data Generator, Instruction Variant Expander (seed → diverse rewrites), Self-Instruct — Generate New Instructions from a Seed Bank, Style Transfer (rewrite text in target style)
- decomposition — Budget-Aware Agent Planner, Long-Context Trajectory Memory Summarizer, Multi-Agent Conflict Resolver, Plan-and-Execute Upfront Planner, ReAct Planner with Strict Tool Call Schema, Sub-Task Delegator (multi-agent prep), Code Migration Plan Generator, Query Rewriting and Decomposition for Retrieval
- decomposition-cot — Least-to-Most Decomposition, Plan Critique and Revise, Step-Back Prompting (abstract first, then solve), Structured Reasoning with Rationale Summary, Tree-of-Thoughts (branch + evaluate + prune)
- documentation — Code Explanation Generator (audience-aware), Code Diff Summary for Pull Request, Conventional Commit Message Generator, Error Message / Stack Trace Explainer
- eval-set — Answer Grounding Checker (hallucination detector), Multi-hop RAG Eval Question Synthesizer, Retrieval Relevance Evaluator
- extraction — API Spec to Tool Catalog Converter, Tool-Call Repair from Validation Error, Code Translation Across Languages, Dependency Impact Analyzer, Per-claim Factuality Judge (atomic decomposition), Chart and Table Extractor, Diagram to Structured Data, Document Layout Analyzer, Handwriting Transcriber with Per-Word Confidence, OCR + Structured Extraction from Document Images, UI Screenshot to Component Spec, Structured Image Caption Generator, Structured RAG Output Builder (table / list / schema from evidence)
- factuality — Code Evaluation Judge, Verify-Then-Finalize (self-check before commit), LLM-as-Judge Rubric for Open-Ended Outputs, Per-claim Factuality Judge (atomic decomposition), Reference-based Judge (output vs gold), VLM Image Description Verifier, Visual Question Answering with Grounding and Confidence, Answer Grounding Checker (hallucination detector), Citation Faithfulness Scorer
- generation — API Result to User-Readable Translator, API Spec to Tool Catalog Converter, Long-Context Trajectory Memory Summarizer, Tool Output Summarizer (compress before context), Code Explanation Generator (audience-aware), Code Diff Summary for Pull Request, Code Translation Across Languages, Conventional Commit Message Generator, Error Message / Stack Trace Explainer, Code Migration Plan Generator, Refactor Suggestion (with rationale and diff hint), Test Case Generator, Meta-Prompt Generator (generate prompts for a class of tasks), Domain-Specific Rubric Generator, Image Edit Instruction Generator (before/after to instruction), Structured Image Caption Generator, Chunk Summarizer for Retrieval, Retrieved Context Compression, Conversational Query Resolver (rewrite follow-ups as standalone), HyDE — Hypothetical Answer Generator for Retrieval, Multi-hop RAG Eval Question Synthesizer, Time-Aware Retrieval Query Rewriter, Constitutional Critique-and-Revise, Iterative DPO Pair Generator, Red-Team Prompt Generator (defensive safety probes), Code SFT Pair Generator, Multi-Turn Conversation SFT Data Generator, Instruction Variant Expander (seed → diverse rewrites), Persona-Controlled Response Generator, SFT Response Generator (instruction → high-quality response), Self-Instruct — Generate New Instructions from a Seed Bank, Style Transfer (rewrite text in target style)
- grounding — Citation-Grounded Reasoning (every claim must cite), Answer Grounding Checker (hallucination detector), Citation Faithfulness Scorer, Retrieval Relevance Evaluator
- harmlessness — Safety Output Classifier (defensive), Best-of-N Response Selector, Constitutional Critique-and-Revise, Helpfulness vs Harmlessness Tradeoff Scorer, Pairwise Preference Labeler (HHH dimensions), Pointwise Reward Scorer (single response → reward signal), Red-Team Prompt Generator (defensive safety probes), Refusal Calibration Probe
- helpfulness — Best-of-N Response Selector, Constitutional Critique-and-Revise, Helpfulness vs Harmlessness Tradeoff Scorer, Iterative DPO Pair Generator, Pairwise Preference Labeler (HHH dimensions), Pointwise Reward Scorer (single response → reward signal)
- holistic — LLM-as-Judge Rubric for Open-Ended Outputs, Multi-Turn Dialogue Judge, Pointwise Quality Scorer with Confidence
- honesty — Constitutional Critique-and-Revise, Pairwise Preference Labeler (HHH dimensions), Pointwise Reward Scorer (single response → reward signal)
- image-description — Structured Image Caption Generator, VLM Image Description Verifier
- instruction-tuning — Meta-Prompt Generator (generate prompts for a class of tasks), Code SFT Pair Generator, Multi-Turn Conversation SFT Data Generator, SFT Data Coverage Analyzer, SFT Data Quality Filter, Few-Shot Example Selector (pick best K demonstrations from a pool), SFT Instruction Deduplicator, Instruction Difficulty Classifier, Instruction Variant Expander (seed → diverse rewrites), Persona-Controlled Response Generator, SFT Response Generator (instruction → high-quality response), Self-Instruct — Generate New Instructions from a Seed Bank
- llm-judge — Code Evaluation Judge, Calibration Checker (predicted confidence vs actual accuracy), Human Eval Study Bootstrap, LLM Judge Bias Probe (length / position / format / verbosity), LLM-as-Judge Rubric for Open-Ended Outputs, Multi-Turn Dialogue Judge, Pairwise Judge with Position-Bias Probe, Per-claim Factuality Judge (atomic decomposition), Pointwise Quality Scorer with Confidence, Reference-based Judge (output vs gold), Output-Level Regression Detector, Safety Output Classifier (defensive), Persona Consistency Judge, Preference Rationalization Judge
- memory — Long-Context Trajectory Memory Summarizer, Tool Output Summarizer (compress before context)
- multi-hop — Multi-hop RAG Eval Question Synthesizer
- ocr — Handwriting Transcriber with Per-Word Confidence, OCR + Structured Extraction from Document Images
- pairwise — Pairwise Judge with Position-Bias Probe, Iterative DPO Pair Generator, Long-Context Pairwise Preference Labeler, Pairwise Preference Labeler (HHH dimensions)
- planning — Budget-Aware Agent Planner, Clarification Question Asker, Error Recovery Strategy (retry / abort / escalate), Multi-Agent Conflict Resolver, Plan-and-Execute Upfront Planner, ReAct Planner with Strict Tool Call Schema, Self-Critique Reflection Step for Agents, Sub-Task Delegator (multi-agent prep)
- preference-labeling — Iterative DPO Pair Generator, Long-Context Pairwise Preference Labeler, Pairwise Preference Labeler (HHH dimensions)
- query-rewriting — Conversational Query Resolver (rewrite follow-ups as standalone), HyDE — 
Hypothetical Answer Generator for Retrieval, Query Rewriting and Decomposition for Retrieval, Time-Aware Retrieval Query Rewriterranking— Query Fusion (combine sub-query results into ranked set), Best-of-N Response Selector, Few-Shot Example Selector (pick best K demonstrations from a pool)rationale-summary— Self-Consistency Aggregator (majority vote over reasoning paths), Structured Reasoning with Rationale Summaryreact— ReAct Planner with Strict Tool Call Schemareflection— Error Recovery Strategy (retry / abort / escalate), Self-Critique Reflection Step for Agentsretrieval— Chunk Summarizer for Retrieval, Retrieved Context Compression, Conversational Query Resolver (rewrite follow-ups as standalone), HyDE — Hypothetical Answer Generator for Retrieval, Multi-Source Answer Aggregator (with conflict surfacing), Query Fusion (combine sub-query results into ranked set), Query Rewriting and Decomposition for Retrieval, Retrieval Relevance Evaluator, Structured RAG Output Builder (table / list / schema from evidence), Time-Aware Retrieval Query Rewriterreward-modeling— Pointwise Reward Scorer (single response → reward signal), Reward Hacking Detectorrubric— Human Eval Study Bootstrap, LLM-as-Judge Rubric for Open-Ended Outputs, Multi-Turn Dialogue Judge, Domain-Specific Rubric Generatorsafety— Safety Output Classifier (defensive), Red-Team Prompt Generator (defensive safety probes), SFT Data Quality Filterscoring— API Design Reviewer (REST / GraphQL / gRPC), Code Evaluation Judge, Code Review Checklist (structured findings), Code Security Review (focused), Contrastive Self-Consistency (compare against intentionally-wrong), Reasoning with Explicit Uncertainty Quantification, Calibration Checker (predicted confidence vs actual accuracy), LLM Judge Bias Probe (length / position / format / verbosity), Multi-Benchmark Leaderboard Builder, LLM-as-Judge Rubric for Open-Ended Outputs, Multi-Turn Dialogue Judge, Pairwise Judge with Position-Bias Probe, Per-claim Factuality Judge 
(atomic decomposition), Pointwise Quality Scorer with Confidence, Reference-based Judge (output vs gold), Output-Level Regression Detector, VLM Image Description Verifier, Visual Question Answering with Grounding and Confidence, Answer Grounding Checker (hallucination detector), Citation Faithfulness Scorer, Retrieval Relevance Evaluator, Best-of-N Response Selector, Helpfulness vs Harmlessness Tradeoff Scorer, Pairwise Preference Labeler (HHH dimensions), Persona Consistency Judge, Pointwise Reward Scorer (single response → reward signal), Preference Rationalization Judge, Refusal Calibration Probe, Reward Hacking Detector, SFT Data Coverage Analyzer, SFT Data Quality Filter, Instruction Difficulty Classifierseed-expansion— Instruction Variant Expander (seed → diverse rewrites), Self-Instruct — Generate New Instructions from a Seed Bankself-check— Self-Critique Reflection Step for Agents, Contrastive Self-Consistency (compare against intentionally-wrong), Plan Critique and Revise, Self-Consistency Aggregator (majority vote over reasoning paths), Self-Correction Protocol (accept / correct / reject), Tree-of-Thoughts (branch + evaluate + prune), Reasoning with Explicit Uncertainty Quantification, Verify-Then-Finalize (self-check before commit), Constitutional Critique-and-Revisestructured-output— API Result to User-Readable Translator, API Spec to Tool Catalog Converter, Budget-Aware Agent Planner, Clarification Question Asker, Error Recovery Strategy (retry / abort / escalate), Long-Context Trajectory Memory Summarizer, Multi-Agent Conflict Resolver, Plan-and-Execute Upfront Planner, ReAct Planner with Strict Tool Call Schema, Self-Critique Reflection Step for Agents, Sub-Task Delegator (multi-agent prep), Tool-Call Repair from Validation Error, Tool Output Summarizer (compress before context), API Design Reviewer (REST / GraphQL / gRPC), Code Evaluation Judge, Code Explanation Generator (audience-aware), Code Review Checklist (structured findings), Code Diff 
Summary for Pull Request, Code Translation Across Languages, Conventional Commit Message Generator, Dependency Impact Analyzer, Error Message / Stack Trace Explainer, Code Migration Plan Generator, Refactor Suggestion (with rationale and diff hint), Code Security Review (focused), Test Case Generator, Citation-Grounded Reasoning (every claim must cite), Contrastive Self-Consistency (compare against intentionally-wrong), Least-to-Most Decomposition, Meta-Prompt Generator (generate prompts for a class of tasks), Plan Critique and Revise, Self-Consistency Aggregator (majority vote over reasoning paths), Self-Correction Protocol (accept / correct / reject), Step-Back Prompting (abstract first, then solve), Structured Reasoning with Rationale Summary, Tree-of-Thoughts (branch + evaluate + prune), Reasoning with Explicit Uncertainty Quantification, Verify-Then-Finalize (self-check before commit), Calibration Checker (predicted confidence vs actual accuracy), Human Eval Study Bootstrap, LLM Judge Bias Probe (length / position / format / verbosity), Multi-Benchmark Leaderboard Builder, Multi-Turn Dialogue Judge, Pairwise Judge with Position-Bias Probe, Per-claim Factuality Judge (atomic decomposition), Pointwise Quality Scorer with Confidence, Reference-based Judge (output vs gold), Output-Level Regression Detector, Domain-Specific Rubric Generator, Safety Output Classifier (defensive), Chart and Table Extractor, Diagram to Structured Data, Document Layout Analyzer, Handwriting Transcriber with Per-Word Confidence, Custom-Category Image Classification, Image Pair Comparison Explainer, Image Edit Instruction Generator (before/after to instruction), OCR + Structured Extraction from Document Images, UI Screenshot to Component Spec, Structured Image Caption Generator, Visual Question Answering with Grounding and Confidence, Answer Grounding Checker (hallucination detector), Chunk Summarizer for Retrieval, Citation Faithfulness Scorer, Retrieved Context Compression, 
Conversational Query Resolver (rewrite follow-ups as standalone), Multi-Source Answer Aggregator (with conflict surfacing), Query Fusion (combine sub-query results into ranked set), Query Rewriting and Decomposition for Retrieval, Structured RAG Output Builder (table / list / schema from evidence), Time-Aware Retrieval Query Rewriter, Best-of-N Response Selector, Constitutional Critique-and-Revise, Helpfulness vs Harmlessness Tradeoff Scorer, Iterative DPO Pair Generator, Long-Context Pairwise Preference Labeler, Persona Consistency Judge, Pointwise Reward Scorer (single response → reward signal), Preference Rationalization Judge, Red-Team Prompt Generator (defensive safety probes), Refusal Calibration Probe, Reward Hacking Detector, Code SFT Pair Generator, Multi-Turn Conversation SFT Data Generator, SFT Data Coverage Analyzer, SFT Data Quality Filter, Few-Shot Example Selector (pick best K demonstrations from a pool), SFT Instruction Deduplicator, Instruction Difficulty Classifier, Persona-Controlled Response Generator, SFT Response Generator (instruction → high-quality response), Self-Instruct — Generate New Instructions from a Seed Bank, Style Transfer (rewrite text in target style)structured-reasoning— Citation-Grounded Reasoning (every claim must cite), Contrastive Self-Consistency (compare against intentionally-wrong), Least-to-Most Decomposition, Plan Critique and Revise, Self-Consistency Aggregator (majority vote over reasoning paths), Self-Correction Protocol (accept / correct / reject), Step-Back Prompting (abstract first, then solve), Structured Reasoning with Rationale Summary, Tree-of-Thoughts (branch + evaluate + prune), Reasoning with Explicit Uncertainty Quantification, Verify-Then-Finalize (self-check before commit)synthesis— Chunk Summarizer for Retrieval, Retrieved Context Compression, HyDE — Hypothetical Answer Generator for Retrieval, Multi-Source Answer Aggregator (with conflict surfacing), Multi-hop RAG Eval Question Synthesizer, Query 
Fusion (combine sub-query results into ranked set), Structured RAG Output Builder (table / list / schema from evidence)test-generation— Test Case Generatortool-use— API Result to User-Readable Translator, API Spec to Tool Catalog Converter, ReAct Planner with Strict Tool Call Schema, Tool-Call Repair from Validation Errorvision— Chart and Table Extractor, Diagram to Structured Data, Document Layout Analyzer, Handwriting Transcriber with Per-Word Confidence, Custom-Category Image Classification, Image Pair Comparison Explainer, Image Edit Instruction Generator (before/after to instruction), OCR + Structured Extraction from Document Images, UI Screenshot to Component Spec, Structured Image Caption Generator, VLM Image Description Verifier, Visual Question Answering with Grounding and Confidencevlm-eval— Chart and Table Extractor, Document Layout Analyzer, VLM Image Description Verifier, Visual Question Answering with Grounding and Confidence