13 分钟 原稿

Prompting Best Practices for LLMs

Prompt Engineering 的核心原则与工程实践,覆盖 Claude 与 GPT 的模型特性与优化策略。

Prompt Engineering 是提升 LLM 表现最简单有效的办法。尽管大家觉得这毫无门槛,但在工程实践中,这确实很有效。

本质上讲,Prompt Engineering 会做两件事情:提供更多的输入、对齐人类的需求。

高效的沟通本身就是一门技能,和人、和 LLM,并无差别。

优秀的参考资料:

  1. Anthropic 的 Prompting best practices,工程化的典范。
  2. OpenAI 的 Prompt Guidance,可以感受两家的差异。
  3. Prompting Guide,相对偏学术一点,内容可能没有那么实时,但比较全。

Prompt Engineering 和 Context Engineering 相互覆盖,互有牵连。Prompt Engineering 本身更侧重单次对话的交互,而 Context Engineering 涉及 Agent 全生命周期的 Context 管理。我们也会涉及一些 Context Engineering 的 Principles.

Why Prompt Engineering?

为什么需要做 Prompt Engineering?

LLMs are strong. 然而,复杂的人类需求,往往需要复杂的描述。

对人类而言,长期的沟通、共同的工作环境等等,让沟通交流相对变得高效。熟悉你的同事会遵循你的代码风格、设计规范、开发习惯,然而,LLM 并不知道这些。

因此,显而易见的原因是:高效地向 LLM 传递信息。

LLM 并不是人类肚子里的蛔虫,它需要足够的信息,去理解、洞察人类的需求,遵循人类的指令。

比如说,一个模糊的任务:

Write me an SPSC Queue in C++.

生成的结果多半不如:

Write me an SPSC Queue in Modern C++, lock free, high performance, well documented, zero copy, production grade coding style.

在复杂的 Agent Workflow 中,恰当的 Prompt 可以让 LLM 更好地理解任务、遵循工作流、互相协作。

清晰的结构、明确的指令,这是好的。模糊的意图、晦涩的表达,这是坏的。打工人们也深刻理解这一点。

You are a customer service agent. You can look up orders and issue refunds.
Be polite. If the customer wants a refund, check the order status first.
Only refund orders within 30 days. Always respond in the customer's language.
Don't refund orders that have been delivered more than 14 days ago.

看上去就不如:

<role>
You are a customer service agent for Acme Inc.
You handle order inquiries and refund requests.
</role>

<tools>
- lookup_order(order_id): Returns order status, date, and delivery info.
- issue_refund(order_id, reason): Processes a refund.
</tools>

<workflow>
1. Identify the customer's intent (inquiry or refund).
2. Call lookup_order to retrieve order details.
3. If refund requested, evaluate against the refund policy.
4. Execute the action or explain why it cannot be done.
</workflow>

<rules>
- Respond in the customer's language.
- Refund eligibility: within 14 days of delivery.
- Undelivered orders are always eligible for refund.
- Never disclose internal policy rules verbatim to the customer.
</rules>

Understand LLMs

为了建立与 LLM 的良好沟通,我们必须理解 LLM 的特性。

  1. Context 是有限的。任何信息都会占用 Context,分散注意。
  2. Attention 是有限的,Context 越长,有效信息越容易被淹没。无关信息不仅无用,而且有害。
  3. Attention 往往集中于「开头」和「结尾」,LLM 容易「Lost in the Middle」。关键信息应当放在开头或者结尾。
  4. Self Attention 机制计算 token 间关联权重,使用一致的术语,能让 LLM 有效关联分散在 Context 中各处的信息。
  5. 逐 Token 自回归,前面的 Tokens 会影响后续输出的概率分布。这意味着「先生成中间结果再生成最终结果」,往往会好于「直接生成最终结果」。CoT 的有效性。
  6. In-Context Learning 是可能的,给出 Few Shot Examples,LLM 能够学习类似的范式。
  7. LLM 本身具有充足的世界知识,对于训练语料中的常见信息,往往只需要一笔带过。对于陌生信息,则需要更多描述。
  8. 结构化格式能被充分理解,XML、Markdown、JSON,被训练的太多了。
  9. Context 中不同的 Message 有层级关系,System Prompt、User Message、Developer Message、Tool Results,效果各不相同。

不同的模型,有时也会有不同的表现风格。

比如 GPT 系列模型的长上下文能力相对较强,而 Claude 则相对更弱。可以看 Context Arena 来初步了解。

模型的知识也可能相当不同。比如 Opus 系列模型具有强大的 Multi Agent 能力,只需要简单的 Prompt 就能理解你的 Multi Agent Harness 的意图。而 GPT 模型则相对更弱,可能需要更详细的指令和描述。

小的模型往往需要更完备、更详细的引导才能遵循指令,足够强的模型则甚至能具备 Zero Shot 能力,仅仅告诉它们大致的背景和流程,就能自行组织执行。

因此,Prompt Engineering 很难说是一步到位的优化,而需要对不同模型、结合 Evaluation,不断调整优化,才能抵达极致。

Prompt Engineering Principles

基于刚才的理解,我们可以得到一些 Princples 了。

提供更多的输入

  1. Be Explicit, Not Implicit. 不要假设模型理解你的隐含意图。人与人之间有大量共享的默契,但模型没有。你认为”显而易见”的事情,对模型来说可能并不显然。
  2. Show, Don’t Tell. 给出 examples 往往比描述规则更有效。与其写一大段话解释你想要的输出格式,不如直接给出例子。
  3. Provide Context, Not Just Instructions. 告诉模型”为什么”做这件事,而不仅仅是”做什么”。背景信息能帮助模型在模糊地带做出更合理的判断。
  4. Leverage Model’s Knowledge. 模型已知的领域简明扼要即可,把宝贵的 Context 留给未知信息——内部规范、私有 API、业务逻辑。

对齐人类的需求

  1. Define the Role. 角色设定看起来玄学,但它实际上能缩小模型的输出分布——一个 “Senior Security Engineer” 和一个 “Junior Intern” 会给出不同倾向的回答。
  2. Specify Output Format. 明确你要什么格式:JSON、Markdown、代码、纯文本。模型善于遵循格式约束,但需要明确指令。
  3. Set Boundaries. 明确告诉模型的 Scope,应该做什么,不应该做什么。
  4. Handle Uncertainty Explicitly. 告诉模型在不确定时怎么做——询问用户、拒绝执行、又或者给出猜想。如果不加引导,LLM 倾向于 Halluciate.
  5. Decompose Complex Tasks. 复杂任务拆解为明确的步骤或子任务,能引导 Agent 的执行,降低思考难度。

工程层面

  1. Concise and Effective. 优化表达,使用更少的 Tokens,高效传达信息。
  2. Structure Your Prompt. 用 XML tags、Markdown 等结构化格式组织 Prompt,清晰、易于理解。
  3. Order Matters. 利用 Attention 的分布特性,关键指令放在开头或结尾。
  4. Progressive Disclosure. 不立即需要的信息,可以以后再获取。这有两个好处:既节约了初始的 Context,又能让需要的信息进入 Context 的末尾,从而获得更高的权重。同理,在 Tool Results 中植入 Instructions,也是一个有效的做法。
  5. Terminology Consistency. 使用一致的术语,有助于模型联系上下文、增强注意力。另一点,当需要明确区分、减少干扰时,也可以使用罕见、不同的术语,引导模型注意力。
  6. Use Direct, Assertive Language. “You MUST” 和 “You should” “You can” 效果差异显著,注意不同场景使用不同的语气。需要遵循的指令应当使用强约束。
  7. Iterate with Evaluation. Prompt Engineering 是实验科学,不是一次写对的。需要 Evaluation 驱动迭代,尤其在切换模型时。

Model Examples

不同的模型有不同的需求,这里通过官方文档分析,总结它们的 Best Practices.

Claude

参考 Prompting best practices.

Claude 很聪明,但毫无 Context. 因此你必须给出足够的信息。

XML Tags 是一等公民,Claude 非常善于理解 XML Tags,如 <instructions> <context> <example> <document> 等,嵌套也可以。

Few-shot Examples 应当使用 <example><examples> 包裹,帮助 Claude 区分指令和示例。官方推荐 3-5 个 Examples.

Claude 的长上下文相对弱,因此需要精心的编排组织。

将长文档放在 Prompt 开头,Query 和指令放在结尾,文档用 <document> 包裹,测试中可提升最多 30% 的响应质量。

对长文档任务,要求 Claude 先引用原文相关段落(放入 <quotes>),再基于引用作答。这有助于模型从噪声中定位关键信息。

一般不需要显式引导 CoT,简单的「Think Thoroughly」往往比手写的 Step by Step Plan 更好。Claude 的推理能力往往超出人类的预设步骤。

Few-shot Examples 中可以包含 <thinking> tags, Claude 会将这种推理模式泛化到自己的 thinking blocks 中。

Self-check 很有效。 追加 “Before you finish, verify your answer against [criteria]“,在 Coding 和 Math 任务中效果显著。

Opus4.6 相比于 Opus4.5 的一大转变:模型开始更积极地行动,调用工具、使用 Subagent、思考,有时过于积极,以至于你需要约束模型。

Old Style:鼓励模型使用工具。

CRITICAL: You MUST use search_tool whenever the user asks a question. If in doubt, use the tool.

New Style:约束模型仅在合适的时候使用工具。

Use search_tool when it would enhance your understanding of the problem.

Subagent 也是类似的。

Overengineering:模型倾向于创建过多文件、添加过度抽象、为不存在的问题添加防御性编码,需要明确约束 Scope 来避免。

Overthinking:在高 Reasoning Effort 下,Opus 4.6 会进行大量前期探索,有时可以加以引导恰当降低思考强度。

Choose an approach and commit to it. Avoid revisiting decisions unless new information directly contradicts your reasoning.

Claude 现在会区分「suggest」和「change」,默认是保守地不执行。可以用 Prompt 引导默认执行还是默认不执行。

输出格式控制的技巧:

  1. 正面指令优于负面约束。“Your response should be composed of flowing prose paragraphs” 比 “Do not use markdown” 更有效。
  2. Prompt 的格式影响输出格式。如果你希望输出 Markdown,那么 Prompt 最好也用 Markdown.
  3. Opus 4.6 默认使用 LaTeX 表达数学内容,如果不需要,则必须显式关闭。
  4. Structured Outputs 可以保证输出是合法、符合约束的 JSON.

还有 Context Awareness:模型能感知剩余 context 空间,不再盲目工作。模型在对话开始时收到 <budget:token_budget>200000</budget:token_budget>,每次 tool call 后收到 <system_warning>Token usage: 35000/200000; 165000 remaining</system_warning>,而模型能据此行动。

Claude 4.6 极其擅长从文件系统中发现状态,有时从零开始比 compaction 更有效。

可以利用 Files 来保存状态。

Claude 4.6 可能主动执行不可逆操作(删除文件、force push、发消息)。模型不会自动判断哪些操作需要确认——你必须显式列出。官方推荐按「可逆性」和「影响范围」分级:本地可逆操作(编辑文件、运行测试)自由执行,不可逆或影响他人的操作(push、删除分支、发外部消息)需要确认。

GPT

参考 OpenAI Prompt Guidance.

OpenAI 的 Guidance 中主要是各种例子。

OpenAI 区分 Message Roles,包括 Developer/User/Assistant,优先级依次递减。

Developer Message 推荐按 Identity -> Instructions -> Examples -> Context 的顺序组织。

给出的例子大部分是以 XML block 的形式,给出一种 Contract.

例如 output contract:

<output_contract>
- Return exactly the sections requested, in the requested order.
- Apply length limits only to their intended section.
- If format required (JSON/Markdown/XML), output ONLY that format.
</output_contract>

可以利用强制检查是否完成、自检验循环等方式,阻止 Agent 半途而废。

<completeness_contract>
- Treat the task as incomplete until all requested items are covered or explicitly marked [blocked].
- Keep an internal checklist of required deliverables.
- For lists or paginated results: track processed items, confirm coverage before finalizing.
</completeness_contract>
<verification_loop>
Before finalizing:
- Check: does output satisfy every requirement?
- Check: are claims backed by provided context?
- Check: does formatting match schema?
- Check: any external side effects need permission?
</verification_loop>

同样需要明确的指令,防止它在信息不足时猜测:

<missing_context_gating>
- If required context is missing, do NOT guess.
- Prefer the appropriate lookup tool when the missing context is retrievable.
- If you must proceed, label assumptions explicitly.
</missing_context_gating>

当上下文相对较短、Context 不足时,GPT 可能会选择错误的工具,往往需要引导:

<dependency_checks>
- Before taking an action, check whether prerequisite discovery, lookup, or memory retrieval steps are required.
- Do not skip prerequisite steps just because the intended final action seems obvious.
</dependency_checks>

对于空结果,工具调用的返回,GPT 可能会过早地认为空就是没有。

<empty_result_recovery>
If lookup returns empty or partial results:
- Do not immediately conclude none exist.
- Try 1-2 fallback strategies (alternate query, broader filters, prerequisite lookup) before reporting failure.
</empty_result_recovery>

有的时候可能需要鼓励工具调用:

<tool_persistence_rules>
- Do not stop early when another tool call is likely to materially improve correctness or completeness.
- If a tool returns empty or partial results, retry with a different strategy.
</tool_persistence_rules>

GPT 推理模型不建议在 Function Definitions 中添加 Examples,这会降低性能,很反直觉的一点,或许以后需要测试一下。

官方文章中建议:reasoning effort 只是最终手段,应当优先考虑 Prompt 引导和 Context,完善 Prompt 本身,添加 completeness contract、verification loop、tool persistence rules,仍然不满意时,再提高 Effort. 大多数团队应当默认使用 none、low、medium. 不过我个人认为,在 Coding Agent 的场景,起手用 high/xhigh 也没什么问题。

为了对抗幻觉,需要明确 Citation:

<citation_rules>
- Only cite sources retrieved in the current workflow.
- Never fabricate citations, URLs, IDs, or quote spans.
- Attach citations to specific claims, not only at the end.
</citation_rules>

配合 Grounding Rules:若来源冲突,明确说明冲突并归因。若上下文不足,收窄回答范围,但不胡乱猜测。

GPT 的长上下文表现相对稳定。

可以使用 Prompt 让 GPT 更积极地直接干活,和 Claude 的那一段类似,但可以看出风格有一定不同。

<autonomy_and_persistence>
Unless the user explicitly asks for a plan or brainstorm,
assume the user wants you to make code changes, not just analysis.
Persist until the task is fully handled end-to-end.
</autonomy_and_persistence>

Conclusion & Skill

可以将上述的内容总结为一个 Skill,这样 Agent 以后就更懂如何写 Prompts 了。

---
name: prompt-engineer
description: Guide for writing effective instructions, specifications, and context for LLM consumption. Covers every artifact type where one system communicates intent to an LLM — subagent task prompts, agent definitions (.claude/agents/), skill files (SKILL.md), MCP tool descriptions, project instructions (CLAUDE.md/AGENTS.md), system/developer prompts, evaluator prompts, few-shot sets, and structured context passing. Use this skill whenever crafting, rewriting, auditing, or debugging any text intended for an LLM to consume and follow — whether delegating to a subagent, defining a tool contract, writing project conventions, coordinating a multi-agent team, or structuring context for another model. Also use when the output quality of an LLM artifact feels off and you suspect the instructions are the bottleneck.
---

# Prompt Engineering

A prompt is a behavior contract between you and a system with limited attention and no shared memory. This applies to every artifact an LLM consumes: system prompts, subagent task descriptions, agent definitions, skill files, MCP tool descriptions, project instructions, team coordination protocols, evaluator criteria, and structured context. Write the minimum effective specification.

**When you are the author:** If you are an AI agent writing an artifact for another LLM — spawning a subagent, defining a tool, drafting a CLAUDE.md — apply this guide to your own output. Read the relevant artifact pattern, structure your output accordingly, and cold-read test it before delivering.

When principles conflict: safety > correctness > conciseness.

## Diagnose the real problem

Before touching text, determine whether the actual cause is:
- missing, wrong, or stale context
- overloaded context drowning signal
- poor retrieval or bad tool contract
- missing prerequisite steps or workflow decomposition
- conflicting instructions across message roles or files
- weak output contract or missing uncertainty handling
- inappropriate autonomy or permission boundaries
- artifact type mismatch (wrong format for the consumer)
- genuine wording problem

Do not solve context, retrieval, or tooling failures with better adjectives.

## Gather requirements

Do not start drafting until you know:
- **Artifact type**: system prompt, developer prompt, user prompt, subagent task, agent definition, skill file, MCP description, project instructions (CLAUDE.md), team protocol (AGENTS.md), evaluator prompt, few-shot set.
- **Consumer**: which model(s) and in what context (standalone, tool-using, agentic, multi-agent).
- **Target model**: Claude Opus 4.6, GPT 5.4, or cross-model.
- **Success condition**: what observable outcome defines done.
- **Audience**: who or what consumes the output downstream.
- **Constraints**: available tools, context budget, output format, hard boundaries, approval requirements.
- **Information gap**: what the model must be told vs. what it already knows.
- **Failure policy**: what should happen when context is missing, tools fail, or the model is uncertain.

If critical information is missing, ask for it. If proceeding under assumptions, state them.

## Principles

**Be explicit.** State every requirement, constraint, and edge case directly. Include background that changes judgment — omit the rest. The model fills gaps with its own priors, not yours.

**Handle uncertainty.** Specify behavior when context is missing or ambiguous: ask, retrieve, refuse, flag assumptions, narrow scope. This is the primary lever against hallucination. If citations are required, require grounded sources — never allow fabricated references.

**Show, don't just tell.** Examples outperform rules when behavior is easier to demonstrate than describe. Good examples are minimal, representative, and include edge cases. Bad examples — verbose, unrepresentative, or contradicting rules — are worse than none.

**Decompose.** Stage complex tasks with intermediate artifacts (evidence extraction, classification, checklists, candidate selection). Intermediate outputs improve final quality. Prefer light guidance over rigid CoT for strong models.

**Specify the output.** Format, structure, ordering, length, schema. For machine-consumed output, require exact schemas. For human-consumed output, constrain only what matters.

**Maximize signal density.** Every token should change behavior or provide necessary context. Spend tokens on private knowledge — internal APIs, business rules, proprietary conventions — not common knowledge. A longer artifact is justified when added tokens carry real constraints, context, or examples. Remove filler and duplication ruthlessly.

**Calibrate language.** "Must" for invariants, "should" for defaults, "may" for options. Marking everything MUST/CRITICAL/ALWAYS dilutes actual priorities into noise. Explain *why* behind non-obvious rules — LLMs respond better to reasoning than rigid directives.

**Order by attention.** Long reference material at the top, critical instructions at the beginning or end. Never bury important requirements in the middle. For artifacts with deferred loading (skills, agents), put triggering-critical info in the always-loaded portion.

**Use the right channel.** Durable policy in system/developer messages or project-level files (CLAUDE.md). Task-specific input in user messages or task prompts. Retrieved facts in tool results. Resolve conflicts rather than stacking contradictory rules across channels.

**Match the artifact to the consumer.** A subagent task prompt needs a clear completion condition and output location. An MCP description needs trigger clarity and parameter semantics. A CLAUDE.md needs durable conventions, not session-specific instructions. Choose the right artifact type for the information's lifecycle and audience.

## Build the artifact

Use the lightest structure that reliably induces the target behavior. Structure reduces ambiguity — it is not a ritual.

Include only sections that earn their tokens. General scaffold for prompt-type artifacts:

```text
<role>
Perspective or expertise — only if it changes the output.
Do not substitute roleplay for instructions.
Useful: "senior security reviewer." Low value: "world-class genius assistant."
</role>

<objective>
Task and observable completion condition.
Prefer "return valid JSON matching the schema" over "write a great answer."
</objective>

<context>
Background that changes decisions. Nothing else.
</context>

<tools>
Name, purpose, preconditions, limits, return schema.
When to use, when not to, and retry behavior.
</tools>

<workflow>
Staged steps when decomposition improves reliability.
Prefer observable intermediate artifacts over verbose chain-of-thought.
</workflow>

<rules>
Required behavior, prohibited behavior, actions needing confirmation.
Prefer positive specifications over negative-only rules.
State non-goals explicitly.
</rules>

<output_contract>
Format, structure, ordering, length, schema.
</output_contract>

<uncertainty_policy>
Behavior when context is missing or ambiguous.
</uncertainty_policy>

<examples>
Minimal, representative input-output pairs.
Include contrastive examples when boundaries matter.
Delimit clearly from live input.
</examples>

<verification>
Self-checks before finalizing: coverage, correctness, format, assumptions, permissions.
</verification>
```

Keep terminology consistent — same word for same concept, deliberately different words for different concepts. Parameterize values that change per invocation (paths, model names, thresholds) so the artifact is reusable.

## Artifact patterns

Each artifact type has its own shape. Apply the general principles above, then the type-specific patterns below.

### Subagent task prompts

A subagent gets one shot with limited context. It does not share your conversation history. Front-load what matters.

1. **Task** — what to do, in one sentence
2. **Context** — background the subagent lacks (file paths, architecture, conventions, decisions already made)
3. **Scope** — what to touch and what to leave alone
4. **Output** — where to save results, what format, what constitutes done
5. **Constraints** — tools available, files off-limits, time/token budget

**Good:** "Refactor all Express route handlers in /src/routes/ to use the new AuthMiddleware from /src/middleware/auth.ts. Do not modify test files. Save a summary of changes to /tmp/refactor-report.md. Done when all routes compile and existing tests pass."

**Bad:** "Refactor the auth system to be better. Let me know what you find."

Pitfalls: assuming shared context (the subagent has none), omitting output location, vague completion criteria ("improve it" vs "all tests pass and no new lint errors"), overloading one subagent with unrelated tasks.

### Agent definitions (`.claude/agents/`)

A reusable role specification. Loaded fresh each invocation with zero prior context.

1. **Identity** — what this agent does, one paragraph
2. **Capabilities** — tools available, what it can and cannot do
3. **Workflow** — default approach (investigate, plan, act, verify)
4. **Boundaries** — what requires escalation, what's autonomous, what's forbidden
5. **Communication** — how it reports status, asks for help, hands off work
6. **Quality bar** — what "done" looks like for this agent's typical tasks

Write for cold start. Everything the agent needs to orient must be here or discoverable from here.

### Skill files (SKILL.md)

A skill loads on demand. The description triggers loading; the body guides execution.

**Description (frontmatter):** What the skill does AND when to use it — both matter for triggering. Include concrete trigger phrases and adjacent use cases. Err toward over-triggering; it's easier to not-use a loaded skill than to miss loading one.

**Body:** Under 500 lines. Imperative instructions — tell the model what to do, not what it is. Include examples for behaviors easier to show than describe. Explain *why* behind non-obvious instructions. Use bundled `references/` for overflow. Parameterize environment-specific values.

### MCP tool descriptions

The description is the model's only signal for when to use a tool. Encode both trigger conditions and usage contract.

1. **What it does** — one sentence, precise verb
2. **When to use it** — specific scenarios and trigger conditions
3. **When NOT to use it** — adjacent scenarios that need a different tool
4. **Parameters** — name, type, description, constraints, defaults
5. **Return value** — shape, fields, edge cases (empty results, errors)
6. **Preconditions** — what must be true before calling

**Good description:** "Search internal documentation by keyword. Use when answering questions about company policies, architecture, or runbooks. Do NOT use for general knowledge questions or code search — use web_search or code_search instead."

**Bad description:** "A tool that searches documents."

Under ~200 words. Models skim tool lists; dense beats verbose.

### Project instructions (CLAUDE.md/AGENTS.md)

Durable conventions for an entire codebase. Loaded every session.

1. **Project overview** — what this codebase is, 2-3 sentences
2. **Architecture** — key directories, major components, data flow
3. **Development conventions** — language, framework, formatting, naming, commit style
4. **Build and test** — exact commands, not "run the tests"
5. **Boundaries** — files/patterns to avoid, things requiring human approval
6. **Common patterns** — how the team solves recurring problems here

**Good build command:** `pnpm --filter @app/frontend test -- --coverage`
**Bad build command:** "Run the frontend tests with coverage"

Pitfalls: session-specific instructions (belong in user messages), stale info, excessive rules without rationale, contradicting model built-in behaviors without good reason.

### Team coordination

How multiple agents work together. Must prevent conflicts and duplication.

1. **Team composition** — roles, capabilities, what each agent owns
2. **Task flow** — how work gets assigned, decomposed, handed off
3. **Communication protocol** — when to message, what to include, how to escalate
4. **Shared state** — where artifacts live, naming conventions, conflict resolution
5. **Completion criteria** — how the team knows the project is done
6. **Failure handling** — what happens when an agent is blocked, fails, or produces bad output

The critical challenge: preventing agents from duplicating work or making conflicting changes. Be explicit about ownership boundaries and synchronization points.

## Common anti-patterns

Recognize these in your own output and in artifacts you audit:

- **Adjective engineering.** "You are an incredibly thorough, world-class expert" adds zero behavioral change. Replace with specific instructions.
- **MUST/ALWAYS overload.** When everything is critical, nothing is. Reserve strong language for actual invariants; explain reasoning for the rest.
- **Contradictory pairs.** "Be concise but comprehensive." "Be creative but follow the template exactly." Pick one or specify the priority.
- **Assumed shared context.** Subagents, agents, and skills start cold. If they need file paths, architecture decisions, or conversation history, provide it explicitly.
- **Vague completion.** "Make it better" or "improve the code" gives the model no way to know when to stop. Specify an observable condition: "all tests pass," "lint errors reduced to zero," "output matches this schema."
- **Filler sections.** Sections that restate common knowledge or model defaults waste attention budget. Only include what changes behavior.
- **Negative-only rules.** "Don't do X, don't do Y, don't do Z" without saying what to do instead. Pair prohibitions with positive specifications.

## Adapt to task type

**Structured extraction or classification**: Define the schema or label set exactly. Include abstain/unknown when appropriate. Forbid extra prose if machine-consumed.

**Long-context tasks**: Source material first, query and instructions at the end. Extract evidence before synthesis when recall matters.

**Agentic and tool-using tasks**:
- Classify actions by reversibility — list explicitly which require confirmation.
- Specify retry and fallback behavior. Require at least one alternative before concluding failure.
- Capture baseline before acting so the agent can detect regressions.
- Define done, blocked, and when to ask rather than guess.
- Constrain scope to prevent overengineering and speculative exploration.

**Open-ended generation**: Specify audience, purpose, and failure modes. Do not overconstrain creative tasks unless consistency matters more than originality.

## Model adaptation

### Claude (Opus 4.6, Sonnet 4.6)

- XML tags for structure (`<context>`, `<rules>`, `<output>`)
- Documents and reference material first, instructions last — recency bias gives later instructions more weight
- Light reasoning guidance ("Think thoroughly about X") over rigid step-by-step plans
- Constrain proactive behavior with "use X when it would help" rather than "you MUST always use X" — overcompliance with hard rules creates rigidity
- Extended thinking available for complex reasoning — nudge with "think carefully" rather than prescribing CoT format
- Trust Claude to choose tools appropriately; describe tools precisely but don't over-prescribe usage

**Claude-specific example** — Instead of:
```
You MUST ALWAYS check for null values. You MUST NEVER skip validation.
```
Write:
```
Check for null values at system boundaries (user input, API responses) where they indicate missing data. Internal function calls between trusted modules can assume valid inputs.
```

### GPT (5.4 / Codex)

- Contract-style blocks (output_contract, completeness_contract, verification_loop) enforce thoroughness
- Encode prerequisite discovery steps explicitly — GPT benefits from structured workflows more than Claude
- For reasoning models: no examples in tool definitions — put them in the prompt body
- Optimize prompts before increasing reasoning effort — better instructions beat more thinking tokens
- JSON mode: use response_format with strict schemas for structured output
- Tool use: explicit priority ordering helps when multiple tools could apply
- Codex: include file structure context upfront; explicit "edit these files, don't create new ones" instructions

**GPT-specific example** — Instead of:
```
Review the code for issues.
```
Write:
```
## output_contract
Return a JSON array of findings, each with: file, line, severity, description, fix.
## verification_loop
After generating findings, re-read each file and confirm no findings were missed.
```

### Cross-model compatibility

When artifacts must work across both model families:
- Use markdown headers and clear section delimiters (both handle well)
- Avoid model-specific features (XML tags, JSON mode) in shared artifacts
- Write in plain imperative prose — universally effective
- The most portable format: clear markdown with descriptive headers, explicit examples, unambiguous output specifications
- Test on both targets — prompts that work on Claude may fail on GPT and vice versa

## Audit and revise

When an artifact isn't working, check for:
- hidden assumptions or missing success condition
- conflicting instructions across sections, message roles, or files
- missing output contract, uncertainty policy, or tool preconditions
- critical instructions buried in the middle of long context
- overloaded or irrelevant context
- inconsistent terminology
- examples that teach the wrong pattern
- missing safety or approval boundaries
- wrong artifact type for the use case
- no evaluation plan

Fix the smallest set of causes that explains the failure. Do not rewrite what works.

When revising:
1. Preserve the original intent.
2. Remove redundancy.
3. Resolve contradictions.
4. Convert vague preferences ("be helpful") into observable behaviors.
5. Replace negative-only rules with positive specifications.
6. Cold-read test: read the artifact as the target model seeing it for the first time. Would you know what to produce, what matters, what is optional, and what to do when blocked?
7. The revised artifact should have higher signal density. If longer, every added token must earn its place.

## Evaluate

For important or reusable artifacts, define a compact eval set covering:
- normal cases
- edge cases
- ambiguity cases
- tool-failure cases (if agentic)
- permission-boundary cases (if agentic)
- format compliance cases
- cross-model behavior (if targeting multiple models)

Revise against observed failures, not intuition. Re-test when changing models, tools, or surrounding workflow.

## Deliverable

When producing or revising an LLM-consumable artifact, return:
1. The artifact, ready to use.
2. Key assumptions or missing context.
3. Brief rationale for structural choices.
4. Eval cases for important or reusable artifacts.
5. Model-specific notes if targeting a particular model.