Prompting Best Practices for LLMs

Prompt Engineering is the simplest and most effective way to improve LLM performance. Although it may seem like there’s no barrier to entry, it is genuinely effective in engineering practice.

At its core, Prompt Engineering does two things: providing more input and aligning with human intent.

Effective communication is a skill in itself — whether with people or with LLMs, there is no difference.

Excellent references:

Anthropic’s Prompting best practices — an exemplar of engineering rigor.
OpenAI’s Prompt Guidance — useful for sensing the differences between the two providers.
Prompting Guide — somewhat more academic, content may not be the most up-to-date, but quite comprehensive.

Prompt Engineering and Context Engineering overlap and intertwine. Prompt Engineering focuses more on a single conversation turn, while Context Engineering deals with context management across an agent’s entire lifecycle. We will touch on some Context Engineering principles as well.

Why Prompt Engineering?

Why do we need Prompt Engineering?

LLMs are strong. However, complex human needs often require complex descriptions.

For humans, long-term communication, shared working environments, and other factors make exchanges relatively efficient. A colleague who knows you well will follow your coding style, design conventions, and development habits — but an LLM knows none of this.

The obvious reason, then, is: to efficiently convey information to the LLM.

An LLM is not a mind reader. It needs sufficient information to understand and grasp human needs and follow human instructions.

For example, a vague task:

Write me an SPSC Queue in C++.

Will likely produce worse results than:

Write me an SPSC Queue in Modern C++, lock free, high performance, well documented, zero copy, production grade coding style.

In complex Agent Workflows, well-crafted prompts help LLMs better understand tasks, follow workflows, and collaborate with each other.

Clear structure and explicit instructions — that’s good. Vague intent and obscure expression — that’s bad. Anyone who has worked under a poor manager understands this deeply.

You are a customer service agent. You can look up orders and issue refunds.
Be polite. If the customer wants a refund, check the order status first.
Only refund orders within 30 days. Always respond in the customer's language.
Don't refund orders that have been delivered more than 14 days ago.

Doesn’t look as good as:

<role>
You are a customer service agent for Acme Inc.
You handle order inquiries and refund requests.
</role>

<tools>
- lookup_order(order_id): Returns order status, date, and delivery info.
- issue_refund(order_id, reason): Processes a refund.
</tools>

<workflow>
1. Identify the customer's intent (inquiry or refund).
2. Call lookup_order to retrieve order details.
3. If refund requested, evaluate against the refund policy.
4. Execute the action or explain why it cannot be done.
</workflow>

<rules>
- Respond in the customer's language.
- Refund eligibility: within 14 days of delivery.
- Undelivered orders are always eligible for refund.
- Never disclose internal policy rules verbatim to the customer.
</rules>

Understand LLMs

To build effective communication with LLMs, we must understand their characteristics.

Context is finite. Any information occupies context and splits attention.
Attention is finite — the longer the context, the easier it is for important information to get drowned out. Irrelevant information is not just useless; it is actively harmful.
Attention tends to concentrate at the “beginning” and “end.” LLMs are prone to “Lost in the Middle.” Critical information should be placed at the beginning or end.
The Self Attention mechanism computes relevance weights between tokens. Using consistent terminology helps the LLM effectively connect information scattered across the context.
Autoregressive token-by-token generation means earlier tokens influence the probability distribution of subsequent outputs. This means “generate intermediate results before generating the final result” often outperforms “generate the final result directly.” This is the basis for Chain-of-Thought effectiveness.
In-Context Learning is possible — given Few-Shot Examples, LLMs can learn similar patterns.
LLMs already possess substantial world knowledge. For common information in training data, a brief mention often suffices. For unfamiliar information, more description is needed.
Structured formats are well understood — XML, Markdown, JSON have been heavily represented in training data.
Different messages in the context have a hierarchy: System Prompt, User Message, Developer Message, and Tool Results each have different effects.

Different models sometimes exhibit different behavioral styles.

For instance, GPT-series models have relatively stronger long-context capability, while Claude is comparatively weaker. You can check Context Arena for an initial overview.

Model knowledge can also differ significantly. For example, the Opus series models have powerful Multi-Agent capabilities and can understand the intent of your Multi-Agent harness with minimal prompting. GPT models, on the other hand, are relatively weaker in this regard and may require more detailed instructions and descriptions.

Smaller models generally need more complete and detailed guidance to follow instructions, while sufficiently powerful models can even achieve Zero-Shot capability — simply telling them the general background and workflow is enough for them to organize and execute on their own.

Therefore, Prompt Engineering is rarely a one-shot optimization. It requires continuous adjustment and refinement across different models, combined with evaluation, to reach peak performance.

Prompt Engineering Principles

Based on the understanding above, we can derive some principles.

Provide More Input

Be Explicit, Not Implicit. Don’t assume the model understands your implied intent. Between humans, there is a vast amount of shared understanding, but the model has none of that. What you consider “obvious” may not be obvious to the model at all.
Show, Don’t Tell. Giving examples is often more effective than describing rules. Rather than writing a long paragraph explaining the output format you want, just provide an example directly.
Provide Context, Not Just Instructions. Tell the model “why” you’re doing something, not just “what” to do. Background information helps the model make better judgments in ambiguous situations.
Leverage Model’s Knowledge. For domains the model already knows well, be brief. Save precious context for unknown information — internal specifications, private APIs, business logic.

Align with Human Needs

Define the Role. Role setting may seem like voodoo, but it effectively narrows the model’s output distribution — a “Senior Security Engineer” and a “Junior Intern” will produce answers with different tendencies.
Specify Output Format. Make clear what format you want: JSON, Markdown, code, plain text. Models are good at following format constraints, but they need explicit instructions.
Set Boundaries. Clearly tell the model its scope — what it should do and what it should not do.
Handle Uncertainty Explicitly. Tell the model what to do when uncertain — ask the user, refuse to execute, or offer a conjecture. Without guidance, LLMs tend to hallucinate.
Decompose Complex Tasks. Breaking complex tasks into clear steps or subtasks guides agent execution and reduces reasoning difficulty.

Engineering Level

Concise and Effective. Optimize your expression — use fewer tokens to convey information efficiently.
Structure Your Prompt. Use XML tags, Markdown, and other structured formats to organize your prompt — clear and easy to understand.
Order Matters. Leverage attention distribution characteristics — place critical instructions at the beginning or end.
Progressive Disclosure. Information not needed immediately can be retrieved later. This has two benefits: it saves initial context, and it allows needed information to enter the context at the end, gaining higher attention weight. Similarly, embedding instructions in Tool Results is an effective technique.
Terminology Consistency. Using consistent terminology helps the model connect context and strengthen attention. On the flip side, when you need clear distinction and reduced interference, using rare or different terminology can also direct the model’s attention.
Use Direct, Assertive Language. “You MUST” and “You should” / “You can” have significantly different effects — use the appropriate tone for different scenarios. Instructions that must be followed should use strong constraints.
Iterate with Evaluation. Prompt Engineering is an experimental science, not something you write correctly the first time. It needs evaluation-driven iteration, especially when switching models.

Model Examples

Different models have different needs. Here we analyze their Best Practices through official documentation.

Claude

Reference: Prompting best practices.

Claude is smart, but has zero context. Therefore, you must provide sufficient information.

XML Tags are first-class citizens. Claude excels at understanding XML tags like <instructions>, <context>, <example>, <document>, etc. — nesting works too.

Few-shot Examples should be wrapped in <example> or <examples> to help Claude distinguish instructions from examples. The official recommendation is 3–5 examples.

Claude’s long-context capability is relatively weak, so careful arrangement and organization are needed.

Place long documents at the beginning of the prompt and queries/instructions at the end, wrapping documents in <document>. In testing, this can improve response quality by up to 30%.

For long-document tasks, ask Claude to first quote relevant passages from the source (into <quotes>), then answer based on the quotes. This helps the model locate key information amid noise.

Generally, explicit CoT guidance is unnecessary. A simple “Think Thoroughly” often works better than a hand-written Step-by-Step plan. Claude’s reasoning ability often exceeds human-prescribed steps.

Few-shot Examples can include <thinking> tags — Claude will generalize this reasoning pattern into its own thinking blocks.

Self-check is highly effective. Appending “Before you finish, verify your answer against [criteria]” shows significant improvement in Coding and Math tasks.

A major shift in Opus 4.6 compared to Opus 4.5: the model now acts more proactively — calling tools, using subagents, thinking — sometimes overly so, to the point where you need to constrain it.

Old Style: encouraging the model to use tools.

CRITICAL: You MUST use search_tool whenever the user asks a question. If in doubt, use the tool.

New Style: constraining the model to use tools only when appropriate.

Use search_tool when it would enhance your understanding of the problem.

Similar logic applies to subagents.

Overengineering: the model tends to create too many files, add excessive abstractions, and write defensive code for non-existent problems. Explicit scope constraints are needed to prevent this.

Overthinking: at high reasoning effort, Opus 4.6 performs extensive upfront exploration. Sometimes it helps to guide and appropriately lower the thinking intensity.

Choose an approach and commit to it. Avoid revisiting decisions unless new information directly contradicts your reasoning.

Claude now distinguishes between “suggest” and “change” — its default is to conservatively not execute. Prompts can guide whether the default should be to execute or not.

Tips for controlling output format:

Positive instructions outperform negative constraints. “Your response should be composed of flowing prose paragraphs” is more effective than “Do not use markdown.”
The format of the prompt influences the output format. If you want Markdown output, the prompt itself should ideally also use Markdown.
Opus 4.6 defaults to using LaTeX for mathematical content. If you don’t need this, you must explicitly disable it.
Structured Outputs can guarantee the output is valid, schema-compliant JSON.

There is also Context Awareness: the model can sense remaining context space and no longer works blindly. The model receives <budget:token_budget>200000</budget:token_budget> at the start of a conversation, and after each tool call receives <system_warning>Token usage: 35000/200000; 165000 remaining</system_warning>. The model can act accordingly.

Claude 4.6 is extremely adept at discovering state from the filesystem — sometimes starting from scratch is more effective than compaction.

You can leverage files to persist state.

Claude 4.6 may proactively perform irreversible operations (deleting files, force pushing, sending messages). The model does not automatically determine which operations need confirmation — you must list them explicitly. The official recommendation is to classify by “reversibility” and “blast radius”: local reversible operations (editing files, running tests) can proceed freely; irreversible or externally-visible operations (pushing, deleting branches, sending external messages) should require confirmation.

GPT

Reference: OpenAI Prompt Guidance.

OpenAI’s guidance primarily consists of various examples.

OpenAI distinguishes Message Roles, including Developer/User/Assistant, with priority in descending order.

Developer Messages are recommended to be organized in the order: Identity → Instructions → Examples → Context.

Most provided examples use XML-block format to establish a kind of contract.

For example, an output contract:

<output_contract>
- Return exactly the sections requested, in the requested order.
- Apply length limits only to their intended section.
- If format required (JSON/Markdown/XML), output ONLY that format.
</output_contract>

You can use forced completion checks and self-verification loops to prevent the agent from giving up halfway.

<completeness_contract>
- Treat the task as incomplete until all requested items are covered or explicitly marked [blocked].
- Keep an internal checklist of required deliverables.
- For lists or paginated results: track processed items, confirm coverage before finalizing.
</completeness_contract>
<verification_loop>
Before finalizing:
- Check: does output satisfy every requirement?
- Check: are claims backed by provided context?
- Check: does formatting match schema?
- Check: any external side effects need permission?
</verification_loop>

Similarly, explicit instructions are needed to prevent guessing when information is insufficient:

<missing_context_gating>
- If required context is missing, do NOT guess.
- Prefer the appropriate lookup tool when the missing context is retrievable.
- If you must proceed, label assumptions explicitly.
</missing_context_gating>

When context is relatively short and insufficient, GPT may choose the wrong tool, often requiring guidance:

<dependency_checks>
- Before taking an action, check whether prerequisite discovery, lookup, or memory retrieval steps are required.
- Do not skip prerequisite steps just because the intended final action seems obvious.
</dependency_checks>

For empty results from tool calls, GPT may prematurely conclude that empty means nonexistent.

<empty_result_recovery>
If lookup returns empty or partial results:
- Do not immediately conclude none exist.
- Try 1-2 fallback strategies (alternate query, broader filters, prerequisite lookup) before reporting failure.
</empty_result_recovery>

Sometimes you may need to encourage tool calls:

<tool_persistence_rules>
- Do not stop early when another tool call is likely to materially improve correctness or completeness.
- If a tool returns empty or partial results, retry with a different strategy.
</tool_persistence_rules>

GPT reasoning models are not recommended to include examples in Function Definitions — this degrades performance. A rather counterintuitive point that may warrant future testing.

The official article suggests that reasoning effort should be a last resort. The priority should be prompt guidance and context: refine the prompt itself, add completeness contracts, verification loops, tool persistence rules. Only increase effort when still unsatisfied. Most teams should default to using none, low, or medium. That said, I personally believe that in Coding Agent scenarios, starting with high/xhigh is perfectly fine.

To combat hallucination, explicit citations are needed:

<citation_rules>
- Only cite sources retrieved in the current workflow.
- Never fabricate citations, URLs, IDs, or quote spans.
- Attach citations to specific claims, not only at the end.
</citation_rules>

Combined with Grounding Rules: if sources conflict, explicitly state the conflict and attribute. If context is insufficient, narrow the answer scope but don’t guess wildly.

GPT’s long-context performance is relatively stable.

You can use prompts to make GPT more proactive about doing the work directly — similar to the Claude section, but with a somewhat different style.

<autonomy_and_persistence>
Unless the user explicitly asks for a plan or brainstorm,
assume the user wants you to make code changes, not just analysis.
Persist until the task is fully handled end-to-end.
</autonomy_and_persistence>

Conclusion & Skill

The content above can be summarized into a Skill, so that agents will know how to write better prompts in the future.

---
name: prompt-engineer
description: Guide for writing effective instructions, specifications, and context for LLM consumption. Covers every artifact type where one system communicates intent to an LLM — subagent task prompts, agent definitions (.claude/agents/), skill files (SKILL.md), MCP tool descriptions, project instructions (CLAUDE.md/AGENTS.md), system/developer prompts, evaluator prompts, few-shot sets, and structured context passing. Use this skill whenever crafting, rewriting, auditing, or debugging any text intended for an LLM to consume and follow — whether delegating to a subagent, defining a tool contract, writing project conventions, coordinating a multi-agent team, or structuring context for another model. Also use when the output quality of an LLM artifact feels off and you suspect the instructions are the bottleneck.
---

# Prompt Engineering

A prompt is a behavior contract between you and a system with limited attention and no shared memory. This applies to every artifact an LLM consumes: system prompts, subagent task descriptions, agent definitions, skill files, MCP tool descriptions, project instructions, team coordination protocols, evaluator criteria, and structured context. Write the minimum effective specification.

**When you are the author:** If you are an AI agent writing an artifact for another LLM — spawning a subagent, defining a tool, drafting a CLAUDE.md — apply this guide to your own output. Read the relevant artifact pattern, structure your output accordingly, and cold-read test it before delivering.

When principles conflict: safety > correctness > conciseness.

## Diagnose the real problem

Before touching text, determine whether the actual cause is:
- missing, wrong, or stale context
- overloaded context drowning signal
- poor retrieval or bad tool contract
- missing prerequisite steps or workflow decomposition
- conflicting instructions across message roles or files
- weak output contract or missing uncertainty handling
- inappropriate autonomy or permission boundaries
- artifact type mismatch (wrong format for the consumer)
- genuine wording problem

Do not solve context, retrieval, or tooling failures with better adjectives.

## Gather requirements

Do not start drafting until you know:
- **Artifact type**: system prompt, developer prompt, user prompt, subagent task, agent definition, skill file, MCP description, project instructions (CLAUDE.md), team protocol (AGENTS.md), evaluator prompt, few-shot set.
- **Consumer**: which model(s) and in what context (standalone, tool-using, agentic, multi-agent).
- **Target model**: Claude Opus 4.6, GPT 5.4, or cross-model.
- **Success condition**: what observable outcome defines done.
- **Audience**: who or what consumes the output downstream.
- **Constraints**: available tools, context budget, output format, hard boundaries, approval requirements.
- **Information gap**: what the model must be told vs. what it already knows.
- **Failure policy**: what should happen when context is missing, tools fail, or the model is uncertain.

If critical information is missing, ask for it. If proceeding under assumptions, state them.

## Principles

**Be explicit.** State every requirement, constraint, and edge case directly. Include background that changes judgment — omit the rest. The model fills gaps with its own priors, not yours.

**Handle uncertainty.** Specify behavior when context is missing or ambiguous: ask, retrieve, refuse, flag assumptions, narrow scope. This is the primary lever against hallucination. If citations are required, require grounded sources — never allow fabricated references.

**Show, don't just tell.** Examples outperform rules when behavior is easier to demonstrate than describe. Good examples are minimal, representative, and include edge cases. Bad examples — verbose, unrepresentative, or contradicting rules — are worse than none.

**Decompose.** Stage complex tasks with intermediate artifacts (evidence extraction, classification, checklists, candidate selection). Intermediate outputs improve final quality. Prefer light guidance over rigid CoT for strong models.

**Specify the output.** Format, structure, ordering, length, schema. For machine-consumed output, require exact schemas. For human-consumed output, constrain only what matters.

**Maximize signal density.** Every token should change behavior or provide necessary context. Spend tokens on private knowledge — internal APIs, business rules, proprietary conventions — not common knowledge. A longer artifact is justified when added tokens carry real constraints, context, or examples. Remove filler and duplication ruthlessly.

**Calibrate language.** "Must" for invariants, "should" for defaults, "may" for options. Marking everything MUST/CRITICAL/ALWAYS dilutes actual priorities into noise. Explain *why* behind non-obvious rules — LLMs respond better to reasoning than rigid directives.

**Order by attention.** Long reference material at the top, critical instructions at the beginning or end. Never bury important requirements in the middle. For artifacts with deferred loading (skills, agents), put triggering-critical info in the always-loaded portion.

**Use the right channel.** Durable policy in system/developer messages or project-level files (CLAUDE.md). Task-specific input in user messages or task prompts. Retrieved facts in tool results. Resolve conflicts rather than stacking contradictory rules across channels.

**Match the artifact to the consumer.** A subagent task prompt needs a clear completion condition and output location. An MCP description needs trigger clarity and parameter semantics. A CLAUDE.md needs durable conventions, not session-specific instructions. Choose the right artifact type for the information's lifecycle and audience.

## Build the artifact

Use the lightest structure that reliably induces the target behavior. Structure reduces ambiguity — it is not a ritual.

Include only sections that earn their tokens. General scaffold for prompt-type artifacts:

```text
<role>
Perspective or expertise — only if it changes the output.
Do not substitute roleplay for instructions.
Useful: "senior security reviewer." Low value: "world-class genius assistant."
</role>

<objective>
Task and observable completion condition.
Prefer "return valid JSON matching the schema" over "write a great answer."
</objective>

<context>
Background that changes decisions. Nothing else.
</context>

<tools>
Name, purpose, preconditions, limits, return schema.
When to use, when not to, and retry behavior.
</tools>

<workflow>
Staged steps when decomposition improves reliability.
Prefer observable intermediate artifacts over verbose chain-of-thought.
</workflow>

<rules>
Required behavior, prohibited behavior, actions needing confirmation.
Prefer positive specifications over negative-only rules.
State non-goals explicitly.
</rules>

<output_contract>
Format, structure, ordering, length, schema.
</output_contract>

<uncertainty_policy>
Behavior when context is missing or ambiguous.
</uncertainty_policy>

<examples>
Minimal, representative input-output pairs.
Include contrastive examples when boundaries matter.
Delimit clearly from live input.
</examples>

<verification>
Self-checks before finalizing: coverage, correctness, format, assumptions, permissions.
</verification>
```

Keep terminology consistent — same word for same concept, deliberately different words for different concepts. Parameterize values that change per invocation (paths, model names, thresholds) so the artifact is reusable.

## Artifact patterns

Each artifact type has its own shape. Apply the general principles above, then the type-specific patterns below.

### Subagent task prompts

A subagent gets one shot with limited context. It does not share your conversation history. Front-load what matters.

1. **Task** — what to do, in one sentence
2. **Context** — background the subagent lacks (file paths, architecture, conventions, decisions already made)
3. **Scope** — what to touch and what to leave alone
4. **Output** — where to save results, what format, what constitutes done
5. **Constraints** — tools available, files off-limits, time/token budget

**Good:** "Refactor all Express route handlers in /src/routes/ to use the new AuthMiddleware from /src/middleware/auth.ts. Do not modify test files. Save a summary of changes to /tmp/refactor-report.md. Done when all routes compile and existing tests pass."

**Bad:** "Refactor the auth system to be better. Let me know what you find."

Pitfalls: assuming shared context (the subagent has none), omitting output location, vague completion criteria ("improve it" vs "all tests pass and no new lint errors"), overloading one subagent with unrelated tasks.

### Agent definitions (`.claude/agents/`)

A reusable role specification. Loaded fresh each invocation with zero prior context.

1. **Identity** — what this agent does, one paragraph
2. **Capabilities** — tools available, what it can and cannot do
3. **Workflow** — default approach (investigate, plan, act, verify)
4. **Boundaries** — what requires escalation, what's autonomous, what's forbidden
5. **Communication** — how it reports status, asks for help, hands off work
6. **Quality bar** — what "done" looks like for this agent's typical tasks

Write for cold start. Everything the agent needs to orient must be here or discoverable from here.

### Skill files (SKILL.md)

A skill loads on demand. The description triggers loading; the body guides execution.

**Description (frontmatter):** What the skill does AND when to use it — both matter for triggering. Include concrete trigger phrases and adjacent use cases. Err toward over-triggering; it's easier to not-use a loaded skill than to miss loading one.

**Body:** Under 500 lines. Imperative instructions — tell the model what to do, not what it is. Include examples for behaviors easier to show than describe. Explain *why* behind non-obvious instructions. Use bundled `references/` for overflow. Parameterize environment-specific values.

### MCP tool descriptions

The description is the model's only signal for when to use a tool. Encode both trigger conditions and usage contract.

1. **What it does** — one sentence, precise verb
2. **When to use it** — specific scenarios and trigger conditions
3. **When NOT to use it** — adjacent scenarios that need a different tool
4. **Parameters** — name, type, description, constraints, defaults
5. **Return value** — shape, fields, edge cases (empty results, errors)
6. **Preconditions** — what must be true before calling

**Good description:** "Search internal documentation by keyword. Use when answering questions about company policies, architecture, or runbooks. Do NOT use for general knowledge questions or code search — use web_search or code_search instead."

**Bad description:** "A tool that searches documents."

Under ~200 words. Models skim tool lists; dense beats verbose.

### Project instructions (CLAUDE.md/AGENTS.md)

Durable conventions for an entire codebase. Loaded every session.

1. **Project overview** — what this codebase is, 2-3 sentences
2. **Architecture** — key directories, major components, data flow
3. **Development conventions** — language, framework, formatting, naming, commit style
4. **Build and test** — exact commands, not "run the tests"
5. **Boundaries** — files/patterns to avoid, things requiring human approval
6. **Common patterns** — how the team solves recurring problems here

**Good build command:** `pnpm --filter @app/frontend test -- --coverage`
**Bad build command:** "Run the frontend tests with coverage"

Pitfalls: session-specific instructions (belong in user messages), stale info, excessive rules without rationale, contradicting model built-in behaviors without good reason.

### Team coordination

How multiple agents work together. Must prevent conflicts and duplication.

1. **Team composition** — roles, capabilities, what each agent owns
2. **Task flow** — how work gets assigned, decomposed, handed off
3. **Communication protocol** — when to message, what to include, how to escalate
4. **Shared state** — where artifacts live, naming conventions, conflict resolution
5. **Completion criteria** — how the team knows the project is done
6. **Failure handling** — what happens when an agent is blocked, fails, or produces bad output

The critical challenge: preventing agents from duplicating work or making conflicting changes. Be explicit about ownership boundaries and synchronization points.

## Common anti-patterns

Recognize these in your own output and in artifacts you audit:

- **Adjective engineering.** "You are an incredibly thorough, world-class expert" adds zero behavioral change. Replace with specific instructions.
- **MUST/ALWAYS overload.** When everything is critical, nothing is. Reserve strong language for actual invariants; explain reasoning for the rest.
- **Contradictory pairs.** "Be concise but comprehensive." "Be creative but follow the template exactly." Pick one or specify the priority.
- **Assumed shared context.** Subagents, agents, and skills start cold. If they need file paths, architecture decisions, or conversation history, provide it explicitly.
- **Vague completion.** "Make it better" or "improve the code" gives the model no way to know when to stop. Specify an observable condition: "all tests pass," "lint errors reduced to zero," "output matches this schema."
- **Filler sections.** Sections that restate common knowledge or model defaults waste attention budget. Only include what changes behavior.
- **Negative-only rules.** "Don't do X, don't do Y, don't do Z" without saying what to do instead. Pair prohibitions with positive specifications.

## Adapt to task type

**Structured extraction or classification**: Define the schema or label set exactly. Include abstain/unknown when appropriate. Forbid extra prose if machine-consumed.

**Long-context tasks**: Source material first, query and instructions at the end. Extract evidence before synthesis when recall matters.

**Agentic and tool-using tasks**:
- Classify actions by reversibility — list explicitly which require confirmation.
- Specify retry and fallback behavior. Require at least one alternative before concluding failure.
- Capture baseline before acting so the agent can detect regressions.
- Define done, blocked, and when to ask rather than guess.
- Constrain scope to prevent overengineering and speculative exploration.

**Open-ended generation**: Specify audience, purpose, and failure modes. Do not overconstrain creative tasks unless consistency matters more than originality.

## Model adaptation

### Claude (Opus 4.6, Sonnet 4.6)

- XML tags for structure (`<context>`, `<rules>`, `<output>`)
- Documents and reference material first, instructions last — recency bias gives later instructions more weight
- Light reasoning guidance ("Think thoroughly about X") over rigid step-by-step plans
- Constrain proactive behavior with "use X when it would help" rather than "you MUST always use X" — overcompliance with hard rules creates rigidity
- Extended thinking available for complex reasoning — nudge with "think carefully" rather than prescribing CoT format
- Trust Claude to choose tools appropriately; describe tools precisely but don't over-prescribe usage

**Claude-specific example** — Instead of:
```
You MUST ALWAYS check for null values. You MUST NEVER skip validation.
```
Write:
```
Check for null values at system boundaries (user input, API responses) where they indicate missing data. Internal function calls between trusted modules can assume valid inputs.
```

### GPT (5.4 / Codex)

- Contract-style blocks (output_contract, completeness_contract, verification_loop) enforce thoroughness
- Encode prerequisite discovery steps explicitly — GPT benefits from structured workflows more than Claude
- For reasoning models: no examples in tool definitions — put them in the prompt body
- Optimize prompts before increasing reasoning effort — better instructions beat more thinking tokens
- JSON mode: use response_format with strict schemas for structured output
- Tool use: explicit priority ordering helps when multiple tools could apply
- Codex: include file structure context upfront; explicit "edit these files, don't create new ones" instructions

**GPT-specific example** — Instead of:
```
Review the code for issues.
```
Write:
```
## output_contract
Return a JSON array of findings, each with: file, line, severity, description, fix.
## verification_loop
After generating findings, re-read each file and confirm no findings were missed.
```

### Cross-model compatibility

When artifacts must work across both model families:
- Use markdown headers and clear section delimiters (both handle well)
- Avoid model-specific features (XML tags, JSON mode) in shared artifacts
- Write in plain imperative prose — universally effective
- The most portable format: clear markdown with descriptive headers, explicit examples, unambiguous output specifications
- Test on both targets — prompts that work on Claude may fail on GPT and vice versa

## Audit and revise

When an artifact isn't working, check for:
- hidden assumptions or missing success condition
- conflicting instructions across sections, message roles, or files
- missing output contract, uncertainty policy, or tool preconditions
- critical instructions buried in the middle of long context
- overloaded or irrelevant context
- inconsistent terminology
- examples that teach the wrong pattern
- missing safety or approval boundaries
- wrong artifact type for the use case
- no evaluation plan

Fix the smallest set of causes that explains the failure. Do not rewrite what works.

When revising:
1. Preserve the original intent.
2. Remove redundancy.
3. Resolve contradictions.
4. Convert vague preferences ("be helpful") into observable behaviors.
5. Replace negative-only rules with positive specifications.
6. Cold-read test: read the artifact as the target model seeing it for the first time. Would you know what to produce, what matters, what is optional, and what to do when blocked?
7. The revised artifact should have higher signal density. If longer, every added token must earn its place.

## Evaluate

For important or reusable artifacts, define a compact eval set covering:
- normal cases
- edge cases
- ambiguity cases
- tool-failure cases (if agentic)
- permission-boundary cases (if agentic)
- format compliance cases
- cross-model behavior (if targeting multiple models)

Revise against observed failures, not intuition. Re-test when changing models, tools, or surrounding workflow.

## Deliverable

When producing or revising an LLM-consumable artifact, return:
1. The artifact, ready to use.
2. Key assumptions or missing context.
3. Brief rationale for structural choices.
4. Eval cases for important or reusable artifacts.
5. Model-specific notes if targeting a particular model.