The Complete Guide to Claude Code Agent Optimization
Claude Code is powerful, but every token costs money. When you’re running multi-agent pipelines that spawn dozens of subagents per feature, the difference between a well-structured setup and a naive one can be hundreds of dollars per week.
This guide covers everything we learned optimizing agent definitions across 9 production repositories, reducing agent sizes by 60-87% and enabling cross-agent cache sharing that wasn’t possible before.
How Claude’s prompt cache actually works
Every time you send a message, Claude receives the entire conversation — system prompt plus all messages, from the very first turn. The conversation payload grows monotonically with every turn.
Caching avoids re-processing the identical beginning of that payload. Anthropic’s servers compare the incoming byte stream against recent requests. If the exact bytes match from position 0, those bytes are a cache hit: 10% of the normal input cost and faster processing. The moment a single byte differs, the cache stops, and everything after that point is processed at full price.
This is byte-prefix matching, not file-level deduplication. Two requests that both contain shared-rules.md at different byte offsets do NOT share cache on that content. The cache only knows “are these bytes identical to the previous request, starting from byte 0?”
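The distinction can be made concrete with a toy model of the matcher. This is illustrative only (the real matcher is server-side and opaque); it just shows why identical content at different byte offsets contributes nothing:

```python
# Toy model of byte-prefix cache matching. Illustrative only -- the real
# matcher is server-side; this just shows why shared content at different
# offsets cannot hit the cache.

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the identical prefix, starting from byte 0."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

shared = b"# shared-rules.md\nAlways run the linter.\n"

# Same content, same offset: the whole shared block is in the common prefix.
req_a = b"HARNESS\n" + shared + b"agent: web-dev\n"
req_b = b"HARNESS\n" + shared + b"agent: backend-dev\n"
print(common_prefix_len(req_a, req_b))  # everything up to the differing agent name

# Same content, different offsets: the match dies at the first differing byte,
# and the identical shared block downstream contributes nothing.
req_c = b"agent: web-dev\n" + shared
req_d = b"agent: backend-dev\n" + shared
print(common_prefix_len(req_c, req_d))  # just the "agent: " bytes
```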
Minimum thresholds
The prefix must be long enough for caching to activate:
- Opus 4.6: 4,096 tokens minimum
- Sonnet 4.6 / Haiku 4.5: 2,048 tokens minimum
Below these thresholds, caching silently does nothing. No error, no warning — you just pay full price every turn.
What this means practically
The order of content in the system prompt determines how much of it caches. Shared content early = maximum cache sharing. Unique content early = cache divergence that prevents sharing of everything after it.
The system prompt loading order
Claude Code builds the system prompt in layers. Each layer is bytes appended after the previous. The ordering is:
Layer 1: Claude Code harness (~15,000 tokens)
Layer 2: CLAUDE.md + .claude/rules/ (project-specific, shared across agents)
Layer 3: Agent body (unique per agent type)
Layer 4: Messages / conversation (unique per session)
What caches across what
Layers 1-2 are identical for every agent in the same repo. Two different agent types (web-dev and backend-dev) both get the harness + CLAUDE.md + rules as their prefix. This caches automatically.
Layer 3 is where agents diverge. A web-dev agent has different body text than a backend-dev agent. From this point on, the cache is per-agent-type.
Layer 4 is per-session. Each conversation has its own message history.
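Under this layering, the divergence point between any two agents in the same repo falls exactly at the start of Layer 3. A sketch (layer contents are placeholders):

```python
# Sketch of the four-layer assembly. Layer contents are placeholders; the
# point is that two agents share bytes exactly through the end of Layer 2.
HARNESS = "harness " * 100            # Layer 1: identical everywhere
PROJECT = "CLAUDE.md + rules " * 50   # Layer 2: identical per repo

def system_prompt(agent_body: str, messages: str) -> str:
    return HARNESS + PROJECT + agent_body + messages  # Layers 1-4, in order

web = system_prompt("web-dev body", "session A msgs")
backend = system_prompt("backend-dev body", "session B msgs")

shared = 0
for a, b in zip(web, backend):
    if a != b:
        break
    shared += 1

# The shared (cacheable) prefix is exactly Layers 1+2; the bodies diverge
# at their first byte, so nothing after Layer 2 can cache across agents.
print(shared, len(HARNESS) + len(PROJECT))
```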
The caching tiers
| Tier | Content location | Caches across |
|---|---|---|
| Tier 1 | .claude/rules/*.md | All agent types, all turns |
| Tier 2 | Agent body (.claude/agents/*.md) | Same agent type, across turns |
| Tier 3 | Read results in messages (startup checklists) | Within same session only |
| Tier 4 | On-demand Read results | After the turn they’re loaded |
Moving content from Tier 3 to Tier 1 is the single biggest optimization available. It costs nothing to implement and the savings compound across every agent spawn.
The “Read shared-rules in the startup checklist” trap
This is the most common anti-pattern we found. Many agent definitions include a startup checklist like:
## Startup Checklist
1. Read `shared-rules.md`
2. Read `WORKING-AGREEMENTS.md`
3. Read `SESSION-GUIDE.md`
4. Show open tasks
The problem: when the agent Reads these files, the content lands in the message history (Tier 3), not in the system prompt (Tier 1). Since different agent types have different bodies (Layer 3), the cache has already diverged before reaching the messages. The shared-rules content cannot cache across agent types even though it’s identical.
The fix
Put shared content in .claude/rules/ instead:
.claude/rules/
shared-rules.md -> symlink to shared source
working-agreements.md -- project conventions
session-guide.md -- session architecture
project-context.md -- domain, tools, layout
Files in .claude/rules/ are auto-loaded into the system prompt at Layer 2 — before the agent body divergence point. No Read calls needed. Every agent type gets this content cached for free.
Result from our migration: ~6,400 tokens of universal content moved from Tier 3 to Tier 1. A pipeline run spawning 6 subagents saved ~38,400 tokens of redundant processing per run.
What goes where: the placement decision
| Content type | Put it in | Why |
|---|---|---|
| Cross-project rules | .claude/rules/ (symlink to shared source) | Tier 1 — cached across all agents |
| Project conventions | .claude/rules/working-agreements.md | Tier 1 — every agent follows these |
| Domain identity, tool references | .claude/rules/project-context.md | Tier 1 — every agent needs context |
| User preferences (accessibility, tooling) | .claude/rules/preferences.md (symlink) | Tier 1 — affects all agent output |
| Product summary | CLAUDE.md | Tier 1 — auto-loaded, cached |
| Role identity and judgment | Agent body (.claude/agents/) | Tier 2 — cached per agent type |
| Collaboration boundaries | Agent body (“What You Don’t Do”) | Tier 2 — per-agent routing |
| Reference data (tool syntax, file layouts) | On-demand Read | Tier 3-4 — loaded when needed |
| Task-specific data | Rally/Jira tools, logs | Not cached |
The key principle
Put shared content FIRST, unique content LAST. The cache matches from byte 0. Every byte of shared content before the divergence point is free across all agent types after the first call.
Separating persona from data
We audited 24 agent definitions across our repos and found a consistent pattern: most agents were carrying 50-70% data/reference content that didn’t belong in the agent body.
The persona/data audit method
For each section of an agent file, mark it as one of:
- Persona/judgment — role identity, collaboration boundaries, decision heuristics, failure modes. Keep in agent body.
- Data/reference — file paths, tool syntax, directory layouts, doc tables. Move to `.claude/rules/` or load on demand.
- Redundant — content that restates rules already in `.claude/rules/`. Delete.
Results from real agents
| Agent | Before | After | Reduction | Category breakdown |
|---|---|---|---|---|
| Coach (orchestrator) | 2,785 tokens | 1,030 | 63% | 27% persona, 54% data, 10% redundant |
| Competitive Marketing | 4,882 tokens | 620 | 87% | 9% persona, 45% data, 25% procedure |
| Web Dev | 866 tokens | — | Already efficient | Mostly persona |
The heuristic
The more an agent does judgment, the more belongs in the body. The more it does procedure, the more belongs in command specs or rules.
- Judgment-heavy agents (standards expert, UX reviewer) — already efficient. Their content is irreducible role knowledge.
- Procedure-heavy agents (coach, orchestrator, marketing) — high trim potential. Their bulk is process specs and reference data that should live elsewhere.
- Mixed agents (web-dev, backend-dev) — moderate trim, usually ~30%.
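The audit itself reduces to simple bookkeeping. A toy version, with invented section names and token counts purely for illustration:

```python
# Toy version of the persona/data audit: classify each section, keep only
# persona/judgment, and report the reduction. Section names and token
# counts are invented for illustration, not taken from a real agent.
sections = {
    "Role identity":         ("persona", 150),
    "Startup checklist":     ("persona", 120),
    "Tool syntax cookbook":  ("data", 900),
    "Directory layout":      ("data", 400),
    "Restated shared rules": ("redundant", 300),
}

before = sum(tokens for _, tokens in sections.values())
after = sum(tokens for kind, tokens in sections.values() if kind == "persona")
print(f"{before} -> {after} tokens ({100 * (before - after) // before}% reduction)")
```

Data sections move to `.claude/rules/`, redundant sections are deleted outright, and only the persona total remains in the agent body.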
What belongs in an agent body (and nothing else)
- Role identity (2-3 sentences) — what this agent is and what success looks like
- Startup checklist (3-5 items) — only agent-specific reads. Note that rules are auto-loaded.
- What You Own — responsibility boundaries
- What You Don’t Do — handoff routing to other agents. This is collaboration glue, not access control.
- Judgment rules — things NOT in shared rules. The stuff that makes this agent different.
- Key processes — one-line summaries with pointers to specs, not the specs inlined
What does NOT belong
- Content already in `.claude/rules/`
- Tool syntax cookbooks (put in `rules/project-context.md`)
- Directory layouts (put in `rules/project-context.md`)
- Full process specifications (point to the command file)
- Content that restates shared rules by number
- Transient migration notes
Agent frontmatter: the overlooked optimization lever
The .claude/agents/ format supports frontmatter fields that enforce behavior structurally — more reliable than prose instructions and zero extra tokens at runtime.
Key fields
---
name: web-dev
description: "Front-end development. Handles UI, components, responsive
design, accessibility, and integration tests. Spawned for Web Dev tasks."
tools: Bash, Read, Write, Edit, Glob, Grep, TodoWrite
model: inherit
color: green
memory: project
---
description — This is what Claude reads when deciding whether to delegate to this agent. Write it like “when to use this,” not a role summary. More important than the body for routing decisions.
tools — Whitelist of available tools. This is structural enforcement of role boundaries. A Coach that shouldn’t write code simply doesn’t have Edit in its tools list. A UX reviewer that shouldn’t spawn subagents doesn’t have Agent. Better than prose rules like “Don’t write code” because it’s impossible to violate.
memory — Persistent memory across spawns. Options: project (committed to repo), local (gitignored), user (per-machine). Use project for agents that accumulate learnings. Subagents do NOT inherit the parent session’s memory — they start fresh unless this field is set.
maxTurns — Prevents runaway agents. Set based on expected task complexity.
effort — Override reasoning depth. low for simple tasks saves tokens.
The subagent memory gap
This one catches people. Auto-memory (your MEMORY.md and memory files) is loaded into the main session’s system prompt. When the main session spawns a subagent, that subagent gets a fresh context — it does NOT inherit the parent’s memory.
This means preferences saved to memory (“Charles is colorblind,” “use uv not pip”) are invisible to subagents. They need to be in one of:
- `.claude/rules/` — auto-loaded for everyone, including subagents
- The task prompt — explicitly stated when spawning the subagent
- Agent-specific memory — via the `memory` frontmatter field (separate memory per agent type)
We moved five preference-type memories to .claude/rules/ to close this gap. Subagents now see them automatically.
Measuring cache performance with OpenTelemetry
Theory is nice, but you need to verify caching is working. Claude Code has built-in OpenTelemetry support that reports cache performance on every API call.
Setup
# In ~/.claude/settings.json env block:
CLAUDE_CODE_ENABLE_TELEMETRY=1
CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_LOG_TOOL_DETAILS=1
Point it at a local Jaeger instance (Docker one-liner) and you get full trace visibility.
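The Docker one-liner is roughly the following. The image name and ports are the standard Jaeger all-in-one defaults (OTLP gRPC ingest on 4317, UI on 16686); `COLLECTOR_OTLP_ENABLED` is required on older Jaeger versions and harmless on newer ones where OTLP is on by default:

```shell
# Local Jaeger with OTLP ingest on 4317 (gRPC) and the web UI on 16686.
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 -p 4317:4317 \
  jaegertracing/all-in-one:latest
# UI: http://localhost:16686
```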
Key fields to watch
Every claude_code.api_request event includes:
- `input_tokens` — total input tokens
- `output_tokens` — total output tokens
- `cache_read_tokens` — how many input tokens were a cache hit
- `cache_creation_tokens` — how many built new cache
High cache_read_tokens on a subagent’s first API call = the shared prefix is caching across agent types. Low = something is busting the cache.
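A quick way to act on those fields is to compute a hit ratio per request and flag cold spawns. The event shape below is reduced to just the counters discussed above; real OTel events carry more fields:

```python
# Flag api_request events whose input was mostly uncached. Events here are
# reduced to the token counters discussed above; real events carry more.
def cache_hit_ratio(event: dict) -> float:
    total = event["input_tokens"]
    return event["cache_read_tokens"] / total if total else 0.0

events = [
    {"input_tokens": 20_000, "cache_read_tokens": 18_500, "cache_creation_tokens": 1_500},
    {"input_tokens": 20_000, "cache_read_tokens": 900,    "cache_creation_tokens": 19_100},
]

for e in events:
    ratio = cache_hit_ratio(e)
    status = "ok" if ratio >= 0.5 else "cache-busting suspect"
    print(f"hit ratio {ratio:.0%}: {status}")
```

The 50% threshold is an arbitrary starting point; what matters is comparing a subagent's first call against the size of the shared prefix you expect it to inherit.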
Per-project tagging
Add OTEL_RESOURCE_ATTRIBUTES to each project’s settings to tag traces by project:
{
"env": {
"OTEL_RESOURCE_ATTRIBUTES": "loka=myproject"
}
}
Now you can filter traces by project in Jaeger and compare cache performance across repos.
Teammate pools vs fresh subagents: the caching trade-off
When an orchestrator spawns agents to do work, there are two patterns:
Fresh subagent (spawn per task): Each spawn starts a new conversation. Turn 1 pays full price on the agent body. If the subagent only lives 3 turns, 1/3 of its input was uncached.
Teammate in a pool (persist across tasks): The teammate persists. Turn 1 is a cold start, but turns 2, 3, 4… all cache the accumulated context. A teammate handling 4 tasks pays 1 cold start instead of 4.
The math
A 1,000-token agent body at Opus rates:
- Fresh subagent, 4 spawns: 4 x full-price first turn = ~$0.06
- Teammate, 4 tasks: 1 x full-price first turn + 3 x cached = ~$0.02
The savings compound with agent body size and spawn count.
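Spelled out, with full-price input at $15/M and cache reads at the 10% rate described earlier, the two patterns work out as follows:

```python
# Worked version of the fresh-vs-teammate math. Rates match the guide:
# $15/M full-price input; cache reads at 10% of the full price.
FULL = 15 / 1_000_000    # $ per input token
CACHED = FULL * 0.10     # $ per cache-read token
BODY = 1_000             # agent body size in tokens

# Fresh subagents: every spawn re-pays the body at full price on turn 1.
fresh_4_spawns = 4 * BODY * FULL

# Teammate pool: one cold start, then the body is a cache hit on later tasks.
teammate_4_tasks = BODY * FULL + 3 * BODY * CACHED

print(f"fresh:    ${fresh_4_spawns:.3f}")    # ~$0.06
print(f"teammate: ${teammate_4_tasks:.3f}")  # ~$0.02
```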
When to use which
Fresh subagents (default): when tasks are independent and don’t benefit from shared context. Simpler, no lifecycle management.
Teammate pools: when multiple tasks share context (same feature, same codebase area, cross-task pattern recognition). Worth the lifecycle complexity for the caching benefit and accumulated context.
The complete optimization checklist
For a new project:
- Create `.claude/rules/` with universal content (shared rules, conventions, preferences, project context)
- Keep `CLAUDE.md` lean — product summary and pointers only
- Write agent bodies with persona/judgment only — no data, no procedure specs
- Set `tools` in agent frontmatter to enforce role boundaries structurally
- Set `memory: project` for agents that accumulate learnings
- Write `description` as a delegation hint, not a role summary
- Target agent bodies under 1,500 tokens (audit if over 2,000)
- Enable OTel and verify cache_read_tokens are high on subagent spawns
- Use teammate pools for multi-task pipelines within a bounded scope
- Review agents quarterly — procedure creep is real
For an existing project with bloated agents:
- Run the persona/data audit on each agent
- Move data/reference content to `.claude/rules/project-context.md`
- Delete content that duplicates what’s already in `.claude/rules/`
- Remove startup checklist items that Read auto-loaded files
- Measure before and after with OTel traces
Cost math: why this matters
At Opus 4.6 rates ($15/M input, $1.50/M cached input):
| Scenario | Before optimization | After | Savings |
|---|---|---|---|
| Single agent spawn (2,800 token body) | $0.042 | $0.015 (1,000 token body) | 64% |
| 6-agent pipeline run (shared rules in messages) | $0.58 in redundant Reads | $0.00 (rules cached in prefix) | 100% |
| 30-turn session (2,800 vs 1,000 token agent) | $0.63 extra cached input | $0.23 extra cached input | 63% |
| Weekly pipeline runs (5 features x 6 agents) | ~$17.40 in agent loading | ~$6.90 | 60% |
Small per-spawn. Large at scale.
Open questions we’re still investigating
- Exact byte ordering of agent body vs rules in subagent system prompts. We believe `rules/` loads before the agent body (enabling cross-agent caching), but haven’t verified empirically with OTel traces yet.
- Cross-agent cache sharing for teammate pools. Teammates in the same team may share cache on the team infrastructure overhead. Untested.
- The `skills` frontmatter field as a replacement for startup Read calls. Skills preload content into the agent context without Read tool calls. Could be better than the `rules/` approach for agent-specific reference data.
- Optimal agent body size vs cache efficiency. There may be a sweet spot where the agent body is large enough to cache effectively but small enough to minimize per-turn cost. We haven’t found it yet — our approach has been “as small as possible while retaining judgment quality.”