The Complete Guide to Claude Code Agent Optimization
Claude Code is powerful, but every token costs money. When you’re running multi-agent pipelines that spawn dozens of subagents per feature, the difference between a well-structured setup and a naive one can be hundreds of dollars per week.
This guide covers everything we learned optimizing agent definitions across 9 production repositories, reducing agent sizes by 60-87% and enabling cross-agent cache sharing that wasn’t possible before.
How Claude’s prompt cache actually works
Every time you send a message, Claude receives the entire conversation — system prompt plus all messages, from the very first turn. The conversation payload grows monotonically with every turn.
Caching avoids re-processing the identical beginning of that payload. Anthropic’s servers compare the incoming byte stream against recent requests. If the exact bytes match from position 0, those bytes are a cache hit: 10% of the normal input cost and faster processing. The moment a single byte differs, the cache stops, and everything after that point is processed at full price.
This is byte-prefix matching, not file-level deduplication. Two requests that both contain shared-rules.md at different byte offsets do NOT share cache on that content. The cache only knows “are these bytes identical to the previous request, starting from byte 0?”
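The distinction can be made concrete with a toy model of the matcher. This is illustrative only (the real matcher is server-side and opaque); it just shows why identical content at different byte offsets contributes nothing:

```python
# Toy model of byte-prefix cache matching. Illustrative only -- the real
# matcher is server-side; this just shows why shared content at different
# offsets cannot hit the cache.

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the identical prefix, starting from byte 0."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

shared = b"# shared-rules.md\nAlways run the linter.\n"

# Same content, same offset: the whole shared block is in the common prefix.
req_a = b"HARNESS\n" + shared + b"agent: web-dev\n"
req_b = b"HARNESS\n" + shared + b"agent: backend-dev\n"
print(common_prefix_len(req_a, req_b))  # everything up to the differing agent name

# Same content, different offsets: the match dies at the first differing byte,
# and the identical shared block downstream contributes nothing.
req_c = b"agent: web-dev\n" + shared
req_d = b"agent: backend-dev\n" + shared
print(common_prefix_len(req_c, req_d))  # just the "agent: " bytes
```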
Minimum thresholds
The prefix must be long enough for caching to activate:
- Opus 4.6: 4,096 tokens minimum
- Sonnet 4.6 / Haiku 4.5: 2,048 tokens minimum
Below these thresholds, caching silently does nothing. No error, no warning — you just pay full price every turn.
What this means practically
The order of content in the system prompt determines how much of it caches. Shared content early = maximum cache sharing. Unique content early = cache divergence that prevents sharing of everything after it.
The system prompt loading order
Claude Code builds the system prompt in layers. Each layer is bytes appended after the previous. The ordering is:
Layer 1: Claude Code harness (~15,000 tokens)
Layer 2: CLAUDE.md + .claude/rules/ (project-specific, shared across agents)
Layer 3: Agent body (unique per agent type)
Layer 4: Messages / conversation (unique per session)
What caches across what
Layers 1-2 are identical for every agent in the same repo. Two different agent types (web-dev and backend-dev) both get the harness + CLAUDE.md + rules as their prefix. This caches automatically.
Layer 3 is where agents diverge. A web-dev agent has different body text than a backend-dev agent. From this point on, the cache is per-agent-type.
Layer 4 is per-session. Each conversation has its own message history.
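Under this layering, the divergence point between any two agents in the same repo falls exactly at the start of Layer 3. A sketch (layer contents are placeholders):

```python
# Sketch of the four-layer assembly. Layer contents are placeholders; the
# point is that two agents share bytes exactly through the end of Layer 2.
HARNESS = "harness " * 100            # Layer 1: identical everywhere
PROJECT = "CLAUDE.md + rules " * 50   # Layer 2: identical per repo

def system_prompt(agent_body: str, messages: str) -> str:
    return HARNESS + PROJECT + agent_body + messages  # Layers 1-4, in order

web = system_prompt("web-dev body", "session A msgs")
backend = system_prompt("backend-dev body", "session B msgs")

shared = 0
for a, b in zip(web, backend):
    if a != b:
        break
    shared += 1

# The shared (cacheable) prefix is exactly Layers 1+2; the bodies diverge
# at their first byte, so nothing after Layer 2 can cache across agents.
print(shared, len(HARNESS) + len(PROJECT))
```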
The caching tiers
| Tier | Content location | Caches across |
|---|---|---|
| Tier 1 | .claude/rules/*.md | All agent types, all turns |
| Tier 2 | Agent body (.claude/agents/*.md) | Same agent type, across turns |
| Tier 3 | Read results in messages (startup checklists) | Within same session only |
| Tier 4 | On-demand Read results | After the turn they’re loaded |
Moving content from Tier 3 to Tier 1 is the single biggest optimization available. It costs nothing to implement and the savings compound across every agent spawn.
The “Read shared-rules in the startup checklist” trap
This is the most common anti-pattern we found. Many agent definitions include a startup checklist like:
## Startup Checklist
1. Read `shared-rules.md`
2. Read `WORKING-AGREEMENTS.md`
3. Read `SESSION-GUIDE.md`
4. Show open tasks
The problem: when the agent Reads these files, the content lands in the message history (Tier 3), not in the system prompt (Tier 1). Since different agent types have different bodies (Layer 3), the cache has already diverged before reaching the messages. The shared-rules content cannot cache across agent types even though it’s identical.
The fix
Put shared content in .claude/rules/ instead:
.claude/rules/
shared-rules.md -> symlink to shared source
working-agreements.md -- project conventions
session-guide.md -- session architecture
project-context.md -- domain, tools, layout
Files in .claude/rules/ are auto-loaded into the system prompt at Layer 2 — before the agent body divergence point. No Read calls needed. Every agent type gets this content cached for free.
Result from our migration: ~6,400 tokens of universal content moved from Tier 3 to Tier 1. A pipeline run spawning 6 subagents saved ~38,400 tokens of redundant processing per run.
What goes where: the placement decision
| Content type | Put it in | Why |
|---|---|---|
| Cross-project rules | .claude/rules/ (symlink to shared source) | Tier 1 — cached across all agents |
| Project conventions | .claude/rules/working-agreements.md | Tier 1 — every agent follows these |
| Domain identity, tool references | .claude/rules/project-context.md | Tier 1 — every agent needs context |
| User preferences (accessibility, tooling) | .claude/rules/preferences.md (symlink) | Tier 1 — affects all agent output |
| Product summary | CLAUDE.md | Tier 1 — auto-loaded, cached |
| Role identity and judgment | Agent body (.claude/agents/) | Tier 2 — cached per agent type |
| Collaboration boundaries | Agent body (“What You Don’t Do”) | Tier 2 — per-agent routing |
| Reference data (tool syntax, file layouts) | On-demand Read | Tier 3-4 — loaded when needed |
| Task-specific data | Rally/Jira tools, logs | Not cached |
The key principle
Put shared content FIRST, unique content LAST. The cache matches from byte 0. Every byte of shared content before the divergence point is free across all agent types after the first call.
Separating persona from data
We audited 24 agent definitions across our repos and found a consistent pattern: most agents were carrying 50-70% data/reference content that didn’t belong in the agent body.
The persona/data audit method
For each section of an agent file, mark it as one of:
- Persona/judgment — role identity, collaboration boundaries, decision heuristics, failure modes. Keep in agent body.
- Data/reference — file paths, tool syntax, directory layouts, doc tables. Move to `.claude/rules/` or load on demand.
- Redundant — content that restates rules already in `.claude/rules/`. Delete.
Results from real agents
| Agent | Before | After | Reduction | Category breakdown |
|---|---|---|---|---|
| Coach (orchestrator) | 2,785 tokens | 1,030 | 63% | 27% persona, 54% data, 10% redundant |
| Competitive Marketing | 4,882 tokens | 620 | 87% | 9% persona, 45% data, 25% procedure |
| Web Dev | 866 tokens | — | Already efficient | Mostly persona |
The heuristic
The more an agent does judgment, the more belongs in the body. The more it does procedure, the more belongs in command specs or rules.
- Judgment-heavy agents (standards expert, UX reviewer) — already efficient. Their content is irreducible role knowledge.
- Procedure-heavy agents (coach, orchestrator, marketing) — high trim potential. Their bulk is process specs and reference data that should live elsewhere.
- Mixed agents (web-dev, backend-dev) — moderate trim, usually ~30%.
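The audit itself reduces to simple bookkeeping. A toy version, with invented section names and token counts purely for illustration:

```python
# Toy version of the persona/data audit: classify each section, keep only
# persona/judgment, and report the reduction. Section names and token
# counts are invented for illustration, not taken from a real agent.
sections = {
    "Role identity":         ("persona", 150),
    "Startup checklist":     ("persona", 120),
    "Tool syntax cookbook":  ("data", 900),
    "Directory layout":      ("data", 400),
    "Restated shared rules": ("redundant", 300),
}

before = sum(tokens for _, tokens in sections.values())
after = sum(tokens for kind, tokens in sections.values() if kind == "persona")
print(f"{before} -> {after} tokens ({100 * (before - after) // before}% reduction)")
```

Data sections move to `.claude/rules/`, redundant sections are deleted outright, and only the persona total remains in the agent body.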
What belongs in an agent body (and nothing else)
- Role identity (2-3 sentences) — what this agent is and what success looks like
- Startup checklist (3-5 items) — only agent-specific reads. Note that rules are auto-loaded.
- What You Own — responsibility boundaries
- What You Don’t Do — handoff routing to other agents. This is collaboration glue, not access control.
- Judgment rules — things NOT in shared rules. The stuff that makes this agent different.
- Key processes — one-line summaries with pointers to specs, not the specs inlined
What does NOT belong
- Content already in `.claude/rules/`
- Tool syntax cookbooks (put in `rules/project-context.md`)
- Directory layouts (put in `rules/project-context.md`)
- Full process specifications (point to the command file)
- Content that restates shared rules by number
- Transient migration notes
Agent frontmatter: the overlooked optimization lever
The .claude/agents/ format supports frontmatter fields that enforce behavior structurally — more reliable than prose instructions and zero extra tokens at runtime.
Key fields
---
name: web-dev
description: "Front-end development. Handles UI, components, responsive
design, accessibility, and integration tests. Spawned for Web Dev tasks."
tools: Bash, Read, Write, Edit, Glob, Grep, TodoWrite
model: inherit
color: green
memory: project
---
description — This is what Claude reads when deciding whether to delegate to this agent. Write it like “when to use this,” not a role summary. More important than the body for routing decisions.
tools — Whitelist of available tools. This is structural enforcement of role boundaries. A Coach that shouldn’t write code simply doesn’t have Edit in its tools list. A UX reviewer that shouldn’t spawn subagents doesn’t have Agent. Better than prose rules like “Don’t write code” because it’s impossible to violate.
memory — Persistent memory across spawns. Options: project (committed to repo), local (gitignored), user (per-machine). Use project for agents that accumulate learnings. Subagents do NOT inherit the parent session’s memory — they start fresh unless this field is set.
maxTurns — Prevents runaway agents. Set based on expected task complexity.
effort — Override reasoning depth. low for simple tasks saves tokens.
The subagent memory gap
This one catches people. Auto-memory (your MEMORY.md and memory files) is loaded into the main session’s system prompt. When the main session spawns a subagent, that subagent gets a fresh context — it does NOT inherit the parent’s memory.
This means preferences saved to memory (“Charles is colorblind,” “use uv not pip”) are invisible to subagents. They need to be in one of:
- `.claude/rules/` — auto-loaded for everyone, including subagents
- The task prompt — explicitly stated when spawning the subagent
- Agent-specific memory — via the `memory` frontmatter field (separate memory per agent type)
We moved five preference-type memories to .claude/rules/ to close this gap. Subagents now see them automatically.
Measuring cache performance with OpenTelemetry
Theory is nice, but you need to verify caching is working. Claude Code has built-in OpenTelemetry support that reports cache performance on every API call.
Setup
# In ~/.claude/settings.json env block:
CLAUDE_CODE_ENABLE_TELEMETRY=1
CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp
OTEL_TRACES_EXPORTER=otlp
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_LOG_TOOL_DETAILS=1
Point it at a local Jaeger instance (Docker one-liner) and you get full trace visibility.
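The Docker one-liner is roughly the following. The image name and ports are the standard Jaeger all-in-one defaults (OTLP gRPC ingest on 4317, UI on 16686); `COLLECTOR_OTLP_ENABLED` is required on older Jaeger versions and harmless on newer ones where OTLP is on by default:

```shell
# Local Jaeger with OTLP ingest on 4317 (gRPC) and the web UI on 16686.
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 -p 4317:4317 \
  jaegertracing/all-in-one:latest
# UI: http://localhost:16686
```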
Key fields to watch
Every claude_code.api_request event includes:
- `input_tokens` — total input tokens
- `output_tokens` — total output tokens
- `cache_read_tokens` — how many input tokens were a cache hit
- `cache_creation_tokens` — how many built new cache
High cache_read_tokens on a subagent’s first API call = the shared prefix is caching across agent types. Low = something is busting the cache.
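A quick way to act on those fields is to compute a hit ratio per request and flag cold spawns. The event shape below is reduced to just the counters discussed above; real OTel events carry more fields:

```python
# Flag api_request events whose input was mostly uncached. Events here are
# reduced to the token counters discussed above; real events carry more.
def cache_hit_ratio(event: dict) -> float:
    total = event["input_tokens"]
    return event["cache_read_tokens"] / total if total else 0.0

events = [
    {"input_tokens": 20_000, "cache_read_tokens": 18_500, "cache_creation_tokens": 1_500},
    {"input_tokens": 20_000, "cache_read_tokens": 900,    "cache_creation_tokens": 19_100},
]

for e in events:
    ratio = cache_hit_ratio(e)
    status = "ok" if ratio >= 0.5 else "cache-busting suspect"
    print(f"hit ratio {ratio:.0%}: {status}")
```

The 50% threshold is an arbitrary starting point; what matters is comparing a subagent's first call against the size of the shared prefix you expect it to inherit.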
Per-project tagging
Add OTEL_RESOURCE_ATTRIBUTES to each project’s settings to tag traces by project:
{
"env": {
"OTEL_RESOURCE_ATTRIBUTES": "loka=myproject"
}
}
Now you can filter traces by project in Jaeger and compare cache performance across repos.
Teammate pools vs fresh subagents: the caching trade-off
When an orchestrator spawns agents to do work, there are two patterns:
Fresh subagent (spawn per task): Each spawn starts a new conversation. Turn 1 pays full price on the agent body. If the subagent only lives 3 turns, 1/3 of its input was uncached.
Teammate in a pool (persist across tasks): The teammate persists. Turn 1 is a cold start, but turns 2, 3, 4… all cache the accumulated context. A teammate handling 4 tasks pays 1 cold start instead of 4.
The math
A 1,000-token agent body at Opus rates:
- Fresh subagent, 4 spawns: 4 x full-price first turn = ~$0.06
- Teammate, 4 tasks: 1 x full-price first turn + 3 x cached = ~$0.02
The savings compound with agent body size and spawn count.
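Spelled out, with full-price input at $15/M and cache reads at the 10% rate described earlier, the two patterns work out as follows:

```python
# Worked version of the fresh-vs-teammate math. Rates match the guide:
# $15/M full-price input; cache reads at 10% of the full price.
FULL = 15 / 1_000_000    # $ per input token
CACHED = FULL * 0.10     # $ per cache-read token
BODY = 1_000             # agent body size in tokens

# Fresh subagents: every spawn re-pays the body at full price on turn 1.
fresh_4_spawns = 4 * BODY * FULL

# Teammate pool: one cold start, then the body is a cache hit on later tasks.
teammate_4_tasks = BODY * FULL + 3 * BODY * CACHED

print(f"fresh:    ${fresh_4_spawns:.3f}")    # ~$0.06
print(f"teammate: ${teammate_4_tasks:.3f}")  # ~$0.02
```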
When to use which
Fresh subagents (default): when tasks are independent and don’t benefit from shared context. Simpler, no lifecycle management.
Teammate pools: when multiple tasks share context (same feature, same codebase area, cross-task pattern recognition). Worth the lifecycle complexity for the caching benefit and accumulated context.
The complete optimization checklist
For a new project:
- Create `.claude/rules/` with universal content (shared rules, conventions, preferences, project context)
- Keep `CLAUDE.md` lean — product summary and pointers only
- Write agent bodies with persona/judgment only — no data, no procedure specs
- Set `tools` in agent frontmatter to enforce role boundaries structurally
- Set `memory: project` for agents that accumulate learnings
- Write `description` as a delegation hint, not a role summary
- Target agent bodies under 1,500 tokens (audit if over 2,000)
- Enable OTel and verify cache_read_tokens are high on subagent spawns
- Use teammate pools for multi-task pipelines within a bounded scope
- Review agents quarterly — procedure creep is real
For an existing project with bloated agents:
- Run the persona/data audit on each agent
- Move data/reference content to `.claude/rules/project-context.md`
- Delete content that duplicates what’s already in `.claude/rules/`
- Remove startup checklist items that Read auto-loaded files
- Measure before and after with OTel traces
Cost math: why this matters
At Opus 4.6 rates ($15/M input, $1.50/M cached input):
| Scenario | Before optimization | After | Savings |
|---|---|---|---|
| Single agent spawn (2,800 token body) | $0.042 | $0.015 (1,000 token body) | 64% |
| 6-agent pipeline run (shared rules in messages) | $0.58 in redundant Reads | $0.00 (rules cached in prefix) | 100% |
| 30-turn session (2,800 vs 1,000 token agent) | $0.63 extra cached input | $0.23 extra cached input | 63% |
| Weekly pipeline runs (5 features x 6 agents) | ~$17.40 in agent loading | ~$6.90 | 60% |
Small per-spawn. Large at scale.
Open questions we’re still investigating
- Exact byte ordering of agent body vs rules in subagent system prompts. We believe `rules/` loads before the agent body (enabling cross-agent caching), but haven’t verified empirically with OTel traces yet.
- Cross-agent cache sharing for teammate pools. Teammates in the same team may share cache on the team infrastructure overhead. Untested.
- The `skills` frontmatter field as a replacement for startup Read calls. Skills preload content into the agent context without Read tool calls. Could be better than the `rules/` approach for agent-specific reference data.
- Optimal agent body size vs cache efficiency. There may be a sweet spot where the agent body is large enough to cache effectively but small enough to minimize per-turn cost. We haven’t found it yet — our approach has been “as small as possible while retaining judgment quality.”