Prompt Caching Basics: Stop Paying Full Price for the Same Tokens Twice
Every time you call the Claude API, you send the entire conversation: system prompt, tool definitions, and all messages from the very first turn. By default, you pay full input price for every byte of it, every single time.
Even if 95% of that payload is identical to the previous request. Even if you’re literally just appending “okay, now do step 2.” Claude bills you as if it has never seen any of it before.
Prompt caching changes that. Repeated content costs 10% of the normal input rate. A 50,000-token system prompt that costs $0.15 to send normally costs $0.015 cached. That’s before you multiply it across every request in a long session.
Let’s look at how dramatic that actually gets.
The math that should make you stop and fix this today
Say you have a 50,000-token system prompt (not unusual if you’re sending tool schemas, knowledge bases, or a substantial CLAUDE.md equivalent). You run 100 requests in a session.
Without caching (as of 2026-04-14, Sonnet 4.6 at $3/M input):
50,000 tokens × 100 requests × $3/M = $15.00
With caching (one write, 99 reads):
Write: 50,000 × $3/M × 1.25 = $0.1875 (write costs 25% more than input)
Reads: 50,000 × 99 × $0.30/M = $1.485 (reads cost 10% of normal input)
Total: $1.67
That’s 9× cheaper for the exact same behavior. No prompt changes, no quality trade-off, no behavior change. Just the same bytes going through a cache instead of a re-processor.
The write premium pays for itself on the very first cache read, so the second request is already cheaper than an uncached one. Everything after that is pure savings.
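That arithmetic is easy to sanity-check yourself. A minimal sketch using the Sonnet 4.6 rates quoted above (the `session_cost` helper is mine, not an SDK function):

```python
# Reproduces the session math above using the Sonnet 4.6 rates from this post.
INPUT_RATE = 3.00 / 1_000_000       # $ per input token
WRITE_PREMIUM = 1.25                # 5-minute-TTL cache write costs 25% extra
READ_RATE = 0.10 * INPUT_RATE       # cached reads cost 10% of the input rate

def session_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Total input cost for a session that resends the same prefix every request."""
    if not cached:
        return prefix_tokens * requests * INPUT_RATE
    write = prefix_tokens * INPUT_RATE * WRITE_PREMIUM   # first request writes the cache
    reads = prefix_tokens * (requests - 1) * READ_RATE   # remaining requests read it
    return write + reads

print(round(session_cost(50_000, 100, cached=False), 2))  # 15.0
print(round(session_cost(50_000, 100, cached=True), 2))   # 1.67
```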
How caching actually works
Anthropic’s cache is byte-prefix matching from position 0. That’s it. No semantic deduplication, no file-level hashing. The incoming byte stream is compared against recent requests byte by byte from the start. The moment a byte differs, the cache stops and everything from that point forward pays full price.
Three rules govern how this plays out in practice:
Rule 1: Identical bytes from byte 0 hit the cache. You can’t cache “middle” content. You can only cache prefixes. Content at the start of the payload caches best.
Rule 2: The payload has a fixed hierarchy. Tools are sent first, then system prompt, then messages. A change anywhere in that hierarchy invalidates everything below it. Change a tool definition? The entire request starts over, including your system prompt and every message.
Rule 3: There’s a minimum prefix length. Below the threshold, caching silently does nothing:
- Opus 4.6: 4,096 tokens minimum
- Sonnet 4.6 / Haiku 4.5: 2,048 tokens minimum
No error, no warning. If your system prompt is 1,800 tokens and you’re on Sonnet, you’re just not caching. You won’t know unless you look at the response fields.
Two ways to enable caching
Prompt caching requires explicit opt-in — it is not on by default. There are two ways to enable it.
Explicit cache breakpoints give you fine-grained control: you add a cache_control marker directly to specific content blocks (system prompt, tool definitions, messages). The API caches everything up to and including that marker.
Automatic caching is a simpler approach added to the API: add a single top-level cache_control field to your request and the API handles breakpoint placement automatically, moving the cache point forward as the conversation grows. It still requires the cache_control field; nothing is cached without it.
Which to use: explicit breakpoints for complex prompts with multiple sections that change at different rates; automatic caching for straightforward multi-turn conversations where you just want the history to cache. The source of truth is Anthropic’s prompt caching docs [1].
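As a sketch of the explicit-breakpoint shape (following the request format in Anthropic's prompt caching docs [1]; the prompt text here is a placeholder):

```python
# Explicit cache breakpoint on the system prompt. The API caches everything
# up to and including the block carrying cache_control.
LONG_SYSTEM_PROMPT = "...your 50,000-token system prompt..."  # placeholder

request_body = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Remember the minimum: 2,048 tokens on Sonnet, or this is a silent no-op.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "okay, now do step 2"}],
}
```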
TTL: how long does the cache live?
Two options (as of 2026-04-14):
| TTL | Write cost premium | Read cost |
|---|---|---|
| 5 minutes (default) | +25% over normal input | 10% of normal input |
| 1 hour (opt-in) | +100% over normal input | 10% of normal input |
The 1-hour TTL doubles your write cost. Whether that’s worth it depends on your usage pattern: the 5-minute cache expires if more than five minutes pass between requests, so if your calls against the same prompt are spread out over an hour, the 1-hour TTL wins on total cost. If you’re making quick back-to-back calls, the 5-minute default is fine.
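Opting in to the longer TTL is, as a sketch, a ttl field on the cache_control block. This shipped behind a beta header at one point, so verify the current shape against the prompt caching docs [1] before relying on it:

```python
# Sketch: opting in to the 1-hour TTL on a cached system block.
# Verify the exact field name and any required beta header in the docs [1].
system_block = {
    "type": "text",
    "text": "...your long shared prefix...",  # placeholder
    "cache_control": {"type": "ephemeral", "ttl": "1h"},  # default is "5m"
}
```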
The subscription tier gotcha you need to know about
Whether you actually get 1-hour TTL isn’t just a settings toggle — it depends on how you’re paying and whether telemetry is enabled.
| Situation | Main session TTL | Subagent TTL |
|---|---|---|
| API customer | 5m default | 5m default |
| Claude Pro/Max subscription | 1h (experiment rollout) | 5m (intentional) |
| Subscription + telemetry OFF | 5m (gates disabled) | 5m |
The relevant detail [2]: the 1-hour cache is rolled out as an experiment for subscribers, gated by client-side feature flags. Those flags are fetched via telemetry. Turn telemetry off, the experiment gate doesn’t load, and your cache silently downgrades from 1 hour to 5 minutes.
Subagents are intentionally kept at 5 minutes even for subscribers — the reasoning from Anthropic is that subagents are rarely resumed, so paying for a 1-hour window would be waste for most use cases.
Practical implication: if you disabled telemetry thinking it was a privacy or cost win, you may have silently cut your cache TTL by 12×. At scale that’s a significant unintended cost increase.
Current pricing (as of 2026-04-14)
Per million tokens:
| Model | Input | Output | Cache write (5m) | Cached read |
|---|---|---|---|---|
| Opus 4.6 | $5.00 | $25.00 | $6.25 | $0.50 |
| Sonnet 4.6 | $3.00 | $15.00 | $3.75 | $0.30 |
| Haiku 4.5 | $1.00 | $5.00 | $1.25 | $0.10 |
Cache write cost = input rate + 25% (5-min TTL). Cached read = 10% of input rate. Anthropic adjusts pricing periodically — verify current rates before building a business case.
What breaks your cache
This is the most valuable section in the post. Know this list cold.
The hierarchy is tools → system prompt → messages. A change at any level invalidates that level and everything below it.
| Change | What it invalidates |
|---|---|
| Adding or removing an MCP server | Everything (tools changed) |
| Editing any tool definition | Everything |
| Editing your system prompt or CLAUDE.md | System + messages |
| Switching models | System + messages |
| Injecting dynamic content into the system prompt (timestamps, session IDs) | System + messages — on every single request |
| Changing the tool_choice parameter | Messages only |
| Adding or removing images | Messages only |
The biggest silent killer: hooks that inject dynamic content into the system prompt. This looks harmless in isolation:
```json
{
  "hooks": {
    "PreToolUse": [{
      "command": "echo \"Current time: $(date)\""
    }]
  }
}
```
If this timestamp or anything like it ends up in the system prompt, the cache is busted on every request. Every request pays full input price, forever, for the entire system prompt. It’s the kind of thing that costs you real money without any visible warning.
The fix is to keep dynamic content in the messages layer, not the system prompt. Or avoid injecting it at all if it’s not genuinely necessary.
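As a sketch of that fix (the request shape follows the Messages API; the prompt text is a placeholder):

```python
from datetime import datetime, timezone

STATIC_SYSTEM_PROMPT = "...stable instructions, tool docs, knowledge base..."  # placeholder

# Bad: a timestamp in the system prompt means the bytes differ on every
# request, so the whole system prompt re-pays full input price every time.
bad_system = f"{STATIC_SYSTEM_PROMPT}\nCurrent time: {datetime.now(timezone.utc).isoformat()}"

# Better: keep the system prompt byte-stable and carry the timestamp in the
# newest user message. Everything before that message can still hit the cache.
good_request = {
    "system": [{"type": "text", "text": STATIC_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}}],
    "messages": [
        {"role": "user",
         "content": f"(current time: {datetime.now(timezone.utc).isoformat()}) "
                    "okay, now do step 2"},
    ],
}
```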
The MCP server gotcha (adding/removing tools invalidates everything) is severe enough to warrant its own post — it’s a common Claude Code workflow pattern that accidentally tanks caching. More on that in The MCP Tool Prefix Trap.
Verifying that caching is actually working
Don’t assume it’s working. Check.
Every API response includes a usage object. When caching is working, you’ll see:
```json
{
  "model": "claude-sonnet-4-6",
  "usage": {
    "input_tokens": 1,
    "output_tokens": 67,
    "cache_creation_input_tokens": 287,
    "cache_read_input_tokens": 30433
  }
}
```
cache_read_input_tokens: 30433 means 30,433 tokens were served from cache at 10% of normal cost. cache_creation_input_tokens: 287 means 287 tokens were written to cache for next time. input_tokens: 1 is the new input not covered by cache.
What you want to see: high cache_read_input_tokens on repeated calls with the same prefix. What a problem looks like: cache_read_input_tokens: 0 on every call, or cache_creation_input_tokens matching your full system prompt size on every call (cache writes that never convert to reads mean the TTL is expiring before your next request).
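If you want that check in code, a small helper (mine, not part of any SDK) over the usage object shown above:

```python
# Computes the fraction of input tokens served from cache for one response.
# Field names match the API usage object shown above.
def cache_hit_rate(usage: dict) -> float:
    """Fraction of this request's input tokens that came from the cache."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0

usage = {"input_tokens": 1, "output_tokens": 67,
         "cache_creation_input_tokens": 287, "cache_read_input_tokens": 30433}
print(f"{cache_hit_rate(usage):.1%}")  # → 99.1%
```

A rate near zero on repeated-prefix calls is the signal that something in the list above is busting your cache.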
For Claude Code users
Claude Code has OpenTelemetry support that surfaces these same metrics per API call. If you’re running multi-agent pipelines, OTel traces let you see cache hit rates per agent spawn, identify which agents aren’t caching their shared prefix, and measure improvement after optimization. The agent optimization post goes deep on the OTel setup and what to watch.
What’s coming in this series
This post covers the foundations. Three follow-ups dig into where caching breaks down in production:
The MCP server cost trap — every time you add or remove an MCP server mid-session, you’re busting cache on the entire payload. Claude Code workflows that dynamically add tools are particularly exposed. The numbers are ugly.
Skills vs Rules vs Read calls — where you put shared content (in-prompt vs rules files vs on-demand reads) determines whether it caches across agent types or only within a session. The placement decision has a large cost impact.
The subagent caching problem — when caching breaks even when you’ve configured it correctly. Subagents don’t inherit the parent session’s cache, and the minimum-threshold rule means short subagent tasks may not cache at all.
If you’re running Claude at any real scale and haven’t looked at your cache_read_input_tokens numbers, that’s the first thing to fix. Open a response log, find that field, and see what percentage of your input tokens are being cached. If it’s below 50% on repeated-pattern workflows, there’s money on the table.
Sources
- [1] Prompt caching — Anthropic API Docs — canonical reference for caching mechanics, TTL options, and cache_control placement
- [2] anthropics/claude-code#46829 — Anthropic explanation of the 1-hour TTL subscription rollout, the telemetry gate, and the deliberate 5-minute subagent decision