Prompt Caching Basics: Stop Paying Full Price for the Same Tokens Twice
Every time you call the Claude API, you send the entire conversation: system prompt, tool definitions, and all messages from the very first turn. By default, you pay full input price for every byte of it, every single time.
Even if 95% of that payload is identical to the previous request. Even if you’re literally just appending “okay, now do step 2.” Claude bills you as if it has never seen any of it before.
Prompt caching changes that. Repeated content costs 10% of the normal input rate. A 50,000-token system prompt that costs $0.15 to send normally costs $0.015 cached. That’s before you multiply it across every request in a long session.
Let’s look at how dramatic that actually gets.
The math that should make you stop and fix this today
Say you have a 50,000-token system prompt (not unusual if you’re sending tool schemas, knowledge bases, or a substantial CLAUDE.md equivalent). You run 100 requests in a session.
Without caching (as of 2026-04-14, Sonnet 4.6 at $3/M input):
50,000 tokens × 100 requests × $3/M = $15.00
With caching (one write, 99 reads):
Write: 50,000 × $3/M × 1.25 = $0.1875 (write costs 25% more than input)
Reads: 50,000 × 99 × $0.30/M = $1.485 (reads cost 10% of normal input)
Total: $1.67
That’s 9× cheaper for the exact same behavior. No prompt changes, no quality trade-off, no behavior change. Just the same bytes going through a cache instead of a re-processor.
The write premium pays for itself on the very first cache read, so the second request is already cheaper than an uncached one. Everything after that is pure savings.
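That arithmetic is easy to sanity-check yourself. A minimal sketch using the Sonnet 4.6 rates quoted above (the `session_cost` helper is mine, not an SDK function):

```python
# Reproduces the session math above using the Sonnet 4.6 rates from this post.
INPUT_RATE = 3.00 / 1_000_000       # $ per input token
WRITE_PREMIUM = 1.25                # 5-minute-TTL cache write costs 25% extra
READ_RATE = 0.10 * INPUT_RATE       # cached reads cost 10% of the input rate

def session_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Total input cost for a session that resends the same prefix every request."""
    if not cached:
        return prefix_tokens * requests * INPUT_RATE
    write = prefix_tokens * INPUT_RATE * WRITE_PREMIUM   # first request writes the cache
    reads = prefix_tokens * (requests - 1) * READ_RATE   # remaining requests read it
    return write + reads

print(round(session_cost(50_000, 100, cached=False), 2))  # 15.0
print(round(session_cost(50_000, 100, cached=True), 2))   # 1.67
```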
How caching actually works
Anthropic’s cache is byte-prefix matching from position 0. That’s it. No semantic deduplication, no file-level hashing. The incoming byte stream is compared against recent requests byte by byte from the start. The moment a byte differs, the cache stops and everything from that point forward pays full price.
Three rules govern how this plays out in practice:
Rule 1: Identical bytes from byte 0 hit the cache. You can’t cache “middle” content. You can only cache prefixes. Content at the start of the payload caches best.
Rule 2: The payload has a fixed hierarchy. Tools are sent first, then system prompt, then messages. A change anywhere in that hierarchy invalidates everything below it. Change a tool definition? The entire request starts over, including your system prompt and every message.
Rule 3: There’s a minimum prefix length. Below the threshold, caching silently does nothing:
- Opus 4.6: 4,096 tokens minimum
- Sonnet 4.6 / Haiku 4.5: 2,048 tokens minimum
No error, no warning. If your system prompt is 1,800 tokens and you’re on Sonnet, you’re just not caching. You won’t know unless you look at the response fields.
Two ways to enable caching
Prompt caching requires explicit opt-in — it is not on by default. There are two ways to enable it.
Explicit cache breakpoints give you fine-grained control: you add a cache_control marker directly to specific content blocks (system prompt, tool definitions, messages). The API caches everything up to and including that marker.
Automatic caching is a simpler approach added to the API: add a single top-level cache_control field to your request and the API handles breakpoint placement automatically, moving the cache point forward as the conversation grows. It still requires the cache_control field; nothing is cached without it.
Which to use: explicit breakpoints for complex prompts with multiple sections that change at different rates; automatic caching for straightforward multi-turn conversations where you just want the history to cache. The source of truth is Anthropic’s prompt caching docs [1].
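As a sketch of the explicit-breakpoint shape (following the request format in Anthropic's prompt caching docs [1]; the prompt text here is a placeholder):

```python
# Explicit cache breakpoint on the system prompt. The API caches everything
# up to and including the block carrying cache_control.
LONG_SYSTEM_PROMPT = "...your 50,000-token system prompt..."  # placeholder

request_body = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Remember the minimum: 2,048 tokens on Sonnet, or this is a silent no-op.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "okay, now do step 2"}],
}
```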
TTL: how long does the cache live?
Two options (as of 2026-04-14):
| TTL | Write cost premium | Read cost |
|---|---|---|
| 5 minutes (default) | +25% over normal input | 10% of normal input |
| 1 hour (opt-in) | +100% over normal input | 10% of normal input |
The 1-hour TTL doubles your write cost. Whether that’s worth it depends on your usage pattern: the 5-minute cache expires if more than five minutes pass between requests, so if your calls against the same prompt are spread out over an hour, the 1-hour TTL wins on total cost. If you’re making quick back-to-back calls, the 5-minute default is fine.
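Opting in to the longer TTL is, as a sketch, a ttl field on the cache_control block. This shipped behind a beta header at one point, so verify the current shape against the prompt caching docs [1] before relying on it:

```python
# Sketch: opting in to the 1-hour TTL on a cached system block.
# Verify the exact field name and any required beta header in the docs [1].
system_block = {
    "type": "text",
    "text": "...your long shared prefix...",  # placeholder
    "cache_control": {"type": "ephemeral", "ttl": "1h"},  # default is "5m"
}
```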
The subscription tier gotcha you need to know about
Whether you actually get 1-hour TTL isn’t just a settings toggle — it depends on how you’re paying and whether telemetry is enabled.
| Situation | Main session TTL | Subagent TTL |
|---|---|---|
| API customer | 5m default | 5m default |
| Claude Pro/Max subscription | 1h (experiment rollout) | 5m (intentional) |
| Subscription + telemetry OFF | 5m (gates disabled) | 5m |
The relevant detail [2]: the 1-hour cache is rolled out as an experiment for subscribers, gated by client-side feature flags. Those flags are fetched via telemetry. Turn telemetry off, the experiment gate doesn’t load, and your cache silently downgrades from 1 hour to 5 minutes.
Subagents are intentionally kept at 5 minutes even for subscribers — the reasoning from Anthropic is that subagents are rarely resumed, so paying for a 1-hour window would be waste for most use cases.
Practical implication: if you disabled telemetry thinking it was a privacy or cost win, you may have silently cut your cache TTL by 12×. At scale that’s a significant unintended cost increase.
Current pricing (as of 2026-04-14)
Per million tokens:
| Model | Input | Output | Cache write (5m) | Cached read |
|---|---|---|---|---|
| Opus 4.6 | $5.00 | $25.00 | $6.25 | $0.50 |
| Sonnet 4.6 | $3.00 | $15.00 | $3.75 | $0.30 |
| Haiku 4.5 | $1.00 | $5.00 | $1.25 | $0.10 |
Cache write cost = input rate + 25% (5-min TTL). Cached read = 10% of input rate. Anthropic adjusts pricing periodically — verify current rates before building a business case.
What breaks your cache
This is the most valuable section in the post. Know this list cold.
The hierarchy is tools → system prompt → messages. A change at any level invalidates that level and everything below it.
| Change | What it invalidates |
|---|---|
| Adding or removing an MCP server | Everything (tools changed) |
| Editing any tool definition | Everything |
| Editing your system prompt or CLAUDE.md | System + messages |
| Switching models | System + messages |
| Injecting dynamic content into the system prompt (timestamps, session IDs) | System + messages — on every single request |
| Changing the tool_choice parameter | Messages only |
| Adding or removing images | Messages only |
The biggest silent killer: hooks that inject dynamic content into the system prompt. This looks harmless in isolation:
```json
{
  "hooks": {
    "PreToolUse": [{
      "command": "echo \"Current time: $(date)\""
    }]
  }
}
```
If this timestamp or anything like it ends up in the system prompt, the cache is busted on every request. Every request pays full input price, forever, for the entire system prompt. It’s the kind of thing that costs you real money without any visible warning.
The fix is to keep dynamic content in the messages layer, not the system prompt. Or avoid injecting it at all if it’s not genuinely necessary.
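As a sketch of that fix (the request shape follows the Messages API; the prompt text is a placeholder):

```python
from datetime import datetime, timezone

STATIC_SYSTEM_PROMPT = "...stable instructions, tool docs, knowledge base..."  # placeholder

# Bad: a timestamp in the system prompt means the bytes differ on every
# request, so the whole system prompt re-pays full input price every time.
bad_system = f"{STATIC_SYSTEM_PROMPT}\nCurrent time: {datetime.now(timezone.utc).isoformat()}"

# Better: keep the system prompt byte-stable and carry the timestamp in the
# newest user message. Everything before that message can still hit the cache.
good_request = {
    "system": [{"type": "text", "text": STATIC_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}}],
    "messages": [
        {"role": "user",
         "content": f"(current time: {datetime.now(timezone.utc).isoformat()}) "
                    "okay, now do step 2"},
    ],
}
```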
The MCP server gotcha (adding/removing tools invalidates everything) is severe enough to warrant its own post — it’s a common Claude Code workflow pattern that accidentally tanks caching. More on that in The MCP Tool Prefix Trap.
Verifying that caching is actually working
Don’t assume it’s working. Check.
Every API response includes a usage object. When caching is working, you’ll see:
```json
{
  "model": "claude-sonnet-4-6",
  "usage": {
    "input_tokens": 1,
    "output_tokens": 67,
    "cache_creation_input_tokens": 287,
    "cache_read_input_tokens": 30433
  }
}
```
cache_read_input_tokens: 30433 means 30,433 tokens were served from cache at 10% of normal cost. cache_creation_input_tokens: 287 means 287 tokens were written to cache for next time. input_tokens: 1 is the new input not covered by cache.
What you want to see: high cache_read_input_tokens on repeated calls with the same prefix. What a problem looks like: cache_read_input_tokens: 0 on every call, or cache_creation_input_tokens matching your full system prompt size on every call (cache writes that never convert to reads mean the TTL is expiring before your next request).
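If you want that check in code, a small helper (mine, not part of any SDK) over the usage object shown above:

```python
# Computes the fraction of input tokens served from cache for one response.
# Field names match the API usage object shown above.
def cache_hit_rate(usage: dict) -> float:
    """Fraction of this request's input tokens that came from the cache."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0

usage = {"input_tokens": 1, "output_tokens": 67,
         "cache_creation_input_tokens": 287, "cache_read_input_tokens": 30433}
print(f"{cache_hit_rate(usage):.1%}")  # → 99.1%
```

A rate near zero on repeated-prefix calls is the signal that something in the list above is busting your cache.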
For Claude Code users
Claude Code has OpenTelemetry support that surfaces these same metrics per API call. If you’re running multi-agent pipelines, OTel traces let you see cache hit rates per agent spawn, identify which agents aren’t caching their shared prefix, and measure improvement after optimization. The agent optimization post goes deep on the OTel setup and what to watch.
What’s coming in this series
This post covers the foundations. Three follow-ups dig into where caching breaks down in production:
The MCP server cost trap — every time you add or remove an MCP server mid-session, you’re busting cache on the entire payload. Claude Code workflows that dynamically add tools are particularly exposed. The numbers are ugly.
Skills vs Rules vs Read calls — where you put shared content (in-prompt vs rules files vs on-demand reads) determines whether it caches across agent types or only within a session. The placement decision has a large cost impact.
The subagent caching problem — when caching breaks even when you’ve configured it correctly. Subagents don’t inherit the parent session’s cache, and the minimum-threshold rule means short subagent tasks may not cache at all.
If you’re running Claude at any real scale and haven’t looked at your cache_read_input_tokens numbers, that’s the first thing to fix. Open a response log, find that field, and see what percentage of your input tokens are being cached. If it’s below 50% on repeated-pattern workflows, there’s money on the table.
Sources
- [1] Prompt caching — Anthropic API Docs — canonical reference for caching mechanics, TTL options, and cache_control placement
- [2] anthropics/claude-code#46829 — Anthropic explanation of the 1-hour TTL subscription rollout, the telemetry gate, and the deliberate 5-minute subagent decision