LumoMate
Blog/Essays/AI Infrastructure

Prompt Caching for Production LLM Apps in 2026: An Honest Cost-Control Playbook

An honest 2026 playbook for prompt caching in production LLM apps. How Anthropic, OpenAI, and Google price cached input; minimum token thresholds and TTLs; how to structure prompts for >80% cache hit rates without locking yourself in.

Key takeaways

  • Prompt caching is now a first-class feature on Anthropic Claude, OpenAI, and Google Gemini — and it is the single biggest cost lever most small teams have not yet pulled.
  • The three vendors differ on price and on mental model: Anthropic asks you to mark cache breakpoints explicitly and charges cache reads at 0.1× of base input; OpenAI applies a 50% discount automatically on prompts ≥ 1,024 tokens; Google Gemini supports both an implicit auto-cache and an explicit cache with developer-controlled TTL.
  • The win shows up only when your prompt is structured prefix-first. Static system prompt → fixed few-shot → reference docs → growing history → unique user query. Any change in earlier layers invalidates everything after it.
  • Common pitfalls: prompts spaced longer than the TTL between requests, drifting whitespace in system prompts, and output-heavy workloads where caching only helps the input side.
  • For small teams, the right pattern is “cache where you already pay for repetition,” not “cache everywhere.” Pick one or two hot paths, measure hit rate, then expand.
Diagram 1 — conceptual view of Prompt Caching Production Llm
FIG. 1Key takeaways — a one-glance view of the structure described in this section.

Why this is a 2026 conversation

Prompt caching is not new — Anthropic shipped its beta in late 2024 and OpenAI followed shortly after. What changed in 2026 is that the feature is now general-availability on every frontier vendor, with the cost mechanics documented well enough that you can actually plan around them. For a small team running a customer-facing AI feature with the same system prompt across thousands of requests, this is the difference between a token bill that scales linearly with users and one that flattens.

Diagram 2 — conceptual view of Prompt Caching Production Llm
FIG. 2Why this is a 2026 conversation — a one-glance view of the structure described in this section.

Two underused facts make the case concrete. On Anthropic, a cache read is billed at 0.1× the base input price, which is a 90% discount on the cached portion of every subsequent call within the cache TTL. On OpenAI, cache hits on prompts ≥ 1,024 tokens are billed at 50% of base input and applied automatically. Neither vendor charges you for misses; you just pay the normal input rate.

What “prompt caching” actually does

All three vendors implement the same shape with different ergonomics: the model server keeps a fingerprint of the longest stable prefix of your prompt and, when a new request comes in that begins with the same bytes, skips the prefill work for that prefix. You get two benefits — a lower per-token price on the cached input and a faster time-to-first-token because prefill is the slow part of inference.

The differences live in how you opt in, how long the cache lasts, and what the minimum cacheable size is. Those are also the three places small-team implementations get tripped up.

The 2026 vendor matrix

  • Vendor — Opt-in shape — Cache-read price — Default TTL — Minimum cacheable tokens
  • Anthropic Claude — Explicit — mark up to four cache_control breakpoints in the prompt — 0.1× base input on cache reads. Cache writes cost 1.25× (5-min TTL) or 2× (1-hour TTL). — 5 minutes; 1-hour option at 2× write cost — 1,024 tokens for Sonnet 4.6 / 4.5; 4,096 tokens for Opus 4.x and Haiku 4.5
  • OpenAI — Automatic — no flag needed — 50% discount on the cached prefix; misses billed at full input rate — Not user-controlled; cache survives for the model server’s eviction window — 1,024 tokens; matches extend in 128-token increments
  • Google Gemini — Two modes — implicit (auto, on by default for Gemini 2.5+) and explicit (developer-managed) — Discount varies by model; explicit cache adds an hourly storage fee for the cached bytes — Default 1 hour for explicit cache; configurable via ttl or expire_time — 1,024 tokens for Gemini 2.5 Flash / 3 Flash; 4,096 tokens for Gemini 2.5 Pro / 3 Pro

Two non-obvious entries deserve flagging. Anthropic’s 5-minute TTL is short by design — it is meant for an active conversation, not for a feature that fires once an hour. If your traffic is bursty, either pay the 2× write premium for the 1-hour TTL or accept that you will be paying cache-write prices on most requests. And Google’s explicit cache charges a per-hour storage fee on the cached bytes; for sparse workloads, that storage can exceed the savings, which is the only spot in this matrix where caching can be a net loss.

Prompt structure that actually caches

The caching model is prefix-based across all three vendors. That means the order of your prompt is a feature, not a styling choice. The pattern that consistently produces >80% hit rates on production traffic is:

  1. System prompt and tool definitions first. These rarely change. If you can put your system prompt and tool/JSON-schema definitions in a single block at the top, you have already done most of the work.
  2. Fixed few-shot examples next. If you A/B test these, treat each variant as its own cache lane — do not interleave them in one prompt template.
  3. Reference documents (retrieval, code context) third. Cache these only if they will be re-used. For long-document Q&A the same document is hit dozens of times; cache it. For one-shot RAG over different documents per request, do not bother — there is nothing to reuse.
  4. Conversation history fourth. Up to the previous turn, history is cacheable on the next request; the current user turn is not.
  5. Current user message last. This is always unique. Anything you place after it is unreachable from cache, even if it is identical across requests.

This is also why “same prompt, different result” bugs sometimes correlate with caching. A trailing whitespace change or a reordered tool definition will invalidate the prefix and force a fresh prefill at full cost. Treat your system prompt as a versioned artifact, not a string you concatenate at request time.

A two-week rollout plan

  • Week — Action — What you should see by Friday
  • 1 · Mon–Wed — Pick one hot path. Measure: average request size, share of prompt that is identical across requests, and current input-token spend on that path. — A one-page baseline: tokens per request, % stable prefix, monthly input cost.
  • 1 · Thu–Fri — Refactor that path’s prompt into stable-first order. On Anthropic, add explicit cache_control breakpoints after the system block and after any large reference document. On OpenAI, confirm the prefix exceeds 1,024 tokens. On Gemini 2.5+, the implicit cache will activate without code changes. — A prompt template with one or two cache anchors and a unit test that fails if the prefix changes.
  • 2 · Mon–Wed — Ship behind a feature flag. Log usage.cache_creation_input_tokens and usage.cache_read_input_tokens (Anthropic) or the equivalent fields on OpenAI / Gemini. Compute hit rate per route. — A dashboard showing cache-hit rate and the cached vs uncached token mix per request.
  • 2 · Thu–Fri — Compare actual spend against the baseline. Decide which paths to migrate next based on observed hit rate, not theoretical savings. — A short memo: which routes are worth caching, which are not, and why.

Failure modes worth pre-mortem’ing

  • Sparse traffic, short TTL. If your feature fires every twenty minutes and you are on Anthropic’s default 5-minute cache, every call is a cache write. Either switch to the 1-hour cache and accept the 2× write premium, or keep a small warm-up cron that touches the prefix inside the TTL.
  • Prompt drift you did not notice. A frontend that injects the current date into the system prompt looks innocent and invalidates the cache every midnight. Move time-variant content out of the cached prefix.
  • Caching the wrong thing. Caching is input-side only. If your bill is dominated by output tokens — long generations, agent traces — caching helps the latency but the cost gain is small. Look at the input/output split before you celebrate.
  • Quietly locking yourself in. Vendor-specific cache markers in your prompt template make a future migration harder than it has to be. A thin adapter that injects the right cache directive for the current vendor keeps the prompt portable.

Sources

  • Spec-Driven Development for small teams in 2026 — when it pays off, when it’s overkill
  • AI coding assistants for small teams in 2026: a no-hype buyer’s guide
Monday 08:00 — every week

One letter a week,
lasting understanding.

Only essays that don't get scrolled past. No ads, no tracking pixels, no external linkbait — the letter ends inside your inbox.

One-click unsubscribe. No spam.