2026-06-05Updated: 2026-07-23By M.R.

Prompt Caching Across Claude, GPT, and Gemini: Architecture Patterns That Actually Work in Production

prompt caching Claude API OpenAI GPT Gemini LLM cost optimization

Three caching implementations. Three completely different cost profiles. Which one fits your stack?

Prompt caching has become the single most underused cost lever in LLM production. Anthropic released Claude's prompt caching in late 2024, OpenAI followed weeks later, and now every frontier model offers some version. But "caching exists" doesn't mean "all caching works the same way."

The implementation differences are not cosmetic. They affect your cache hit rate, your latency math, your TTL strategy, and whether your cost optimization actually survives contact with production. This is what separates a 65% cost reduction from a 15% one.

How Each Implementation Stores and Retrieves Cache

Claude's approach: Explicit breakpoints with 5-minute TTL (now longer for agents)

Claude uses workspace-level isolation and requires you to mark cacheable blocks explicitly with cache control. You decide where the cache boundary sits. Cached entries have a minimum lifetime of 5 minutes (standard) or 1 hour (extended), after which they are promptly, though not immediately, deleted. This matters because Anthropic quietly changed the prompt cache TTL from 60 minutes down to 5 minutes in early 2026. For many production workloads, this single change increased effective API costs by 30–60%.

The tradeoff: You get fine-grained control over what gets cached, but you also have to think about it. Cache hits require 100% identical prompt segments, including all text and images up to and including the block marked with cache control. There is no fuzzy matching, no semantic similarity—just byte-for-byte matching.

OpenAI's approach: Automatic, stateless routing with 5–10 minute in-memory TTL

Prompt Caching works automatically on all your API requests (no code changes required) and has no additional fees associated with it. Prompt Caching is enabled for all recent models, gpt-4o and newer. You don't opt in per block. The API caches the longest prefix of a prompt that has been previously computed, starting at 1,024 tokens and increasing in 128-token increments.

The mechanism is stateless routing: OpenAI routes API requests to servers that recently processed the same prompt, making it cheaper and faster than processing a prompt from scratch. When using the in-memory policy, cached prefixes generally remain active for 5 to 10 minutes of inactivity, up to a maximum of one hour. For longer retention, Extended prompt cache retention keeps cached prefixes active for longer, up to a maximum of 24 hours.

The win: Zero implementation effort. The catch: You're betting that your request lands on the same physical server that holds your cache. Caching only works if two requests share the same prefix and land on the same machine. Take advantage of the optional parameter prompt_cache_key for traffic that shares common prefixes to improve that routing.

Gemini's approach: KV cache with extended retention via Vertex AI

Google integrates prompt caching directly into Vertex AI infrastructure. Unlike Claude's explicit model-level caching and OpenAI's request-level routing, Gemini stores cached key-value tensors. By adding portions of your context to a cache, the model can use the cache to skip recomputation of inputs. Prompt caching can help when you have workloads with long and repeated contexts that are frequently reused for multiple queries. For example, if you have a chatbot where users can upload documents and ask questions about them, it can be time consuming for the model to process the document every time the user provides input. With prompt caching, you can cache the document so that future queries containing the document don't need to reprocess it.

Most models support a 5-minute TTL, while Claude Opus 4.5, Claude Haiku 4.5, and Claude Sonnet 4.5 also support an extended 1-hour TTL option. Minimum cache sizes apply: Claude 3.7 Sonnet requires at least 1,024 tokens per cache checkpoint, while Claude Opus 4.5, Claude Opus 4.6, Claude Haiku 4.5, and Claude Sonnet 4.5 require at least 4,096 tokens per cache checkpoint.

Cost Mechanics: Read Price vs. Write Cost

This is where implementation differences turn into dollars. Here's the real math.

Provider	Cached Input Discount	Write Cost Multiplier	Break-Even Reads	TTL
Claude (default)	~90% (10% of standard price)	1× (write at normal price)	3 reads in 5 min	5 minutes
Claude (1-hour option)	~90% (10% of standard price)	2× (write premium)	5 reads in 60 min	1 hour
OpenAI (in-memory)	~50% discount	No explicit write cost	2–3 reads in 5–10 min	5–10 minutes
OpenAI (extended 24h)	~50% discount	No explicit write cost	2–3 reads in 24 hours	24 hours

The economics work if you have enough reads to amortize the write premium. Rule of thumb: 3+ reads within the TTL for 5-minute cache, 5+ reads for 1-hour cache. That's the engineering version. The business version: if your cache TTL is too short, your cache becomes a write-only money sink.

When Caching Breaks (And How Each Platform Fails Differently)

Claude's failure modes: Subtle cache invalidation

A change to the system prompt invalidates everything, because all later content now sits behind a different prefix. Switching models recomputes the entire request even when the content is identical. Common killers: Timestamps in cached content. "Current time: 2026-04-17T14:32:15Z" in your system prompt invalidates the cache on every request. Move timestamps out of the cached prefix or truncate to the day.

Another silent killer: User-specific content in the prefix. Putting "You are helping {user.name} who works at {user.company}" in the cached system prefix means every user gets a cache miss. Move user-specific content to the user message or split into a per-user cache with longer TTL.

OpenAI's failure modes: Routing misses

After the first 1,024 tokens, cache hits occur for every 128 additional identical tokens. A single character difference in the first 1,024 tokens results in a cache miss, which is characterized by a cached_tokens value of 0. Identical prefix + landing on a different server = cache miss. If requests for the same prefix and prompt_cache_key combination exceed a certain rate (approximately 15 requests per minute), some requests overflow and get routed to extra machines, reducing cache effectiveness.

The implication: high-throughput workloads (batch jobs, scheduled agents) benefit most from explicit prompt_cache_key routing. Low-throughput interactive workloads can't rely on automatic routing to hit the cache consistently.

Gemini's failure modes: Token threshold requirements

Minimum cache sizes are model-specific and enforce discipline. You can't cache a 500-token fragment. The overhead of creating and managing that cache entry isn't worth the upside. This forces you to think about what's actually worth caching at scale.

Architectural Patterns: Three Production Examples

Pattern	Best Fit	Why	Recommended Provider
Multi-turn agent with stable instructions	Agentic workflows, long system prompts, stable tool definitions	Cache stable, reusable content like system instructions, background information, large contexts, or frequent tool definitions. These stay identical across turns.	Claude (1-hour TTL)
RAG with repeated context	Document Q&A, knowledge base chats, retrieval pipelines	Same document retrieved, different queries. Cached document doesn't change per query.	OpenAI (extended 24h) or Claude (1-hour TTL)
Batch processing with bursts	Scheduled jobs, bulk document processing, daily analytics runs	If you process documents in bursts with gaps longer than 5 minutes, your cache expires between runs. Every burst starts cold.	OpenAI (extended 24h, with prompt_cache_key for routing stability)
Few-shot classification	Stable examples + variable input	Examples don't change; only the input token stream changes. Pure prefix-caching win.	OpenAI or Claude (either works; OpenAI simpler)

Real-World Cost Impact

Let's ground this. A typical agent request in 2026 looks like this: 8,000 tokens of system prompt defining the agent's role, guardrails, and output format · 12,000 tokens of tool schemas (every MCP server, every custom function)

Without caching, you pay the full input rate on all 24,200 tokens every turn. With caching, you pay the write-rate once on the static 24,000 tokens, then only $0.30/MTok (10% of base) on every subsequent read, plus full price on the 200 new tokens. On a conversation of 20 turns the total input cost drops from roughly $1.45 to $0.21 — a 7× reduction, and that multiplier grows with both prompt size and turn count.

That 7× number assumes perfect cache hit rate. In production, you'll see 60–80% hit rates on agent workloads if you structure prompts carefully. On batch jobs with scheduling guarantees (same server, same time window), you can reliably hit 85%+.

The Practical Decision Tree

Use Claude if:

You're building multi-turn agents with stable tool sets and long system prompts
You can afford the 5-minute default TTL (or pay for 1-hour TTL when cache hit rate justifies it)
You want explicit control over what gets cached (cache breakpoints per content block)
Your workload can handle the post-TTL cache miss penalty

Use OpenAI if:

You want zero implementation overhead (automatic caching, no code changes)
Your workload has stable, long prefixes that repeat frequently
You can use prompt_cache_key for improved routing. Flex can achieve higher cache hit rates in some workloads. It is particularly well suited for prototyping or production workloads that are not inference-intensive but still benefit from cost optimization.
You're running batch jobs where you control request timing and routing

Use Gemini if:

You're embedded in Google Cloud and want native Vertex AI integration
You need longer TTLs (1 hour+) by default and don't want to pay write premiums
Your document-heavy workloads (RAG, knowledge bases) map cleanly to KV cache semantics

What This Means for Your Team

Prompt caching isn't a free optimization. It requires architectural thinking: which content is stable, how long is your cache TTL, what's your actual hit rate versus theoretical maximum, and do you have enough request volume to break even on write costs?

Prompt caching (both automatic and explicit) is ZDR eligible. This matters for compliance-sensitive workloads. Anthropic does not store the raw text of your prompts or Claude's responses. KV (key-value) cache representations and cryptographic hashes of cached content are held in memory only and are not stored at rest.

The real leverage point isn't choosing between Claude, OpenAI, and Gemini. It's auditing your prompts for cache-killers: dynamic content in the prefix, model switching, inconsistent whitespace, or tool definitions that change per request. Fix those structural issues first. Then pick the caching flavor that matches your TTL and implementation overhead tolerance.

If you're not measuring cache hit rate in your logs today, you're likely leaving 40–60% cost savings on the table. Start there.

Sources

What June 2026 AI Model Releases Actually Tell Us—And What They Don't

Why Claude's Structured Output Schema Compilation Has Hard Limits: Understanding Grammar Complexity Tradeoffs in Production AI

Connector-First, Pixels-Second: How Claude's Tool Architecture Shapes Real-World Automation