Prompt Caching Explained: Save Tokens on Repeated Prompts

My API bill hit $400 in one week. The fix was not just caching. I had to structure my prompts so caching could work. Here is how prompt caching works across OpenAI, Anthropic, and Google, and how sinc-LLM's structured format gets the most cache hits.

What Is Prompt Caching?

Prompt caching saves copies of computed attention states for the start of your prompt. When your next prompt starts the same way, the provider reuses those saved states. It does not recompute them. This cuts latency by 50-80% and cost by up to 90% on the cached part.

One key detail: caching only works on prefixes. The shared part must start at the very beginning. If your prompt changes in the first word, nothing gets cached.

How Caching Works Per Provider

Provider	Cache Mechanism	Min Prefix	Discount	TTL
OpenAI	Automatic prefix caching	1,024 tokens	50% off input	5-10 min
Anthropic	Explicit cache_control blocks	1,024 tokens (Sonnet), 2,048 (Haiku)	90% off cached	5 min
Google (Gemini)	Context caching API	32,768 tokens	75% off cached	Configurable

Why Structured Prompts Cache Better

This is where sinc-LLM helps a lot. When every prompt follows the same 6-band structure, the first bands stay the same across requests. That steady opening is the prefix the cache saves.

Say you run a pipeline that writes 50 product descriptions. With plain prompts, every prompt looks different. Different words up front, different order. The cache never matches. Cache hit rate: near zero.

With sinc-LLM structured prompts, the first 3 bands (PERSONA, CONTEXT, DATA schema) are the same for all 50 requests. Only bands 4 and 5 (FORMAT and TASK) change per product. Cache hit rate: 60-80% of input tokens.

x(t) = Σ x(nT) · sinc((t - nT) / T)

Optimizing Band Order for Caching

The sinc-LLM band order (PERSONA, CONTEXT, DATA, CONSTRAINTS, FORMAT, TASK) is not random. It goes from the most stable parts to the most variable ones:

PERSONA (n=0): Almost never changes inside a pipeline. Cache-friendly.
CONTEXT (n=1): Changes between projects, but stays the same inside one pipeline. Cache-friendly.
DATA (n=2): The schema stays the same, but specific values change. Partly cacheable.
CONSTRAINTS (n=3): Usually the same for one task type. Partly cacheable.
FORMAT (n=4): Stays the same across a pipeline. Cache-friendly.
TASK (n=5): Changes with every request. Not cacheable.

Because stable bands come first, they form a long shared prefix. That gives the cache the most to work with.

Anthropic Cache Control: Practical Example

Anthropic's caching is the strongest because you mark the cache cut point yourself. With sinc-LLM structured prompts, put the cache breakpoint right after CONSTRAINTS (n=3):

system: [
  {"type": "text", "text": "[PERSONA + CONTEXT + DATA + CONSTRAINTS]", "cache_control": {"type": "ephemeral"} },
]
user: "[FORMAT + TASK for this specific request]"

The first 4 bands get cached at a 90% discount. The last 2 bands run fresh each time. On a 2,000-token prompt where 1,400 tokens are in bands 0-3, you save 90% on 1,400 tokens per request.

Real Cost Impact

Here is what my product description pipeline cost before caching:

50 requests times 2,000 tokens equals 100,000 input tokens
Cost at $3 per million tokens: $0.30 per batch

After sinc-LLM structure and Anthropic caching:

First request: 2,000 tokens at full price = $0.006
49 requests: 600 tokens at full price plus 1,400 at 10% = 49 times ($0.0018 plus $0.00042) = $0.109
Total: $0.115 per batch, a 62% savings

At 10 batches per day, that is $5.55 saved per day, or $166 per month. All from prompt structure alone.

The sinc JSON Structure Enables Systematic Caching

{
  "formula": "x(t) = \u03a3 x(nT) \u00b7 sinc((t - nT) / T)",
  "T": "specification-axis",
  "fragments": [
    {"n": 0, "t": "PERSONA", "x": "Expert data scientist with 10 years ML experience"},
    {"n": 1, "t": "CONTEXT", "x": "Building a recommendation engine for an e-commerce platform"},
    {"n": 2, "t": "DATA", "x": "Dataset: 2M user interactions, 50K products, sparse matrix"},
    {"n": 3, "t": "CONSTRAINTS", "x": "Must use collaborative filtering. Latency under 100ms. No PII in logs. Python 3.11+. Must handle cold-start users with content-based fallback"},
    {"n": 4, "t": "FORMAT", "x": "Python module with type hints, docstrings, and pytest tests"},
    {"n": 5, "t": "TASK", "x": "Implement the recommendation engine with train/predict/evaluate methods"}
  ]
}

This structure makes caching a natural result of good prompt engineering, not an extra step. Structure your prompts with sinc-LLM, turn on provider caching, and your costs will drop.

// Production AI Engineering

Build AI systems that hold up in production.

sinc-LLM designs, audits, and stabilises production AI infrastructure: from vendor evaluation and cost accountability to incident controls and MCP architecture.

See what we do →