You have probably looked at your Anthropic or OpenAI bill and asked: why is this so expensive? You have probably tried the obvious things. You switched to smaller models. You cached responses. You shortened your system prompts. Maybe you even tried writing better prompts for a week before giving up.
None of that fixed it. The reason is simple: the actual problem is not what you think it is.
The problem is your exchange rate. Most people have never heard of this as a metric.
Exchange rate, in the way I use it here, is the number of exchanges it takes to finish one task: how many model responses you go through for each thing you set out to get done. In a perfect world, every task takes one prompt and gets one good answer. That is an exchange rate of 1.0.
In the real world, the model often does not have enough information to answer correctly on the first try. So it asks a clarifying question. Or it answers the wrong question and you have to correct it. Or it does half the task and you prompt again for the rest. Each extra exchange multiplies your cost.
I measured my own exchange rate over 7 days across 21,194 prompts. It was 4.2. Every task that should cost 1 unit of tokens was actually costing 4.2 units. It gets worse: each exchange also carries the full conversation history forward. That compounds the input token cost even more.
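To make the metric concrete, here is a minimal sketch of how an exchange rate could be computed from a conversation log. The log format (pairs of a task id and a role) and the function name are assumptions for illustration, not the tooling behind the numbers above.

```python
from collections import Counter

def exchange_rate(turns):
    """Mean number of assistant responses per task.

    turns: iterable of (task_id, role) pairs, role being "user" or "assistant".
    This log shape is hypothetical; adapt it to whatever your logs contain.
    """
    turns = list(turns)
    tasks = {task_id for task_id, _ in turns}
    if not tasks:
        raise ValueError("empty log")
    responses = Counter(task_id for task_id, role in turns if role == "assistant")
    return sum(responses.values()) / len(tasks)

# Toy log: task t1 is answered in one shot, task t2 needs three responses.
log = [
    ("t1", "user"), ("t1", "assistant"),
    ("t2", "user"), ("t2", "assistant"),
    ("t2", "user"), ("t2", "assistant"),
    ("t2", "user"), ("t2", "assistant"),
]
rate = exchange_rate(log)
print(rate)  # 2.0
```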
People constantly blame the model. They say things like: Claude is too wordy. GPT-4 asks too many questions. Gemini does not follow instructions.
The model is doing exactly what it should do. When it lacks information, it asks. That is correct behavior. The real problem is that your prompts are not giving it enough to work with.
A structured prompt fixes this. It tells the model who it should be, what the context is, what data matters, what limits apply, what format to use, and what the task actually is. With all of that in one prompt, the model gets everything it needs up front. You get your answer in one shot.
Here is why the cost grows so fast. Each exchange in a conversation carries the full history of every exchange before it. So if you need 10 exchanges to finish one task, the last few exchanges are paying input tokens for all 10 turns of history.
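A rough sketch of that compounding, assuming every prompt and every response has a fixed size. The token counts are made-up round numbers chosen only to show the shape of the growth.

```python
def cumulative_input_tokens(n_exchanges, prompt_tokens, response_tokens):
    """Total input tokens billed across a conversation of n_exchanges turns.

    Assumes the full history is resent on every turn and that each prompt
    and response has a constant size (a simplification).
    """
    total = 0
    history = 0
    for _ in range(n_exchanges):
        history += prompt_tokens    # the new user message joins the context
        total += history            # the whole context is billed as input
        history += response_tokens  # the reply joins the context for the next turn
    return total

one_shot = cumulative_input_tokens(1, prompt_tokens=200, response_tokens=300)
ten_turns = cumulative_input_tokens(10, prompt_tokens=200, response_tokens=300)
print(one_shot, ten_turns)  # 200 24500
```

Under these assumptions, ten exchanges cost far more than ten times the input of one exchange, because the total grows with the square of the number of turns.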
In my measurement: 12.9 million output tokens in 7 days. 7.14 billion total tokens across the whole system. Output tokens cost more per token. But the huge volume of input tokens from compounding context is where the surprise bill comes from.
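Lowering my exchange rate from 4.2 to 1.6 saved me $1,588.56 over 7 days. That is a 61% reduction. My projected cost at the old rate was $2,597.96. My actual cost at the new rate was $967.01. The gap between those two figures is $1,630.95; subtracting the cost of the preprocessing hook described below (21,194 calls at $0.002 each, about $42.39) leaves the $1,588.56 net saving.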
Many people try to cut costs by switching to cheaper models. That helps a little. But it misses the main problem. If your exchange rate is 4.2, switching from Claude Sonnet to Claude Haiku lowers your per-token cost. It does not lower your exchange count. You still get 4.2 responses per prompt. They are just cheaper per token.
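The two levers can be separated with a toy cost model. This is a sketch under a deliberate simplification: every exchange is treated as the same size, so compounding history is ignored, and the cheaper tier's price ratio is a made-up number, not real vendor pricing.

```python
def relative_task_cost(exchange_rate, price_ratio=1.0):
    """Cost of one task, relative to one exchange at the baseline price.

    Simplification: all exchanges are the same size, so the compounding
    of conversation history is not modelled here.
    """
    return exchange_rate * price_ratio

HYPOTHETICAL_CHEAP_TIER = 0.5  # made-up price ratio, not real pricing

for label, price in (("baseline tier", 1.0), ("cheaper tier", HYPOTHETICAL_CHEAP_TIER)):
    before = relative_task_cost(4.2, price)
    after = relative_task_cost(1.6, price)
    print(f"{label}: {before:.2f} -> {after:.2f} ({1 - after / before:.0%} lower)")
```

In this model, going from 4.2 to 1.6 exchanges cuts cost by the same fraction on either tier, which is why a cheaper model and a lower exchange rate are independent levers rather than substitutes.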
The real fix is lowering your exchange rate. To do that, give the model complete information on the very first prompt, every time. That takes a structural change in how you build prompts. Switching price tiers does not solve it.
I built an auto-scatter hook. It is a tiny server that intercepts every prompt before it reaches the main model. It calls Claude Haiku to split the prompt into 6 structured bands: PERSONA, CONTEXT, DATA, CONSTRAINTS, FORMAT, and TASK. Then it injects that structure as system context. The hook costs $0.002 per call. It saves about $0.08 per call in avoided exchange overhead ($1,630.95 across 21,194 prompts, or roughly $0.077 before rounding). That is about a 38x return.
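The core of such a hook can be sketched in a few lines. This is not the open-source code mentioned below: the function names are hypothetical, the small-model call is replaced by a pluggable `split_fn` that is assumed to return a JSON object keyed by band name, and the server and network layers are left out.

```python
import json
from typing import Callable, Dict

BANDS = ("PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK")

def scatter(prompt: str, split_fn: Callable[[str], str]) -> Dict[str, str]:
    """Split a raw prompt into the six bands.

    split_fn stands in for the small-model call and is assumed to return
    a JSON object keyed by band name. Anything unparseable is ignored.
    """
    try:
        parsed = json.loads(split_fn(prompt))
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    bands = {name: str(parsed.get(name, "")).strip() for name in BANDS}
    if not bands["TASK"]:
        bands["TASK"] = prompt  # never lose the original request
    return bands

def to_system_context(bands: Dict[str, str]) -> str:
    """Render the non-empty bands as a block to inject as system context."""
    return "\n".join(f"{name}: {bands[name]}" for name in BANDS if bands[name])

def fake_split(prompt: str) -> str:
    """Hypothetical stub so the sketch runs without any API call."""
    return json.dumps({"PERSONA": "Senior Python reviewer",
                       "FORMAT": "Bulleted list",
                       "TASK": prompt})

context = to_system_context(scatter("Review this function", fake_split))
print(context)
```

The fallback to the raw prompt is a design choice in this sketch: if the splitter fails, the main model still receives the original request instead of an empty task.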
The code is open source. Leave a comment and I will drop the link. Setup takes about 15 minutes. If your bill is anywhere near what mine was, it pays for itself in the first hour of use.