Mario Alexandre  ·  March 26, 2026  ·  token-savings auto-scatter llm-costs

The $42 Hack That Saved Me $1,588

I want to show you some simple math. It is about a tool I built called the auto-scatter hook. The numbers are almost too good to believe.

38x
ROI — every $1 spent on Haiku scatter saves $38 in main model costs

The Math

Here is how it works. I send my prompts through Claude Haiku first. Haiku is the small, cheap AI model. It reads my prompt and breaks it into 6 parts: PERSONA, CONTEXT, DATA, CONSTRAINTS, FORMAT, and TASK. It sends those parts back as JSON. That JSON goes in as extra context before the big, expensive model ever reads my original prompt.

Cost of one Haiku scatter call: $0.002.

Value saved per scatter call in avoided clarification exchanges: $0.08.

Net: $0.078 saved per call. Every single time.

Over 21,194 prompts in 7 days:
Haiku spend: $42.39
Main model savings from reduced exchange rate: $1,588.56
Net gain: $1,546.17

Why This Works

The expensive model (Claude Sonnet or whatever you use) costs a lot per token. But the per-token price is not what hurts most. The real problem is the number of back-and-forth exchanges. When the model does not have enough context to answer in one shot, it asks a question. You answer. It asks another. You answer again. Each exchange adds output tokens. It also makes the input context bigger for every future exchange in that same conversation.

Before auto-scatter, I got 4.2 assistant responses for every prompt I sent. After auto-scatter, that dropped to 1.6. That is a drop of 2.6 exchanges per prompt. Multiply that by 21,194 prompts, count the output tokens and the compounded input tokens, and you get the $1,588.56 in savings.

The Haiku scatter call stops the back-and-forth loop. It gives the expensive model the full picture before that model writes even one token. The expensive model never needs to ask a question. It already has what it needs.

sinc-LLM — signal reconstruction from 6 frequency bands
x(t) = Σ x(nT) · sinc((t - nT) / T)

What the Scatter Actually Does

When I type something like "fix the payment webhook", the Haiku scatter reads it and figures out:

{
  "formula": "x(t) = Σ x(nT) · sinc((t - nT) / T)",
  "T": "specification-axis",
  "fragments": [
    {"n": 0, "t": "PERSONA", "x": "Senior backend engineer familiar with webhook architectures"},
    {"n": 1, "t": "CONTEXT", "x": "Working in a FastAPI codebase with Stripe webhook endpoints"},
    {"n": 2, "t": "DATA", "x": "Webhook validation, event parsing, idempotency patterns"},
    {"n": 3, "t": "CONSTRAINTS", "x": "Minimal change footprint. No schema changes. Must pass existing tests."},
    {"n": 4, "t": "FORMAT", "x": "Code diff with brief inline comments"},
    {"n": 5, "t": "TASK", "x": "Identify and fix the root cause of the payment webhook failure"}
  ]
}

That JSON goes in as system context. The main model sees my original "fix the payment webhook" message. It also sees this fully structured breakdown of it. The model knows who it is, what the situation is, what the limits are, and what format to use. It just does the job. No back-and-forth.

The Local Model Option

The $42 in Haiku calls over 7 days is already a great deal. But I went further. I fine-tuned a local 7B model to do the same scatter job. The model is Qwen2.5-7B, trained on scatter examples. It runs at 290 tok/s on my RTX 5090. The GGUF file is 4.7GB. Training took just 107 seconds.

At zero API cost for scatter, the monthly savings from reduced exchanges on my workflow projects comes to $1,500+/month. That is a 97% cost reduction from a scatter layer that costs $0 to run (if you have a local GPU).

If you do not have a local GPU, Haiku scatter still gives you a 61% cost reduction and a 38x ROI. Both options are open source. Leave a comment and I will share the links.

// Production AI Engineering

Build AI systems that hold up in production.

sinc-LLM designs, audits, and stabilises production AI infrastructure: from vendor evaluation and cost accountability to incident controls and MCP architecture.

See what we do →