Mario Alexandre  ·  March 26, 2026  ·  token-savings auto-scatter llm-costs

The $42 Hack That Saved Me $1,588

I want to show you the simplest piece of math I've encountered in building AI systems. It's the math behind the auto-scatter hook I built, and it's almost embarrassingly good.

38x
ROI — every $1 spent on Haiku scatter saves $38 in main model costs

The Math

Here's the exact calculation. I run my prompts through Claude Haiku first. Haiku is the tiny, cheap model. It takes my raw prompt, decomposes it into 6 structured bands (PERSONA, CONTEXT, DATA, CONSTRAINTS, FORMAT, TASK), and returns that as JSON. That JSON gets injected as system context before the main, expensive model sees the original prompt.
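The hook itself is a thin wrapper around one cheap call. Here is a minimal sketch of the scatter request, assuming the Anthropic Messages API payload shape; the model id and system instruction are illustrative stand-ins, not the exact ones I run:

```python
# Build the Haiku scatter request. The dict mirrors the Anthropic Messages
# API shape; send it with the official SDK or any HTTP client.

SCATTER_BANDS = ["PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK"]

def build_scatter_request(raw_prompt: str) -> dict:
    """Ask the small model to decompose a raw prompt into 6 bands, as JSON."""
    system = (
        "Decompose the user's prompt into 6 bands: "
        + ", ".join(SCATTER_BANDS)
        + '. Respond with JSON only: {"fragments": '
        '[{"n": 0, "t": "PERSONA", "x": "..."}, ...]}'
    )
    return {
        "model": "claude-haiku-4-5",  # illustrative model id
        "max_tokens": 512,            # scatter output is small, which is why it's cheap
        "system": system,
        "messages": [{"role": "user", "content": raw_prompt}],
    }
```

The returned JSON is then parsed and injected as system context before the main model ever sees the prompt.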

Cost of one Haiku scatter call: $0.002.

Value saved per scatter call in avoided clarification exchanges: about $0.075 on average.

Net: roughly $0.073 saved per call.

Over 21,194 prompts in 7 days:
Haiku spend: $42.39
Main model savings from reduced exchange rate: $1,588.56
Net gain: $1,546.17
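The totals reduce to two multiplications. A quick sanity check on the figures above (the 38x in the banner is the savings-to-spend ratio, rounded):

```python
# Reproduce the 7-day totals from the per-call figures.
prompts = 21_194
haiku_cost_per_call = 0.002      # $ per Haiku scatter call
main_savings_total = 1_588.56    # $ saved on the main model over 7 days

haiku_spend = prompts * haiku_cost_per_call
net_gain = main_savings_total - haiku_spend
roi = main_savings_total / haiku_spend
saved_per_call = main_savings_total / prompts

print(f"Haiku spend: ${haiku_spend:,.2f}")        # $42.39
print(f"Net gain: ${net_gain:,.2f}")              # $1,546.17
print(f"ROI: {roi:.1f}x")                         # ~37.5x
print(f"Saved per call: ${saved_per_call:.4f}")   # ~$0.075
```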

Why This Works

The expensive model — Claude Sonnet or whatever you're using — charges a lot per token. But it's not the per-token price that kills you. It's the number of exchanges. When the model doesn't have enough context to answer your prompt in one shot, it asks a clarifying question. You respond. It asks another. You respond. Each exchange generates output tokens AND compounds the input context for all future exchanges in that conversation.

My baseline exchange rate was 4.2 assistant responses per user prompt. After auto-scatter, it dropped to 1.6. That 2.6-exchange reduction per prompt, multiplied by 21,194 prompts, across output AND compounded input tokens, is where the $1,588.56 came from.
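A back-of-envelope decomposition of that figure. The per-exchange cost below is implied by my totals rather than measured directly, but it shows the shape of the savings:

```python
# Where the $1,588.56 comes from: avoided exchanges x cost per exchange.
prompts = 21_194
baseline_rate = 4.2     # assistant responses per user prompt, before auto-scatter
scattered_rate = 1.6    # after auto-scatter
total_savings = 1_588.56

avoided = (baseline_rate - scattered_rate) * prompts   # exchanges avoided over 7 days
cost_per_exchange = total_savings / avoided            # implied avg $ per avoided exchange

print(f"Avoided exchanges: {avoided:,.0f}")                    # 55,104
print(f"Implied cost per exchange: ${cost_per_exchange:.4f}")  # ~$0.029
```

About three cents per avoided exchange, because each one carries both its own output tokens and the re-sent, compounding input context.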

The Haiku scatter call prevents the clarification loop by giving the expensive model the full picture before it generates its first token. It never needs to ask — it already knows.

sinc-LLM — signal reconstruction from 6 frequency bands
x(t) = Σ x(nT) · sinc((t - nT) / T)
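The banner formula is the classic Whittaker–Shannon interpolation: a band-limited signal can be rebuilt from its discrete samples. A small self-contained demo of the formula itself (pure signal math, separate from the prompt pipeline):

```python
import math

def sinc(x: float) -> float:
    # sinc(x) = sin(pi x) / (pi x), with sinc(0) = 1
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def reconstruct(samples: list[float], T: float, t: float) -> float:
    # x(t) = sum_n x(nT) * sinc((t - nT) / T)
    return sum(x_n * sinc((t - n * T) / T) for n, x_n in enumerate(samples))

# A 1 Hz sine sampled at 10 Hz (well above Nyquist), evaluated off-grid.
T = 0.1
samples = [math.sin(2 * math.pi * (n * T)) for n in range(201)]
t = 10.05  # midway between two sample instants
approx = reconstruct(samples, T, t)
exact = math.sin(2 * math.pi * t)
print(round(approx, 3), round(exact, 3))
```

With a finite sample window the reconstruction is approximate, but the off-grid value already matches to a few decimal places. The analogy sinc-LLM draws: six "samples" (the bands) are enough to reconstruct the full intent.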

What the Scatter Actually Does

When I type something like "fix the payment webhook", the Haiku scatter reads that and infers:

{
  "formula": "x(t) = Σ x(nT) · sinc((t - nT) / T)",
  "T": "specification-axis",
  "fragments": [
    {"n": 0, "t": "PERSONA", "x": "Senior backend engineer familiar with webhook architectures"},
    {"n": 1, "t": "CONTEXT", "x": "Working in a FastAPI codebase with Stripe webhook endpoints"},
    {"n": 2, "t": "DATA", "x": "Webhook validation, event parsing, idempotency patterns"},
    {"n": 3, "t": "CONSTRAINTS", "x": "Minimal change footprint. No schema changes. Must pass existing tests."},
    {"n": 4, "t": "FORMAT", "x": "Code diff with brief inline comments"},
    {"n": 5, "t": "TASK", "x": "Identify and fix the root cause of the payment webhook failure"}
  ]
}

That JSON gets injected as system context. The main model sees my original "fix the payment webhook" message PLUS this fully structured interpretation of it. It knows who it is, what the context is, what the constraints are, what format to use. It just does the task. No back-and-forth.
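The injection step is just string assembly: the scatter fragments become a system preamble, and the original message rides along unchanged. A minimal sketch, with function and header names of my own choosing rather than a fixed spec:

```python
import json

def build_system_context(scatter: dict) -> str:
    """Render scatter fragments into a system-context block for the main model."""
    lines = ["Interpreted request (auto-scatter):"]
    for frag in scatter["fragments"]:
        lines.append(f"{frag['t']}: {frag['x']}")
    return "\n".join(lines)

# Two fragments from the example above, as the Haiku call would return them.
scatter = json.loads("""
{"fragments": [
  {"n": 0, "t": "PERSONA", "x": "Senior backend engineer familiar with webhook architectures"},
  {"n": 5, "t": "TASK", "x": "Identify and fix the root cause of the payment webhook failure"}
]}
""")

system_context = build_system_context(scatter)
# The main model then receives: system=system_context,
# messages=[{"role": "user", "content": "fix the payment webhook"}]
print(system_context)
```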

The Local Model Option

The $42 in Haiku calls over 7 days is already great value. But I took it further and fine-tuned a local 7B model to do the same scatter. Qwen2.5-7B, trained on scatter examples, runs at 290 tok/s on my RTX 5090. The GGUF is 4.7GB. Training took 107 seconds.

At zero API cost for scatter, the savings from the reduced exchange rate on my workflow project to $1,500+/month. That's a 97% cost reduction from a $0-marginal-cost scatter layer, if you have a local GPU.

If you don't, Haiku scatter still gives you a 61% reduction at a 38x ROI. Both options are open source. Leave a comment and I'll share the links.

Try sinc-LLM free — sincllm.com

Open source. No signup required to read the spec.