Mario Alexandre  ·  March 26, 2026  ·  open-source local-model token-savings

From Haiku to Local Model: Zero-Cost Scatter With a 7B GGUF

The Haiku scatter hook saves me $1,588 per week at a 38x return. I wanted to go further. I wanted zero API cost for scatter, running fully on my own GPU. So I fine-tuned a model. It took 107 seconds. The results are impressive.

The Model Specs

PropertyValue
Base modelQwen2.5-7B
Training time107 seconds
HardwareRTX 5090
GGUF size4.7GB
Inference speed290 tok/s
API cost$0 per call
Training examples~2,400 (from real scatter outputs)
sinc-LLM — same output format from local model
x(t) = Σ x(nT) · sinc((t - nT) / T)

How I Generated the Training Data

The fine-tune needed pairs of (raw_prompt, sinc_json) examples. I made them from real Haiku scatter outputs. Every prompt I sent through the Haiku scatter server over 7 days got logged. That includes the input and the structured output. 21,194 calls total. I kept only high-quality outputs (valid JSON, all 6 bands filled, CONSTRAINTS band the longest). That left about 2,400 clean training examples.

# Training data format (JSONL)
{"input": "fix the auth bug",
 "output": "{\"formula\": \"x(t) = ...\", \"T\": \"specification-axis\",
  \"fragments\": [{\"n\":0,\"t\":\"PERSONA\",\"x\":\"...\"},
  {\"n\":1,\"t\":\"CONTEXT\",\"x\":\"...\"},
  {\"n\":2,\"t\":\"DATA\",\"x\":\"...\"},
  {\"n\":3,\"t\":\"CONSTRAINTS\",\"x\":\"...\"},
  {\"n\":4,\"t\":\"FORMAT\",\"x\":\"...\"},
  {\"n\":5,\"t\":\"TASK\",\"x\":\"...\"}]}"}
# ... 2,399 more examples

The Training

I used Unsloth for the fine-tune. It is very fast on modern GPUs. I started with Qwen2.5-7B as the base model, used LoRA adapters, and trained with 4-bit quantization. The full training run on 2,400 examples finished in 107 seconds on the RTX 5090.

# Simplified training config
model = FastLanguageModel.from_pretrained("Qwen/Qwen2.5-7B")
model = FastLanguageModel.get_peft_model(model,
    r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=2048,
    num_train_epochs=3,
    per_device_train_batch_size=4,
)
trainer.train()  # 107 seconds on RTX 5090

I exported the model to GGUF format for llama.cpp inference. Q5_K_M quantization makes the file 4.7GB. Quality loss compared to float16 is very small.

Quality vs. Haiku

Honest comparison: the local model is slightly worse than Haiku on edge cases and very short prompts (1-3 words). For normal prompts (5 or more words with some context), quality is about the same. Both produce valid 6-band sinc JSON with correct CONSTRAINTS and FORMAT output.

The local model sometimes produces more generic CONSTRAINTS band detail than Haiku would. But it runs at zero API cost. So I can run several scatter calls for important prompts and pick the best one.

The Cost Math at Zero Scatter Cost

At $0 per scatter call and my 7-day volume of 21,194 prompts, the savings from exchange rate reduction (1.6 vs. 4.2 exchanges per prompt) are $1,631. That is the full $1,588 plus the $42 Haiku spend I no longer pay.

Monthly projection: $1,500 or more in savings, $0 in scatter costs. That is a 97% cost reduction from my unstructured baseline. The ROI is unlimited because the cost per call is zero.

The GGUF is included in the open-source repo. You need an NVIDIA GPU, RTX 3090 or better, for comfortable inference speed. On a 3090 it runs at around 150 tok/s, which is fast enough for scatter calls. Leave a comment and I will send you the GitHub link.

// Production AI Engineering

Build AI systems that hold up in production.

sinc-LLM designs, audits, and stabilises production AI infrastructure: from vendor evaluation and cost accountability to incident controls and MCP architecture.

See what we do →