Mario Alexandre  ·  March 26, 2026  ·  open-source local-model token-savings

From Haiku to Local Model: Zero-Cost Scatter With a 7B GGUF

The Haiku scatter hook saves $1,588/week for me at 38x ROI. But I wanted to take it further — zero API cost for scatter, completely local, running on my own GPU. So I fine-tuned a model. It took 107 seconds. The results are wild.

The Model Specs

Property           Value
Base model         Qwen2.5-7B
Training time      107 seconds
Hardware           RTX 5090
GGUF size          4.7 GB
Inference speed    290 tok/s
API cost           $0 per call
Training examples  ~2,400 (from real scatter outputs)
sinc-LLM: same output format from the local model

x(t) = Σ x(nT) · sinc((t - nT) / T)

How I Generated the Training Data

The fine-tune needed examples of (raw_prompt, sinc_json) pairs. I generated them from real Haiku scatter outputs. Every prompt I sent through the Haiku scatter server over 7 days got logged — both the input and the structured output. 21,194 calls. I filtered for high-quality outputs (valid JSON, all 6 bands populated, CONSTRAINTS band longest), ending up with about 2,400 clean training examples.

# Training data format (JSONL)
{"input": "fix the auth bug",
 "output": "{\"formula\": \"x(t) = ...\", \"T\": \"specification-axis\",
  \"fragments\": [{\"n\":0,\"t\":\"PERSONA\",\"x\":\"...\"},
  {\"n\":1,\"t\":\"CONTEXT\",\"x\":\"...\"},
  {\"n\":2,\"t\":\"DATA\",\"x\":\"...\"},
  {\"n\":3,\"t\":\"CONSTRAINTS\",\"x\":\"...\"},
  {\"n\":4,\"t\":\"FORMAT\",\"x\":\"...\"},
  {\"n\":5,\"t\":\"TASK\",\"x\":\"...\"}]}"}
# ... 2,399 more examples
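The filtering step can be sketched as follows. This is my reconstruction, not the actual filter script: I'm assuming log records with `input`/`output` fields as in the JSONL above, and the quality checks mirror the criteria listed (valid JSON, all six bands populated, CONSTRAINTS the longest band).

```python
import json

BANDS = {"PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK"}

def is_clean(record):
    """Keep only logged scatter calls whose output passes the quality bar."""
    try:
        out = json.loads(record["output"])
        frags = out["fragments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # malformed log entry or invalid JSON output
    by_band = {f.get("t"): f.get("x", "") for f in frags}
    if set(by_band) != BANDS or not all(by_band.values()):
        return False  # all six bands must be present and populated
    # CONSTRAINTS must be the longest band
    return max(by_band, key=lambda b: len(by_band[b])) == "CONSTRAINTS"

def build_dataset(log_lines):
    """Turn raw JSONL log lines into (input, output) training pairs."""
    for line in log_lines:
        rec = json.loads(line)
        if is_clean(rec):
            yield {"input": rec["input"], "output": rec["output"]}
```

Applied to the 21,194 logged calls, a filter like this is what narrows the set down to the ~2,400 clean examples.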

The Training

I used Unsloth for the fine-tune — it's insanely fast on modern GPUs. Qwen2.5-7B as the base, LoRA adapters, 4-bit quantization during training. 107 seconds on the RTX 5090 for the full training run on 2,400 examples.

# Simplified training config
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer

# from_pretrained returns (model, tokenizer); load 4-bit weights for training
model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen2.5-7B", max_seq_length=2048, load_in_4bit=True)
model = FastLanguageModel.get_peft_model(model,
    r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # the ~2,400 filtered (input, output) pairs
    max_seq_length=2048,
    args=TrainingArguments(num_train_epochs=3,
                           per_device_train_batch_size=4,
                           output_dir="outputs"),
)
trainer.train()  # 107 seconds on RTX 5090

Export to GGUF for llama.cpp inference. Q5_K_M quantization gives 4.7GB with negligible quality loss vs. float16.
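The export pipeline looks roughly like this. A hedged sketch, not my exact commands: it assumes the LoRA adapters have been merged into a HF-format checkpoint (`./scatter-7b-merged` is a placeholder path) and uses stock llama.cpp tooling, whose script and binary names may differ by version.

```shell
# Convert the merged HF checkpoint to an f16 GGUF, then quantize to Q5_K_M
# (~4.7 GB for a 7B model, negligible quality loss vs. f16)
python convert_hf_to_gguf.py ./scatter-7b-merged --outfile scatter-7b-f16.gguf
llama-quantize scatter-7b-f16.gguf scatter-7b-Q5_K_M.gguf Q5_K_M

# Serve locally: an OpenAI-compatible endpoint the scatter hook can point at
llama-server -m scatter-7b-Q5_K_M.gguf --port 8080
```

With the server running, the scatter hook just swaps its Haiku API URL for localhost.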

Quality vs. Haiku

Honest comparison: the local model is slightly worse than Haiku on edge cases and very short prompts (1-3 words). For standard prompts (5+ words with some context), quality is essentially identical — both produce valid 6-band sinc JSON with appropriate CONSTRAINTS and FORMAT inference.

The local model has higher variance on CONSTRAINTS band detail — sometimes it generates more generic constraints than Haiku would. But it's running at zero API cost, so I can run multiple scatter calls for important prompts and pick the best output if needed.
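Best-of-N selection is cheap when each call is free. A minimal sketch, with assumed names: `scatter` stands in for whatever function calls the local model, and the scoring heuristic (penalize invalid JSON, reward populated bands and a detailed CONSTRAINTS band) is my own, not from the post.

```python
import json

def score(output_str):
    """Heuristic quality score for one scatter output: invalid JSON scores
    -inf; otherwise reward populated bands plus CONSTRAINTS detail."""
    try:
        frags = json.loads(output_str)["fragments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return float("-inf")
    bands = {f.get("t"): f.get("x", "") for f in frags}
    populated = sum(1 for x in bands.values() if x)
    return populated * 1000 + len(bands.get("CONSTRAINTS", ""))

def best_of_n(prompt, scatter, n=3):
    """Run the zero-cost local scatter n times, keep the best output."""
    candidates = [scatter(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

For a 290 tok/s local model, even n=3 adds only a few seconds, so this mitigates the variance without touching the cost math.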

The Cost Math at Zero Scatter Cost

At $0 per scatter call and my 7-day volume (21,194 prompts), the savings from exchange rate reduction (1.6 vs. 4.2 exchanges/prompt) are $1,631 — essentially the full $1,588 plus the $42 Haiku spend I no longer pay.
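Spelled out as arithmetic (both inputs are rounded figures from the earlier post, which is why the sum lands a dollar off the quoted total):

```python
# Weekly savings once scatter runs locally at $0 per call
savings_from_fewer_exchanges = 1588  # $/week: 1.6 vs. 4.2 exchanges/prompt
haiku_scatter_spend = 42             # $/week: the API bill that goes away
total_weekly_savings = savings_from_fewer_exchanges + haiku_scatter_spend
print(total_weekly_savings)  # 1630
```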

Projected out, that's $6,500+ per month in savings with $0 in scatter costs, a 97% cost reduction from my unstructured baseline. That's the full power of local scatter: the ROI is effectively infinite because the cost denominator is zero.

The GGUF is included in the open-source repo. You need an NVIDIA GPU — RTX 3090 or better for comfortable inference speed. On a 3090 it runs at ~150 tok/s, which is plenty fast for scatter calls. Leave a comment and I'll send the GitHub link.

Try sinc-LLM free — sincllm.com

The GGUF and the fine-tune code are both in the GitHub repo.