The Haiku scatter hook saves $1,588/week for me at 38x ROI. But I wanted to take it further — zero API cost for scatter, completely local, running on my own GPU. So I fine-tuned a model. It took 107 seconds. The results are wild.
| Property | Value |
|---|---|
| Base model | Qwen2.5-7B |
| Training time | 107 seconds |
| Hardware | RTX 5090 |
| GGUF size | 4.7GB |
| Inference speed | 290 tok/s |
| API cost | $0 per call |
| Training examples | ~2,400 (from real scatter outputs) |
The fine-tune needed examples of (raw_prompt, sinc_json) pairs. I generated them from real Haiku scatter outputs. Every prompt I sent through the Haiku scatter server over 7 days got logged — both the input and the structured output. 21,194 calls. I filtered for high-quality outputs (valid JSON, all 6 bands populated, CONSTRAINTS band longest), ending up with about 2,400 clean training examples.
```jsonl
# Training data format (JSONL) -- one object per line; pretty-printed here for readability
{"input": "fix the auth bug",
 "output": "{\"formula\": \"x(t) = ...\", \"T\": \"specification-axis\",
   \"fragments\": [{\"n\":0,\"t\":\"PERSONA\",\"x\":\"...\"},
                   {\"n\":1,\"t\":\"CONTEXT\",\"x\":\"...\"},
                   {\"n\":2,\"t\":\"DATA\",\"x\":\"...\"},
                   {\"n\":3,\"t\":\"CONSTRAINTS\",\"x\":\"...\"},
                   {\"n\":4,\"t\":\"FORMAT\",\"x\":\"...\"},
                   {\"n\":5,\"t\":\"TASK\",\"x\":\"...\"}]}"}
# ... 2,399 more examples
```
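The filtering pass described above (valid JSON, all 6 bands populated, CONSTRAINTS longest) can be sketched as a small validator. This is my own sketch, not code from the repo, and it interprets "longest" strictly:

```python
import json

BANDS = ["PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK"]

def is_clean_example(raw_output: str) -> bool:
    """Keep only scatter outputs that parse as JSON, have all six bands
    populated, and whose CONSTRAINTS band is strictly the longest."""
    try:
        sinc = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    frags = {f.get("t"): f.get("x", "") for f in sinc.get("fragments", [])}
    # All six bands present and non-empty
    if any(not frags.get(band) for band in BANDS):
        return False
    # CONSTRAINTS strictly longer than every other band
    return all(len(frags["CONSTRAINTS"]) > len(frags[b])
               for b in BANDS if b != "CONSTRAINTS")
```

Running the 21,194 logged outputs through a check like this is what cuts the set down to the ~2,400 clean pairs.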
I used Unsloth for the fine-tune — it's insanely fast on modern GPUs. Qwen2.5-7B as the base, LoRA adapters, 4-bit quantization during training. 107 seconds on the RTX 5090 for the full training run on 2,400 examples.
```python
# Simplified training config (Unsloth + TRL; exact kwargs vary by version)
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# from_pretrained returns (model, tokenizer); 4-bit loading keeps the
# 7B base within a single-GPU memory budget
model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen2.5-7B",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    max_seq_length=2048,
    args=TrainingArguments(
        num_train_epochs=3,
        per_device_train_batch_size=4,
        output_dir="outputs",
    ),
)
trainer.train()  # 107 seconds on RTX 5090
```
Export to GGUF for llama.cpp inference. Q5_K_M quantization gives 4.7GB with negligible quality loss vs. float16.
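Unsloth can emit the GGUF directly. A minimal sketch, assuming its `save_pretrained_gguf` helper and a hypothetical output directory (keyword names may differ across Unsloth versions):

```python
# Export the fine-tuned model to GGUF with Q5_K_M quantization.
# `model` and `tokenizer` are the objects from the training run above.
model.save_pretrained_gguf(
    "scatter-gguf",                 # hypothetical output directory
    tokenizer,
    quantization_method="q5_k_m",   # ~4.7GB, negligible loss vs. f16
)
```

The resulting file loads directly into llama.cpp or any of its bindings.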
Honest comparison: the local model is slightly worse than Haiku on edge cases and very short prompts (1-3 words). For standard prompts (5+ words with some context), quality is essentially identical — both produce valid 6-band sinc JSON with appropriate CONSTRAINTS and FORMAT inference.
The local model has higher variance on CONSTRAINTS band detail — sometimes it generates more generic constraints than Haiku would. But it's running at zero API cost, so I can run multiple scatter calls for important prompts and pick the best output if needed.
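Best-of-N at zero marginal cost can be as simple as rescoring candidates with the same validity checks used for training-data filtering. The scorer and its weighting below are my own sketch, not part of the repo:

```python
import json

BANDS = ["PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK"]

def score(candidate: str) -> float:
    """Score a candidate scatter output: invalid JSON scores -inf;
    otherwise reward populated bands, then CONSTRAINTS detail."""
    try:
        sinc = json.loads(candidate)
    except json.JSONDecodeError:
        return float("-inf")
    frags = {f.get("t"): f.get("x", "") for f in sinc.get("fragments", [])}
    populated = sum(1 for b in BANDS if frags.get(b))
    # Weight CONSTRAINTS length to counter the local model's tendency
    # toward generic constraints.
    return populated * 100 + len(frags.get("CONSTRAINTS", ""))

def pick_best(candidates: list[str]) -> str:
    """Return the highest-scoring of N locally generated scatter outputs."""
    return max(candidates, key=score)
```

Since each local call is free, running 3-5 candidates and keeping the winner costs only a few seconds of GPU time.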
At $0 per scatter call and my 7-day volume (21,194 prompts), total savings come to $1,631: the $1,588 from the exchange-rate reduction (1.6 vs. 4.2 exchanges/prompt) plus the ~$42 Haiku scatter spend I no longer pay.
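Spelling out the arithmetic (figures from the weekly numbers above; the quoted $1,631 includes sub-dollar rounding):

```python
exchange_savings = 1588    # $/week from fewer exchanges (1.6 vs. 4.2/prompt)
haiku_scatter_spend = 42   # $/week of Haiku API cost now avoided
scatter_cost_local = 0     # $/week: local GGUF inference is free per call

total_weekly = exchange_savings + haiku_scatter_spend - scatter_cost_local
print(total_weekly)  # 1630, i.e. the quoted ~$1,631 up to rounding
```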
Monthly projection: $1,500+ in savings, $0 in scatter costs. 97% cost reduction from my unstructured baseline. That's the full power of local scatter — the ROI is unlimited because the denominator is zero.
The GGUF is included in the open-source repo. You need an NVIDIA GPU — RTX 3090 or better for comfortable inference speed. On a 3090 it runs at ~150 tok/s, which is plenty fast for scatter calls. Leave a comment and I'll send the GitHub link.
Try sinc-LLM free — sincllm.com