The Haiku scatter hook saves me $1,588 per week at a 38x return. I wanted to go further. I wanted zero API cost for scatter, running fully on my own GPU. So I fine-tuned a model. It took 107 seconds. The results are impressive.
| Property | Value |
|---|---|
| Base model | Qwen2.5-7B |
| Training time | 107 seconds |
| Hardware | RTX 5090 |
| GGUF size | 4.7GB |
| Inference speed | 290 tok/s |
| API cost | $0 per call |
| Training examples | ~2,400 (from real scatter outputs) |
The fine-tune needed pairs of (raw_prompt, sinc_json) examples. I made them from real Haiku scatter outputs. Every prompt I sent through the Haiku scatter server over 7 days got logged. That includes the input and the structured output. 21,194 calls total. I kept only high-quality outputs (valid JSON, all 6 bands filled, CONSTRAINTS band the longest). That left about 2,400 clean training examples.
# Training data format (JSONL)
{"input": "fix the auth bug",
"output": "{\"formula\": \"x(t) = ...\", \"T\": \"specification-axis\",
\"fragments\": [{\"n\":0,\"t\":\"PERSONA\",\"x\":\"...\"},
{\"n\":1,\"t\":\"CONTEXT\",\"x\":\"...\"},
{\"n\":2,\"t\":\"DATA\",\"x\":\"...\"},
{\"n\":3,\"t\":\"CONSTRAINTS\",\"x\":\"...\"},
{\"n\":4,\"t\":\"FORMAT\",\"x\":\"...\"},
{\"n\":5,\"t\":\"TASK\",\"x\":\"...\"}]}"}
# ... 2,399 more examples
I used Unsloth for the fine-tune. It is very fast on modern GPUs. I started with Qwen2.5-7B as the base model, used LoRA adapters, and trained with 4-bit quantization. The full training run on 2,400 examples finished in 107 seconds on the RTX 5090.
# Simplified training config
model = FastLanguageModel.from_pretrained("Qwen/Qwen2.5-7B")
model = FastLanguageModel.get_peft_model(model,
r=16, lora_alpha=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
max_seq_length=2048,
num_train_epochs=3,
per_device_train_batch_size=4,
)
trainer.train() # 107 seconds on RTX 5090
I exported the model to GGUF format for llama.cpp inference. Q5_K_M quantization makes the file 4.7GB. Quality loss compared to float16 is very small.
Honest comparison: the local model is slightly worse than Haiku on edge cases and very short prompts (1-3 words). For normal prompts (5 or more words with some context), quality is about the same. Both produce valid 6-band sinc JSON with correct CONSTRAINTS and FORMAT output.
The local model sometimes produces more generic CONSTRAINTS band detail than Haiku would. But it runs at zero API cost. So I can run several scatter calls for important prompts and pick the best one.
At $0 per scatter call and my 7-day volume of 21,194 prompts, the savings from exchange rate reduction (1.6 vs. 4.2 exchanges per prompt) are $1,631. That is the full $1,588 plus the $42 Haiku spend I no longer pay.
Monthly projection: $1,500 or more in savings, $0 in scatter costs. That is a 97% cost reduction from my unstructured baseline. The ROI is unlimited because the cost per call is zero.
The GGUF is included in the open-source repo. You need an NVIDIA GPU, RTX 3090 or better, for comfortable inference speed. On a 3090 it runs at around 150 tok/s, which is fast enough for scatter calls. Leave a comment and I will send you the GitHub link.
// Production AI Engineering
sinc-LLM designs, audits, and stabilises production AI infrastructure: from vendor evaluation and cost accountability to incident controls and MCP architecture.
See what we do →