We Fine-Tuned a 7B Model in 107 Seconds to Replace a Cloud API

March 25, 2025 · 8 min read · fine-tuning unsloth local-llm sinc-llm ollama

Contents

  1. The problem with per-call API costs
  2. Teacher-student distillation
  3. Training setup: Unsloth + LoRA
  4. 107 seconds on an RTX 5090
  5. GGUF export and Ollama registration
  6. Results: 9/10 pass, 290 tok/s
  7. What changed for users

x(t) = Σ x(nT) · sinc((t − nT) / T)

The sinc-LLM framework applies Nyquist-Shannon sampling to prompt engineering. Each band is a frequency sample of intent.

The problem with per-call API costs

Every time someone clicks "Transform" on sincllm.com, I decompose their raw prompt into 6 sinc frequency bands: PERSONA, CONTEXT, DATA, CONSTRAINTS, FORMAT, and TASK. That decomposition was running against Claude Haiku at $0.002 per call — and it was adding up.

Two dollars per thousand calls. Sustainable at low volume. Not sustainable at scale, and not the right architecture for a feature that should be instantaneous and free to run. I wanted to own my inference stack, not rent it by the token.

The output is highly structured — JSON with specific band names, specific length constraints, specific invariants (CONSTRAINTS must always be the longest band). I realized this is exactly the kind of task where a small fine-tuned model can match or beat a large general-purpose API, because the output space is narrow and the training signal is clean.

I decided to distill Haiku into a 7B local model, eliminate the API call entirely, and ship the result as "AI Transform" on sincllm.com.

Previous cost per transform (Haiku API): $0.002
Current cost per transform (local model): $0.000
Total fine-tuning time: 107 s
Inference speed: 290 tok/s

Teacher-student distillation

The approach is straightforward: use a capable teacher model (Haiku) to generate high-quality training examples, then fine-tune a smaller student model (Qwen2.5-7B-Instruct) on those examples. The student learns to mimic the teacher's output format and reasoning without needing the teacher at runtime.

This works well when the task has three properties:

  1. Structured output. The target format is always the same — sinc JSON with 6 named bands. The model isn't being asked to be creative about the format.
  2. Compressible knowledge. The rules for how to fill each band (PERSONA gets a role description, CONSTRAINTS gets the longest text, TASK is the atomic action) fit comfortably in a few thousand tokens of system prompt or training signal.
  3. Input diversity. The training set covers the full range of real inputs — from single words to paragraphs — so the model generalizes rather than memorizing.

I generated 120 training examples using Haiku. Each example is a (prompt, sinc JSON) pair. Prompts range from "hi" (1 word) to 200-word specification documents. Edge cases include "???" (non-linguistic input), "asdf" (gibberish), and single-sentence commands like "fix line 3".
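Before any of those pairs go into the training set, the teacher's output is worth sanity-checking so malformed generations never reach the model. A minimal sketch of such a filter (the band names come from this post; the helper function itself is hypothetical):

```python
import json

BANDS = ["PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK"]

def is_valid_pair(teacher_output: str) -> bool:
    """Keep a (prompt, sinc JSON) pair only if the teacher's output
    parses as JSON and contains exactly the 6 named bands in order."""
    try:
        obj = json.loads(teacher_output)
    except json.JSONDecodeError:
        return False
    frags = obj.get("fragments", [])
    return [f.get("t") for f in frags] == BANDS

# A well-formed teacher output passes; anything unparseable is dropped.
good = '{"formula": "x(t)", "T": "specification-axis", "fragments": [' + \
       ", ".join(f'{{"n": {i}, "t": "{b}", "x": "..."}}' for i, b in enumerate(BANDS)) + "]}"
```

Dropping a bad pair and regenerating costs a fraction of a cent; training on one costs a polluted gradient.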

Training setup: Unsloth + LoRA

I used Unsloth for training. Unsloth patches the HuggingFace training loop with hand-written CUDA kernels and memory optimizations. It cuts VRAM usage roughly in half compared to vanilla transformers fine-tuning and runs 2-5x faster. For a 7B model on a single GPU, this matters.

LoRA (Low-Rank Adaptation) means I don't retrain all 7B parameters. Instead, small low-rank matrices are injected into the attention layers and only those are trained. The base model weights are frozen. This makes training fast and keeps the GGUF export manageable.

from unsloth import FastLanguageModel

# Load the base model with 4-bit quantization to cut VRAM usage
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Inject trainable low-rank adapters; the base weights stay frozen
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                     # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj",
                    "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,            # LoRA scaling factor
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

Training data is formatted in ChatML — the format Qwen2.5-Instruct expects at inference time. Each example becomes a user/assistant turn pair, with the system prompt explaining the sinc decomposition task.

<|im_start|>system
You are a sinc-LLM scatter engine. Decompose the user prompt into
6 sinc frequency bands as valid JSON...
<|im_end|>
<|im_start|>user
we need to fix the auth bug before the demo
<|im_end|>
<|im_start|>assistant
{
  "formula": "x(t) = Σ x(nT) · sinc((t − nT) / T)",
  "T": "specification-axis",
  "fragments": [
    {"n": 0, "t": "PERSONA", "x": "Senior software engineer..."},
    {"n": 1, "t": "CONTEXT", "x": "Pre-demo fix context..."},
    ...
  ]
}
<|im_end|>
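Each (prompt, sinc JSON) pair can be rendered into that ChatML layout by hand, as a sketch below shows; in practice the tokenizer's chat template does the same job. SYSTEM_PROMPT is abbreviated here:

```python
SYSTEM_PROMPT = (
    "You are a sinc-LLM scatter engine. Decompose the user prompt into "
    "6 sinc frequency bands as valid JSON..."
)

def to_chatml(prompt: str, sinc_json: str) -> str:
    """Render one training example as the ChatML string Qwen2.5-Instruct expects."""
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n{sinc_json}<|im_end|>\n"
    )
```

The key point is that training and inference use the identical template; a mismatch there silently degrades structured-output quality.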

107 seconds on an RTX 5090

Training ran on a local RTX 5090. 120 examples, 3 epochs, batch size 2 with gradient accumulation of 4. Effective batch size: 8. Learning rate 2e-4 with cosine decay.
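The 45 steps in the training log fall straight out of these numbers:

```python
examples, epochs = 120, 3
batch_size, grad_accum = 2, 4

effective_batch = batch_size * grad_accum      # 8 examples per optimizer step
steps_per_epoch = examples // effective_batch  # 15 steps per epoch
total_steps = steps_per_epoch * epochs         # 45 steps total
```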

The moment I saw the loss drop from 2.24 to 1.14, I knew the model was actually learning the structure — not just memorizing. That's a clean convergence curve for a structured output task. The model is learning a narrow distribution, not a broad one.

Step   1/45  | Loss: 2.2418
Step  10/45  | Loss: 1.8832
Step  20/45  | Loss: 1.5210
Step  30/45  | Loss: 1.3104
Step  40/45  | Loss: 1.1821
Step  45/45  | Loss: 1.1403
Training complete. Time: 107.3s

107 seconds. I didn't expect that to be enough — it's the kind of iteration speed that makes fine-tuning feel more like configuration than research. I could re-run with 200 examples or adjusted hyperparameters in under 3 minutes total, including data prep.

GGUF export and Ollama registration

After training, I merged the LoRA adapter back into the base model weights and exported to GGUF format using llama.cpp's quantization toolchain. I used Q4_K_M quantization — 4-bit with k-quant mixed precision on the most sensitive layers.

# Merge LoRA into base weights
model.save_pretrained_merged("sinc-scatter-merged", tokenizer)

# Export to GGUF Q4_K_M
model.save_pretrained_gguf(
    "sinc-scatter-gguf",
    tokenizer,
    quantization_method="q4_k_m"
)

The resulting file is 4.7GB. Q4_K_M gives a good quality/size tradeoff — perplexity degradation is minimal for structured output tasks where the vocabulary of valid tokens is highly constrained.
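The 4.7GB file size is consistent with just under 5 bits per weight. A back-of-the-envelope check (the 7.6B figure is the approximate total parameter count of Qwen2.5-7B, an assumption on my part):

```python
file_bytes = 4.7e9   # GGUF file size from above
params = 7.6e9       # approximate Qwen2.5-7B parameter count (assumed)

bits_per_weight = file_bytes * 8 / params
# Just under 5 bits/weight: Q4_K_M stores most tensors at 4 bits,
# plus k-quant block scales and a few layers kept at higher precision.
```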

Registering with Ollama takes a one-file Modelfile:

FROM ./sinc-scatter-q4_k_m.gguf

SYSTEM """
You are a sinc-LLM scatter engine. Your sole function is to decompose
any raw user prompt into a sinc JSON object with formula, T, and a
fragments array containing exactly 6 bands (n=0..5):
PERSONA, CONTEXT, DATA, CONSTRAINTS, FORMAT, TASK.
CONSTRAINTS must always be the longest band.
Output only valid JSON. No explanation, no preamble.
"""

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 2048

ollama create sinc-scatter -f Modelfile
# → success. Model registered as 'sinc-scatter'

ollama run sinc-scatter "fix the login bug"
# → {"formula": "x(t) = Σ...", "T": "specification-axis", "fragments": [...]}

Results: 9/10 pass, 290 tok/s

I ran 10 validation prompts covering the full range of real inputs. 9 passed on the first attempt with zero post-processing. 1 failed on a malformed edge case ("???") where the model output valid JSON but omitted the formula field. I added a fallback in the API server to inject the formula if missing — all 10 now pass.
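The fallback is a few lines in the API server. A sketch of the idea (the formula string and field names are from this post; the function itself is hypothetical):

```python
import json

DEFAULT_FORMULA = "x(t) = Σ x(nT) · sinc((t − nT) / T)"

def ensure_formula(raw: str) -> dict:
    """Parse model output; inject the formula field if the model omitted it."""
    obj = json.loads(raw)
    obj.setdefault("formula", DEFAULT_FORMULA)  # no-op when already present
    return obj
```

Cheap server-side guards like this are often the right fix for a small model's last few failure modes: the invariant is enforced in code rather than retrained into weights.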

Generation speed: 290 tokens per second on the RTX 5090. A typical sinc JSON output is 350-500 tokens. That's a 1.2-1.7 second round trip from request to complete JSON, including the HTTP overhead through the SSH tunnel.

What surprised me most: the CONSTRAINTS band invariant holds across all 10 test cases without any post-processing. CONSTRAINTS is always the longest band, as specified. The model has internalized the rule, not just copied it from examples. That's when it clicked — this isn't template filling, it's genuine generalization.

What changed for users

On sincllm.com, there are now two buttons: "Transform" and "AI Transform". The first runs the client-side template engine. The second calls the fine-tuned model. Both produce sinc JSON. The AI version reads your actual prompt and generates bands that are specific to your intent — not generic template text.

The cost per call went from $0.002 to $0. Not cheaper — zero. The marginal cost of running AI Transform one million times is the same as running it once: the electricity to spin up the RTX 5090, which is already on. I needed this to work on my hardware, and it does.

This is the right architecture for features that need to run on every user interaction. The investment is in fine-tuning, not in per-call billing. Once the model is trained and deployed, the economics invert completely. I realized that the $0.002-per-call model was never the right shape — I was renting a capability I could own for $0.25 of data generation cost.

See it in action

Try AI Transform on your own prompts — any length, any domain.
