We Fine-Tuned a 7B Model in 107 Seconds to Replace a Cloud API

March 25, 2025 · 8 min read · fine-tuning unsloth local-llm sinc-llm ollama

Contents

  1. The problem with per-call API costs
  2. Teacher-student distillation
  3. Training setup: Unsloth + LoRA
  4. 107 seconds on an RTX 5090
  5. GGUF export and Ollama registration
  6. Results: 9/10 pass, 290 tok/s
  7. What changed for users

x(t) = Σ x(nT) · sinc((t − nT) / T)

The sinc-LLM framework applies Nyquist-Shannon sampling to prompt engineering. Each band is a frequency sample of intent.

The problem with per-call API costs

Every time someone clicks "Transform" on sincllm.com, I decompose their raw prompt into 6 sinc frequency bands: PERSONA, CONTEXT, DATA, CONSTRAINTS, FORMAT, and TASK. That decomposition was running against Claude Haiku at $0.002 per call — and it was adding up.

Two dollars per thousand calls. Sustainable at low volume. Not sustainable at scale, and not the right architecture for a feature that should be instantaneous and free to run. I wanted to own my inference stack, not rent it by the token.

The output is highly structured — JSON with specific band names, specific length constraints, specific invariants (CONSTRAINTS must always be the longest band). I realized this is exactly the kind of task where a small fine-tuned model can match or beat a large general-purpose API, because the output space is narrow and the training signal is clean.

I decided to distill Haiku into a 7B local model, eliminate the API call entirely, and ship the result as "AI Transform" on sincllm.com.

Previous cost per transform (Haiku API): $0.002
Current cost per transform (local model): $0.000
Total fine-tuning time: 107 s
Inference speed: 290 tok/s

Teacher-student distillation

The approach is straightforward: use a capable teacher model (Haiku) to generate high-quality training examples, then fine-tune a smaller student model (Qwen2.5-7B-Instruct) on those examples. The student learns to mimic the teacher's output format and reasoning without needing the teacher at runtime.

This works well when the task has three properties:

  1. Structured output. The target format is always the same — sinc JSON with 6 named bands. The model isn't being asked to be creative about the format.
  2. Compressible knowledge. The rules for how to fill each band (PERSONA gets a role description, CONSTRAINTS gets the longest text, TASK is the atomic action) fit comfortably in a few thousand tokens of system prompt or training signal.
  3. Input diversity. The training set covers the full range of real inputs — from single words to paragraphs — so the model generalizes rather than memorizing.

I generated 120 training examples using Haiku. Each example is a (prompt, sinc JSON) pair. Prompts range from "hi" (1 word) to 200-word specification documents. Edge cases include "???" (non-linguistic input), "asdf" (gibberish), and single-sentence commands like "fix line 3".
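Before any of those pairs go into the training set, the teacher's output is worth sanity-checking so malformed generations never reach the model. A minimal sketch of such a filter (the band names come from this post; the helper function itself is hypothetical):

```python
import json

BANDS = ["PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK"]

def is_valid_pair(teacher_output: str) -> bool:
    """Keep a (prompt, sinc JSON) pair only if the teacher's output
    parses as JSON and contains exactly the 6 named bands in order."""
    try:
        obj = json.loads(teacher_output)
    except json.JSONDecodeError:
        return False
    frags = obj.get("fragments", [])
    return [f.get("t") for f in frags] == BANDS

# A well-formed teacher output passes; anything unparseable is dropped.
good = '{"formula": "x(t)", "T": "specification-axis", "fragments": [' + \
       ", ".join(f'{{"n": {i}, "t": "{b}", "x": "..."}}' for i, b in enumerate(BANDS)) + "]}"
```

Dropping a bad pair and regenerating costs a fraction of a cent; training on one costs a polluted gradient.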

Training setup: Unsloth + LoRA

I used Unsloth for training. Unsloth patches the HuggingFace training loop with hand-written CUDA kernels and memory optimizations. It cuts VRAM usage roughly in half compared to vanilla transformers fine-tuning and runs 2-5x faster. For a 7B model on a single GPU, this matters.

LoRA (Low-Rank Adaptation) means I don't retrain all 7B parameters. Instead, small low-rank matrices are injected into the attention layers and only those are trained. The base model weights are frozen. This makes training fast and keeps the GGUF export manageable.

from unsloth import FastLanguageModel

# Load the base model with 4-bit quantization to cut VRAM usage
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Inject trainable low-rank adapters; the base weights stay frozen
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                     # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj",
                    "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,            # LoRA scaling factor
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

Training data is formatted in ChatML — the format Qwen2.5-Instruct expects at inference time. Each example becomes a user/assistant turn pair, with the system prompt explaining the sinc decomposition task.

<|im_start|>system
You are a sinc-LLM scatter engine. Decompose the user prompt into
6 sinc frequency bands as valid JSON...
<|im_end|>
<|im_start|>user
we need to fix the auth bug before the demo
<|im_end|>
<|im_start|>assistant
{
  "formula": "x(t) = Σ x(nT) · sinc((t − nT) / T)",
  "T": "specification-axis",
  "fragments": [
    {"n": 0, "t": "PERSONA", "x": "Senior software engineer..."},
    {"n": 1, "t": "CONTEXT", "x": "Pre-demo fix context..."},
    ...
  ]
}
<|im_end|>
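Each (prompt, sinc JSON) pair can be rendered into that ChatML layout by hand, as a sketch below shows; in practice the tokenizer's chat template does the same job. SYSTEM_PROMPT is abbreviated here:

```python
SYSTEM_PROMPT = (
    "You are a sinc-LLM scatter engine. Decompose the user prompt into "
    "6 sinc frequency bands as valid JSON..."
)

def to_chatml(prompt: str, sinc_json: str) -> str:
    """Render one training example as the ChatML string Qwen2.5-Instruct expects."""
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n{sinc_json}<|im_end|>\n"
    )
```

The key point is that training and inference use the identical template; a mismatch there silently degrades structured-output quality.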

107 seconds on an RTX 5090

Training ran on a local RTX 5090. 120 examples, 3 epochs, batch size 2 with gradient accumulation of 4. Effective batch size: 8. Learning rate 2e-4 with cosine decay.
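The 45 steps in the training log fall straight out of these numbers:

```python
examples, epochs = 120, 3
batch_size, grad_accum = 2, 4

effective_batch = batch_size * grad_accum      # 8 examples per optimizer step
steps_per_epoch = examples // effective_batch  # 15 steps per epoch
total_steps = steps_per_epoch * epochs         # 45 steps total
```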

The moment I saw the loss drop from 2.24 to 1.14, I knew the model was actually learning the structure — not just memorizing. That's a clean convergence curve for a structured output task. The model is learning a narrow distribution, not a broad one.

Step   1/45  | Loss: 2.2418
Step  10/45  | Loss: 1.8832
Step  20/45  | Loss: 1.5210
Step  30/45  | Loss: 1.3104
Step  40/45  | Loss: 1.1821
Step  45/45  | Loss: 1.1403
Training complete. Time: 107.3s

107 seconds. I didn't expect that to be enough — it's the kind of iteration speed that makes fine-tuning feel more like configuration than research. I could re-run with 200 examples or adjusted hyperparameters in under 3 minutes total, including data prep.

GGUF export and Ollama registration

After training, I merged the LoRA adapter back into the base model weights and exported to GGUF format using llama.cpp's quantization toolchain. I used Q4_K_M quantization — 4-bit with k-quant mixed precision on the most sensitive layers.

# Merge LoRA into base weights
model.save_pretrained_merged("sinc-scatter-merged", tokenizer)

# Export to GGUF Q4_K_M
model.save_pretrained_gguf(
    "sinc-scatter-gguf",
    tokenizer,
    quantization_method="q4_k_m"
)

The resulting file is 4.7GB. Q4_K_M gives a good quality/size tradeoff — perplexity degradation is minimal for structured output tasks where the vocabulary of valid tokens is highly constrained.
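The 4.7GB file size is consistent with just under 5 bits per weight. A back-of-the-envelope check (the 7.6B figure is the approximate total parameter count of Qwen2.5-7B, an assumption on my part):

```python
file_bytes = 4.7e9   # GGUF file size from above
params = 7.6e9       # approximate Qwen2.5-7B parameter count (assumed)

bits_per_weight = file_bytes * 8 / params
# Just under 5 bits/weight: Q4_K_M stores most tensors at 4 bits,
# plus k-quant block scales and a few layers kept at higher precision.
```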

Registering with Ollama takes a one-file Modelfile:

FROM ./sinc-scatter-q4_k_m.gguf

SYSTEM """
You are a sinc-LLM scatter engine. Your sole function is to decompose
any raw user prompt into a sinc JSON object with formula, T, and a
fragments array containing exactly 6 bands (n=0..5):
PERSONA, CONTEXT, DATA, CONSTRAINTS, FORMAT, TASK.
CONSTRAINTS must always be the longest band.
Output only valid JSON. No explanation, no preamble.
"""

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 2048

ollama create sinc-scatter -f Modelfile
# → success. Model registered as 'sinc-scatter'

ollama run sinc-scatter "fix the login bug"
# → {"formula": "x(t) = Σ...", "T": "specification-axis", "fragments": [...]}

Results: 9/10 pass, 290 tok/s

I ran 10 validation prompts covering the full range of real inputs. 9 passed on the first attempt with zero post-processing. 1 failed on a malformed edge case ("???") where the model output valid JSON but omitted the formula field. I added a fallback in the API server to inject the formula if missing — all 10 now pass.
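The fallback is a few lines in the API server. A sketch of the idea (the formula string and field names are from this post; the function itself is hypothetical):

```python
import json

DEFAULT_FORMULA = "x(t) = Σ x(nT) · sinc((t − nT) / T)"

def ensure_formula(raw: str) -> dict:
    """Parse model output; inject the formula field if the model omitted it."""
    obj = json.loads(raw)
    obj.setdefault("formula", DEFAULT_FORMULA)  # no-op when already present
    return obj
```

Cheap server-side guards like this are often the right fix for a small model's last few failure modes: the invariant is enforced in code rather than retrained into weights.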

Generation speed: 290 tokens per second on the RTX 5090. A typical sinc JSON output is 350-500 tokens. That's a 1.2-1.7 second round trip from request to complete JSON, including the HTTP overhead through the SSH tunnel.

What surprised me most: the CONSTRAINTS band invariant holds across all 10 test cases without any post-processing. CONSTRAINTS is always the longest band, as specified. The model has internalized the rule, not just copied it from examples. That's when it clicked — this isn't template filling, it's genuine generalization.

What changed for users

On sincllm.com, there are now two buttons: "Transform" and "AI Transform". The first runs the client-side template engine. The second calls the fine-tuned model. Both produce sinc JSON. The AI version reads your actual prompt and generates bands that are specific to your intent — not generic template text.

The cost per call went from $0.002 to $0. Not cheaper — zero. The marginal cost of running AI Transform one million times is the same as running it once: the electricity to spin up the RTX 5090, which is already on. I needed this to work on my hardware, and it does.

This is the right architecture for features that need to run on every user interaction. The investment is in fine-tuning, not in per-call billing. Once the model is trained and deployed, the economics invert completely. I realized that the $0.002-per-call model was never the right shape — I was renting a capability I could own for $0.25 of data generation cost.

See it in action

Try AI Transform on your own prompts — any length, any domain.
