Distilling Claude Haiku Into a 7B Model for Structured Output

March 25, 2025 · 8 min read · distillation fine-tuning unsloth sinc-llm

Contents

  1. Why distillation works for this task
  2. Generating 120 training examples
  3. Input diversity: from "hi" to paragraphs
  4. ChatML format and training setup
  5. What the model must learn: the invariants
  6. Edge cases: "???", "asdf", empty strings
  7. Validation: 9/10 pass, 1 patched
x(t) = Σ x(nT) · sinc((t − nT) / T)
The sinc-LLM formula is the target the student model must learn to apply. Every training example is a reconstruction of this formula applied to a raw prompt.

Why distillation works for this task

Knowledge distillation is the process of training a small model to reproduce the behavior of a large model. The large model (teacher) generates training data; the small model (student) learns from it. The student doesn't need to reason from first principles — it only needs to generalize the pattern the teacher demonstrated.

This works well when the teacher's task is well-defined and the output format is narrow. Sinc JSON decomposition is exactly that:

The teacher (Haiku) was given the sinc-LLM specification and asked to decompose diverse prompts. I then trained Qwen2.5-7B on the resulting (prompt, JSON) pairs. The student doesn't need to understand why CONSTRAINTS should be longest — it just needs to have internalized that pattern from enough examples.

- 120 training examples generated by Haiku
- 3 training epochs
- 2.24 initial training loss
- 1.14 final training loss

Generating 120 training examples

I wrote a data generation script that calls the Haiku API with a fixed system prompt explaining the sinc-LLM framework. Each call passes one raw prompt and collects the sinc JSON output. 120 calls at $0.002 each = $0.24 total data generation cost. I realized I could buy my entire training dataset for less than a cup of coffee.

import anthropic

client = anthropic.Anthropic()
SYSTEM = """You are a sinc-LLM scatter engine. Given any raw user prompt,
decompose it into a sinc JSON object with this exact structure:
{
  "formula": "x(t) = Σ x(nT) · sinc((t − nT) / T)",
  "T": "specification-axis",
  "fragments": [
    {"n": 0, "t": "PERSONA",      "x": "..."},
    {"n": 1, "t": "CONTEXT",      "x": "..."},
    {"n": 2, "t": "DATA",         "x": "..."},
    {"n": 3, "t": "CONSTRAINTS",  "x": "..."},  // MUST be longest band
    {"n": 4, "t": "FORMAT",       "x": "..."},
    {"n": 5, "t": "TASK",         "x": "..."}
  ]
}
Output ONLY valid JSON. No preamble, no explanation."""

def generate_example(prompt: str) -> dict:
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=SYSTEM,
        messages=[{"role": "user", "content": prompt}]
    )
    return {"prompt": prompt, "completion": msg.content[0].text}

Each generated example is saved to training_data.jsonl, one JSON object per line. I validated every generated example before including it in the training set: parse the JSON, check all 6 bands are present, verify CONSTRAINTS is the longest band.
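As a minimal sketch, that validation gate might look like the following (the `is_valid_example` helper is hypothetical; the post describes the checks but not the code):

```python
import json

EXPECTED_BANDS = ["PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK"]

def is_valid_example(completion: str) -> bool:
    """Gate applied before a (prompt, completion) pair is written to
    training_data.jsonl: parse the JSON, check all 6 bands are present
    and in order, verify CONSTRAINTS is strictly the longest band."""
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return False
    fragments = data.get("fragments", [])
    if [f.get("t") for f in fragments] != EXPECTED_BANDS:
        return False
    lengths = [len(f.get("x", "")) for f in fragments]
    # CONSTRAINTS sits at index 3 and must exceed every other band.
    return all(lengths[3] > length
               for i, length in enumerate(lengths) if i != 3)
```

Rejected examples are simply regenerated, so the training set contains only outputs that already satisfy the invariants.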

Input diversity: from "hi" to paragraphs

I selected 120 prompts to cover the full distribution of real user inputs. This is the most important part of the data generation process — I learned this the hard way from an earlier draft that was too skewed toward technical prompts. A model trained only on long technical prompts will fail on one-word inputs. A model trained only on casual prompts will underperform on detailed specifications.

The distribution I used:

- "hi": minimum input, 2 characters. Tests whether the model can still generate a valid, non-trivial CONSTRAINTS band.
- "fix line 3": under-specified task. Tests domain inference and how the model handles missing context.
- "???": non-linguistic input. Tests graceful handling; the model should still produce valid JSON.
- "asdf": gibberish. Tests whether the model defaults to a reasonable interpretation rather than producing malformed output.
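One way to keep such a distribution honest is to make the bucket counts explicit. The split below is illustrative only: the post fixes the total at 120 and the edge-case count at 10, but the other bucket sizes are assumptions for the sketch.

```python
# Illustrative prompt mix for the 120 teacher calls. Only the total (120)
# and the edge-case count (10) come from the post; the remaining bucket
# sizes are assumed.
PROMPT_MIX = {
    "one-word / casual": 30,         # e.g. "hi"
    "under-specified tasks": 30,     # e.g. "fix line 3"
    "detailed technical specs": 50,
    "edge cases": 10,                # "???", "asdf", empty strings
}

assert sum(PROMPT_MIX.values()) == 120
```

Writing the mix down as data makes it trivial to audit the set for skew before spending API budget on generation.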

ChatML format and training setup

Qwen2.5-Instruct uses ChatML as its native format. Each training example must be formatted as a complete conversation turn — system prompt, user message, and assistant response — using the ChatML special tokens. Getting this right matters: a format mismatch trains the model on token sequences it will never see at inference time, which degrades output quality.

def format_example(prompt: str, completion: str) -> str:
    return (
        f"<|im_start|>system\n"
        f"You are a sinc-LLM scatter engine. Decompose the user prompt "
        f"into 6 sinc frequency bands as valid JSON. CONSTRAINTS must be "
        f"the longest band. Output only JSON.\n"
        f"<|im_end|>\n"
        f"<|im_start|>user\n"
        f"{prompt}\n"
        f"<|im_end|>\n"
        f"<|im_start|>assistant\n"
        f"{completion}\n"
        f"<|im_end|>"
    )

Training hyperparameters: 3 epochs, batch size 2, gradient accumulation 4 (effective batch 8), learning rate 2e-4, cosine decay schedule, warmup steps 5. LoRA rank 16, alpha 16, targeting all 7 projection matrices in the attention and MLP blocks.
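Collected in one place, those hyperparameters look like the config below. This is a plain-dictionary sketch, not the actual Unsloth training call (which the post doesn't show); the key names follow common TRL/PEFT argument conventions, and the seven module names are the standard attention and MLP projections in Qwen2.5.

```python
# Training hyperparameters from the post, as a TRL/PEFT-style config sketch.
TRAIN_CONFIG = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,   # effective batch = 2 * 4 = 8
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 5,
}

LORA_CONFIG = {
    "r": 16,
    "lora_alpha": 16,
    # All 7 projection matrices in Qwen2.5's attention and MLP blocks.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}

effective_batch = (TRAIN_CONFIG["per_device_train_batch_size"]
                   * TRAIN_CONFIG["gradient_accumulation_steps"])
```

With alpha equal to rank, the LoRA update is applied at unit scale, a common default when rank is modest.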

What the model must learn: the invariants

Beyond the output format, three structural invariants define a valid sinc JSON:

Invariant 1: Exactly 6 bands, always. n=0 through n=5, in order, no duplicates, no missing entries. The model must output all 6 even when the input provides zero information about most bands.

Invariant 2: CONSTRAINTS (n=3) is always the longest band. By character count, fragments[3].x must exceed the character count of every other band's x value. This is the most frequently violated invariant in zero-shot decompositions from base models.

Invariant 3: TASK (n=5) is exactly one sentence. Not a paragraph, not a list. One atomic, imperative sentence specifying the action to take. The model must resist the temptation to add caveats or sub-tasks.
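Invariant 3 is the one the validator shown later in the post doesn't check, so a heuristic guard might look like this. It's a sketch under an assumption: splitting on terminal punctuation is a rough proxy for sentence count, not the post's method.

```python
import re

def task_is_one_sentence(task: str) -> bool:
    """Heuristic check for Invariant 3: the TASK band must be exactly
    one sentence. Splits on terminal punctuation (., !, ?); abbreviations
    and decimals would fool it, which is acceptable for short TASK bands."""
    parts = [p for p in re.split(r"[.!?]+", task.strip()) if p.strip()]
    return len(parts) == 1
```

For example, "Ask the user what they need help with." passes, while "Do X. Then do Y." fails.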

The Haiku-generated training examples enforce all three invariants in every single example. After 3 epochs of training, the student model passes invariant checks on 9/10 validation prompts without post-processing. What surprised me was that Invariant 2 — CONSTRAINTS always longest — was the hardest for base models but the easiest to teach through distillation.

Edge cases: "???", "asdf", empty strings

Edge cases are where fine-tuned models typically fail. A model trained only on well-formed prompts will generate malformed JSON or produce an error response when it encounters "???" as input.

I deliberately included 10 edge case examples in the training set. Here's how Haiku handled "???" — and what the student model learned:

{
  "formula": "x(t) = Σ x(nT) · sinc((t − nT) / T)",
  "T": "specification-axis",
  "fragments": [
    {"n": 0, "t": "PERSONA", "x": "A clarification-seeking assistant prepared to resolve ambiguity before proceeding."},
    {"n": 1, "t": "CONTEXT", "x": "The user has submitted a non-linguistic input consisting only of question marks, indicating confusion, a request for clarification, or a test of the system boundary."},
    {"n": 2, "t": "DATA",    "x": "Input received: '???'. No additional context. No prior conversation history."},
    {"n": 3, "t": "CONSTRAINTS", "x": "Do not invent intent where none is present. Do not assume the user wants any specific action. The response must acknowledge the ambiguity explicitly. Do not ask multiple clarifying questions at once — ask the single most useful one. Do not produce a complex or lengthy response. Do not fail silently or produce an empty output. The system must handle this gracefully and return a useful response regardless of input quality."},
    {"n": 4, "t": "FORMAT",  "x": "A single short question or prompt to the user requesting clarification."},
    {"n": 5, "t": "TASK",    "x": "Ask the user what they need help with."}
  ]
}

The key insight: the CONSTRAINTS band is still the longest, still operationally grounded, still specific to the edge case. The model didn't produce empty bands or refuse to output JSON. The training signal for these edge cases teaches the student to handle non-standard inputs gracefully rather than failing.

Validation: 9/10 pass, 1 patched

After training, I ran 10 held-out validation prompts through the model and checked each output against the invariants:

import json

def validate_sinc_json(output: str) -> tuple[bool, str]:
    try:
        data = json.loads(output)
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON: {e}"

    fragments = data.get("fragments", [])
    if len(fragments) != 6:
        return False, f"Expected 6 bands, got {len(fragments)}"

    band_names = [f["t"] for f in fragments]
    expected = ["PERSONA","CONTEXT","DATA","CONSTRAINTS","FORMAT","TASK"]
    if band_names != expected:
        return False, f"Band order mismatch: {band_names}"

    # CONSTRAINTS invariant
    lengths = {f["t"]: len(f["x"]) for f in fragments}
    if not all(lengths["CONSTRAINTS"] > lengths[k]
               for k in lengths if k != "CONSTRAINTS"):
        return False, "CONSTRAINTS band not longest"

    if "formula" not in data:
        return False, "Missing formula field"

    return True, "PASS"

Results: 9 passed, 1 failed. The failure was on a non-linguistic edge case ("???") where the model produced valid JSON with all 6 bands but omitted the top-level formula field. All other invariants passed. I added a one-line patch in the API server to inject the formula if it's missing:

if "formula" not in result:
    result["formula"] = "x(t) = Σ x(nT) · sinc((t − nT) / T)"

With the patch, all 10 validation prompts pass. The model has correctly internalized the CONSTRAINTS length invariant — this was the hardest invariant to train for, and it holds in 10/10 cases. The 9/10 pre-patch result is strong for 107 seconds of training on 120 examples. I didn't expect to get there in one shot.

See the distilled model in action

AI Transform on sincllm.com runs the fine-tuned sinc-scatter model — Haiku's knowledge, 7B parameters, zero marginal cost.

Try AI Transform →