When to Distill a Vendor Model Into a Local One: The Engineering Case for Self-Hosting
Table of Contents
- The Decision Is Not About Capability, It Is About Control
- The Four Signals That Trigger the Distillation Decision
- What Distillation Actually Produces (and What It Does Not)
- The 10-Criterion Build vs Buy Matrix Applied to Self-Hosting
- What the Engineering Setup Looks Like
- When Self-Hosting Is the Wrong Call
- The Decision Framework (Download)
The Decision Is Not About Capability, It Is About Control
Most engineers framing the distillation question ask the wrong first question. They ask: "Is the local model good enough?" That is a second-order question. The first-order question is: "Has the vendor relationship become a control dependency that now carries unacceptable operational risk?" The four signals that answer this question are cost, data residency, update cadence, and in-house ML talent. None of them is a capability benchmark. All of them are engineering constraints that either make the vendor relationship sustainable or make self-hosting the more defensible risk posture.
This article is structured against the real 10-criterion AI Build vs Buy Framework. Four of those criteria are load-bearing for the distillation decision: criterion 3 (data sensitivity and residency), criterion 4 (in-house ML talent), criterion 5 (3-year total cost), and criterion 7 (regulatory and audit). The article maps each of the four signals to those criteria, then shows the full 10-criterion matrix scored for a distillation scenario versus staying vendor. If you do not yet have a spend baseline, the right first step is the 9-Question AI Spend Audit, which surfaces the per-call cost and volume trajectory you need for the TCO comparison in Signal 1.
The Four Signals That Trigger the Distillation Decision
Signal 1. Inference Cost Crosses the Engineering Threshold
The cost argument for self-hosting only holds when the comparison is done correctly. One side of the comparison is the vendor cost: per-call API cost times projected monthly volume, summed over 36 months. The other side is the self-hosting cost: GPU hardware cost amortized over the same 36 months plus ML engineer maintenance time. If the 36-month TCO favors self-hosting and the payback period is under 12 months, the cost signal is triggered. A comparison that is month-to-month, or that uses current volume without a projection, is not a valid cost signal. It is a gut feeling.
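The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a costing tool: the `TcoInputs` fields, the function names, and every figure in `example` are hypothetical placeholders, and the sketch assumes hardware is paid upfront and maintenance is a flat monthly amount.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TcoInputs:
    # All fields are illustrative assumptions; substitute your own baseline.
    cost_per_call: float        # vendor API cost per call, USD
    monthly_calls: List[int]    # projected call volume, one entry per month
    hardware_cost: float        # upfront GPU hardware spend, USD
    monthly_maintenance: float  # ML engineer time and hosting, USD per month


def vendor_tco(inputs: TcoInputs) -> float:
    """Total vendor API spend over the projection horizon."""
    return sum(inputs.cost_per_call * calls for calls in inputs.monthly_calls)


def self_host_tco(inputs: TcoInputs) -> float:
    """Upfront hardware plus flat maintenance over the same horizon."""
    return inputs.hardware_cost + inputs.monthly_maintenance * len(inputs.monthly_calls)


def payback_month(inputs: TcoInputs) -> Optional[int]:
    """First month (1-indexed) in which cumulative vendor spend reaches
    cumulative self-hosting spend, or None if it never does."""
    vendor_cum = 0.0
    self_cum = inputs.hardware_cost
    for month, calls in enumerate(inputs.monthly_calls, start=1):
        vendor_cum += inputs.cost_per_call * calls
        self_cum += inputs.monthly_maintenance
        if vendor_cum >= self_cum:
            return month
    return None


def cost_signal_triggered(inputs: TcoInputs, max_payback_months: int = 12) -> bool:
    """Both conditions must hold: horizon TCO favors self-hosting and
    payback arrives within max_payback_months."""
    payback = payback_month(inputs)
    return (
        self_host_tco(inputs) < vendor_tco(inputs)
        and payback is not None
        and payback <= max_payback_months
    )


# Hypothetical 36-month projection at flat volume.
example = TcoInputs(
    cost_per_call=0.01,
    monthly_calls=[2_000_000] * 36,
    hardware_cost=60_000.0,
    monthly_maintenance=12_000.0,
)
```

Passing a real volume projection (rather than repeating the current month 36 times) is the point of the exercise: a growing or shrinking `monthly_calls` list moves the payback month directly.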
Build vs Buy criterion 5 (3-year total cost) is the correct scoring instrument for this signal. If you do not yet have a per-call cost baseline, run the 9-Question AI Spend Audit before attempting the TCO comparison. The audit surfaces idle infrastructure burn, model-tier mismatch, and cache-miss tax that the vendor invoice does not itemize separately.
Signal 2. Data Residency or IP Boundaries Cannot Be Met by the Vendor
This signal is triggered when the vendor cannot credibly answer one or more of these three questions: Where does my data go during inference? Who owns the outputs? Can I guarantee data stays within a specific jurisdiction? For regulated industries (financial services, healthcare, government contracting), these are not hypothetical concerns. They are contractual and legal requirements.
The NIST AI Risk Management Framework (GOVERN function, available at airc.nist.gov/RMF/1) addresses supply-chain risk and vendor dependency controls explicitly. The EU AI Act (Regulation 2024/1689, at eur-lex.europa.eu) sets obligations for third-party provider dependencies in high-risk AI system deployments. ISO/IEC 42001:2023 (at iso.org/standard/81230.html) covers supplier relationship management and AI system lifecycle requirements relevant to model provenance and update control. None of these frameworks tell you to self-host. They require you to have a documented control posture for the vendor dependency. If the vendor cannot satisfy criterion 3 (data sensitivity and residency) in the Build vs Buy Framework, self-hosting is the engineering path, not a preference.
Output IP ownership is a separate, underappreciated dimension. Some vendor terms leave the IP ownership of model outputs ambiguous or subject to the vendor's own use. For companies whose product value is in the outputs (generated content, structured reports, trained classifiers), this is a material business risk, not a legal footnote.
Signal 3. The Vendor Update Cadence Is Breaking Your Outputs
This is the most underappreciated signal. Vendor model updates change token distributions without notice. A structured-output consumer, a JSON parser, a named-entity extractor, or a downstream classifier that was calibrated against one model version will break silently when the vendor deploys a new checkpoint. The breakage is not always immediate. It can be gradual: outputs that were 98% parseable drop to 87%, surfacing as a support ticket three weeks later.
Self-hosting gives you control over the update cadence. You gate updates behind an eval suite. You roll back if the eval regresses. This maps directly to criterion 7 (regulatory and audit) in the Build vs Buy Framework and to audit criterion 7 (model-update cadence and rollback) from the 10-Point AI Vendor Audit. If your downstream system is sensitive to token distribution shifts, losing rollback control is an operational risk, not an inconvenience.
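A gate of this kind can be as simple as comparing the parse rate of a candidate checkpoint against the rate recorded for the checkpoint currently in production. The sketch below assumes the downstream consumer expects a JSON object per response; the function names and the `max_drop` tolerance are illustrative choices, not part of any framework referenced here.

```python
import json
from typing import Iterable


def parse_rate(outputs: Iterable[str]) -> float:
    """Fraction of raw model outputs that parse as a JSON object."""
    total = 0
    parsed = 0
    for raw in outputs:
        total += 1
        try:
            if isinstance(json.loads(raw), dict):
                parsed += 1
        except json.JSONDecodeError:
            # Malformed output counts as a failure, not as an error.
            pass
    return parsed / total if total else 0.0


def update_allowed(baseline_rate: float, candidate_rate: float,
                   max_drop: float = 0.01) -> bool:
    """Allow the new checkpoint only if its parse rate has not fallen
    more than max_drop below the production baseline."""
    return candidate_rate >= baseline_rate - max_drop
```

Run against a fixed prompt set, the same check doubles as a drift monitor: a drop from 0.98 to 0.87 fails the gate on the day of the update instead of surfacing weeks later.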
Signal 4. In-House ML Talent Can Absorb the Maintenance Burden
This signal is necessary but not sufficient on its own. A distilled 7B model requires engineers who can re-tune when the task distribution drifts, maintain eval coverage, and run rollback paths. "We can hire for it" is not a yes answer. The question is whether a named engineer on the current team has run fine-tuning workflows before and has bandwidth allocated for ongoing maintenance. Build vs Buy criterion 4 (in-house ML talent) is the correct scoring instrument. If it scores "Favors Vendor," the distillation path is not viable regardless of how cost and data signals score.
The maintenance burden has three specific dimensions: prompt re-tuning when the task distribution drifts, eval coverage that catches regressions before they reach production, and documented rollback paths back to the vendor API during a failure event. A self-hosted model without these three is more fragile than the vendor dependency it replaced.
Is your AI spend producing measurable outcomes, or just activity?
The AI Cost Reality Check asks 9 procurement-level questions: cost per resolved task, idle infrastructure burn, vendor concentration premium, shadow AI exposure, and hallucination rework cost. Free PDF, 15 minutes per quarter.
→ Get the AI Cost Reality Check
What Distillation Actually Produces (and What It Does Not)
Distillation produces a model that handles the task distribution it was trained on. It does not produce a frontier model replacement. This distinction matters for the decision. If your use case is a well-bounded structured-output task (classification, entity extraction, templated generation, schema-constrained JSON), a distilled 7B model can handle it with latency and cost profiles that a vendor API cannot match at volume. If your use case requires broad reasoning, cross-domain generalization, or frequent task-type changes, a distilled model will regress on the cases it was not trained on.
The evidence from sincllm's documented distillation of Claude Haiku into Qwen-7B illustrates this boundary. The distillation targeted a specific structured-output task class. It did not attempt to replace frontier reasoning or cross-domain generalization. That is the correct scope for a 7B distillation project. Similarly, the project documenting replacing a vendor API with a fine-tuned 7B model in production demonstrates that production replacement at the 7B scale is viable when the task distribution is well-defined and the eval coverage is in place before cutover. Neither project claims equivalence to the frontier model. Both demonstrate production viability within a bounded task scope.
The honest engineering framing is: distillation is a narrowing, not a cloning. You are not replacing the vendor model. You are replacing the vendor's handling of a specific task distribution with a model you control that handles that distribution with comparable or better accuracy, at lower cost, with no external dependency. The residual use case for the vendor API (the cases outside the task distribution) either stays vendor or is explicitly out of scope.
The 10-Criterion Build vs Buy Matrix Applied to Self-Hosting
The following table maps all 10 criteria from the AI Build vs Buy Framework to how they typically score for a distillation decision versus staying vendor. Four criteria are load-bearing toward self-hosting. Three typically favor staying vendor. Three are context-dependent. The table shows the shape of the decision; the full scored matrix for your specific system is in the downloadable framework.
| Criterion | Stay Vendor | Distill + Self-Host | Load-Bearing? | Notes |
|---|---|---|---|---|
| C1: Time-to-value horizon | Favors Vendor | Disfavors | No | Distillation adds 3-9 months before production cutover. If time-to-value is under 6 months, stay vendor. |
| C2: Strategic differentiation | Context-Dependent | Context-Dependent | No | Self-hosting adds differentiation only if model behavior is a core product feature. |
| C3: Data sensitivity and residency | Disfavors (if constrained) | Favors Distillation | Yes | Load-bearing: if data cannot leave jurisdiction or vendor terms do not cleanly bound PII, self-hosting is the only path. |
| C4: In-house ML talent | Neutral | Favors Distillation (if talent present) | Yes | Load-bearing: no named ML engineer with fine-tuning experience means distillation is not viable regardless of other signals. |
| C5: 3-year total cost | Favors Vendor (low volume) | Favors Distillation (high volume) | Yes | Load-bearing: requires real TCO comparison with GPU amortization, not a month-to-month invoice comparison. |
| C6: Vendor lock-in tolerance | Context-Dependent | Context-Dependent | No | Lock-in tolerance is a business decision, not an engineering one. Score it honestly against your contract terms. |
| C7: Regulatory and audit | Disfavors (if audited) | Favors Distillation | Yes | Load-bearing: model provenance, update control, and rollback audit trails are materially easier to demonstrate on a self-hosted system. |
| C8: Integration depth | Favors Vendor | Disfavors | No | Deep integrations with vendor tooling (fine-tuning APIs, evals platforms, function calling) need to be re-implemented locally. Significant engineering cost. |
| C9: Iteration cadence | Favors Vendor | Disfavors | No | Rapid task-type changes favor staying vendor. A distilled model requires re-training when the task distribution changes significantly. |
| C10: Failure-mode visibility | Context-Dependent | Context-Dependent | No | Self-hosting gives full observability into model behavior. Vendor APIs give limited visibility. Depends on whether you have the infrastructure to exploit full observability. |
Four load-bearing criteria (C3, C4, C5, C7) determine whether distillation is viable. If ML talent (C4) scores "Favors Vendor," or if none of C3, C5, and C7 favors distillation, the distillation path is not ready regardless of what the other six criteria say.
Build in-house or buy a platform? Use the framework before you decide.
The Build vs Buy Framework scores 10 criteria across time-to-value, data residency, total 3-year cost, and vendor lock-in tolerance. One-page decision matrix. Free PDF, usable in any board presentation.
→ Download the AI Build vs Buy Framework
What the Engineering Setup Looks Like
This section is a pre-commitment checklist, not a tutorial. Before starting a distillation project, three engineering decisions need to be resolved.
Inference stack. Self-hosting runs on either local GPU hardware or a cloud GPU instance. Local hardware has lower per-token cost at steady volume but requires hardware procurement, maintenance, and failure planning. A cloud GPU instance (on-demand or reserved) has higher per-token cost but lower operational overhead and shorter time-to-first-inference. For an initial distillation proof, a cloud GPU instance is typically the right starting point. The evidence from running a local LLM as a production website backend demonstrates that local inference is viable in production, but the infrastructure decisions are non-trivial and need to match your latency SLO before cutover.
Eval coverage before cutover. A distilled model going to production without an eval gate is the primary failure mode of self-hosting projects. The eval suite needs to cover the full task distribution the distilled model will handle, with pass/fail thresholds set before any inference traffic is routed. Incident Readiness control 8 (eval coverage) from the 12-Control Audit is the correct checklist for this gate. Do not cut over until the eval suite passes at your production quality threshold.
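One way to make "thresholds set before any inference traffic is routed" concrete is to declare a threshold per task category and fail the gate if any category is below it or has no eval cases at all. This is a minimal sketch; the category names, the `cutover_gate` function, and the result format are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def cutover_gate(results: Dict[str, List[bool]],
                 thresholds: Dict[str, float]) -> Tuple[bool, Dict[str, float]]:
    """results maps a task category to per-case pass/fail outcomes.
    Returns (gate_passed, pass_rate_per_category). A category that is
    listed in thresholds but has no eval cases fails the gate, so gaps
    in coverage cannot pass silently."""
    rates: Dict[str, float] = {}
    gate_passed = True
    for category, threshold in thresholds.items():
        cases = results.get(category, [])
        rate = sum(cases) / len(cases) if cases else 0.0
        rates[category] = rate
        if not cases or rate < threshold:
            gate_passed = False
    return gate_passed, rates


# Hypothetical thresholds, fixed before the first eval run.
THRESHOLDS = {"classification": 0.95, "extraction": 0.95}
```

Keying the loop on `thresholds` rather than on `results` is deliberate: the declared task distribution defines what must be covered, not whatever eval cases happen to exist.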
Rollback path. The vendor API must remain live as a fallback during the transition period and for the first 90 days of production operation. The rollback path is not an emergency measure: it is a planned operational state. The routing logic (percent of traffic to distilled model, percent to vendor API) should be controlled by a feature flag, not a code deploy. This mirrors the kill-switch and rollback controls in incident readiness planning.
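The routing described above might look like the following sketch. The flag store is stood in for by a plain dictionary, and `call_distilled` and `call_vendor` are hypothetical callables wrapping the two backends; a real deployment would read the fraction from whatever feature-flag system is already in use.

```python
import random
from typing import Callable, Tuple

# Stand-in for a feature-flag store; changing this value needs no code deploy.
FLAGS = {"distilled_traffic_fraction": 0.10}


def route_request(prompt: str,
                  call_distilled: Callable[[str], str],
                  call_vendor: Callable[[str], str],
                  rng: Callable[[], float] = random.random) -> Tuple[str, str]:
    """Send roughly the flagged fraction of traffic to the distilled model.
    Any failure on the distilled path falls back to the vendor API.
    Returns (route_taken, response) so the split can be logged."""
    fraction = FLAGS["distilled_traffic_fraction"]
    if rng() < fraction:
        try:
            return "distilled", call_distilled(prompt)
        except Exception:
            # Planned operational state, not an emergency path.
            return "vendor_fallback", call_vendor(prompt)
    return "vendor", call_vendor(prompt)
```

Setting the fraction to `0.0` is the full rollback state: `random.random()` returns values in `[0.0, 1.0)`, so no request takes the distilled path.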
When Self-Hosting Is the Wrong Call
Three cases make self-hosting the wrong engineering decision, regardless of how the cost or data signals score.
No in-house ML talent. If no named engineer on the team has run fine-tuning or distillation workflows before, the project will stall at the training phase or produce a model that cannot be maintained when the task distribution drifts. Build vs Buy criterion 4 is not optional. "We will hire for it" is a project risk, not a capability.
Payback period exceeds 12 months. A distillation project that does not reach cost-parity with the vendor API within 12 months of production cutover is a net-negative investment in most cases. The ML engineer time, hardware procurement or cloud GPU spend, and eval infrastructure represent a significant upfront cost. If the volume does not justify that cost within 12 months, the TCO comparison does not support the decision.
Task distribution is too broad for fine-tuning scope. A distilled 7B model works when the task distribution is stable and bounded. If the product requires the model to handle a wide variety of task types, domain switches, or prompt patterns that change frequently, a distilled model will require constant re-training. At that point, the maintenance burden exceeds the cost and control benefits of self-hosting. Stay vendor and invest in better API cost management and contract terms instead.
The cost of getting this wrong is not just a failed project. A brittle self-hosted model with no fallback path and no eval coverage is worse than the vendor dependency it replaced. You have added hardware operational risk, lost the vendor's model quality improvements, and created a system that breaks silently when the task distribution drifts. The vendor dependency at least had a defined failure mode and a support contract.
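The three disqualifying cases can be written down as an explicit pre-commitment check. The sketch below is one possible encoding; the `Readiness` fields and the `disqualifiers` function are hypothetical names, and the boolean inputs still require an honest human judgment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Readiness:
    # A named engineer on the current team has run fine-tuning before.
    named_ml_engineer: bool
    # Months to cost parity after cutover; None if parity is never reached.
    payback_months: Optional[int]
    # The task distribution is stable and bounded.
    task_distribution_bounded: bool


def disqualifiers(r: Readiness, max_payback_months: int = 12) -> List[str]:
    """Return every reason self-hosting is the wrong call.
    An empty list means none of the three cases applies."""
    reasons: List[str] = []
    if not r.named_ml_engineer:
        reasons.append("no in-house ML talent")
    if r.payback_months is None or r.payback_months > max_payback_months:
        reasons.append(f"payback period exceeds {max_payback_months} months")
    if not r.task_distribution_bounded:
        reasons.append("task distribution too broad for fine-tuning scope")
    return reasons
```

Returning the full list rather than stopping at the first hit keeps the output usable as a review artifact: each entry names a gap that has to close before the project starts.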
The Decision Framework (Download)
The four signals and the 10-criterion matrix in this article give you the shape of the decision. Scoring your own system against all 10 criteria requires the full framework, which includes the complete scoring rubric, the load-bearing criteria thresholds, and a one-page matrix usable in a board presentation. If you are at the threshold and want a production engineer to pressure-test the decision before committing resources, you can book a 30-minute production review after you download the framework.
→ Download the AI Build vs Buy Framework