The pattern is: identify an AI feature you're currently running via a cloud API, fine-tune a small local model to perform that feature with equal quality, and serve the model from local hardware through an SSH tunnel to your production VPS. The marginal cost per API call drops to zero.
I did this for sinc-LLM prompt decomposition, replacing Claude Haiku API calls at $0.002 each with a local Qwen2.5-7B model. Training took 107 seconds. The model now runs at 290 tokens/second on my RTX 5090 and serves the AI Transform feature on sincllm.com. I wanted to own my inference stack — this is how I got there.
This guide is general — you can apply it to any structured output task: text classification, entity extraction, code formatting, sentiment analysis, schema validation, or anything else where the output space is narrow enough to learn from examples.
- A local GPU with at least 8GB VRAM. RTX 3080 works. RTX 5090 runs at 290 tok/s. CPU-only is possible but slow.
- Python 3.10+, CUDA toolkit, Ollama, SSH access to a VPS running nginx.
- An account with any capable LLM API for teacher data generation. We used Anthropic (Haiku). OpenAI works too.
- ~2 hours total: 30min data gen, 5min training, 30min tunnel setup, 30min nginx + frontend.
Before generating any training data, write down three things: the range of inputs you expect, the exact output schema, and the invariants every output must satisfy.
Write the validation function first. This is the acceptance criterion for every training example and every inference result. If you can't validate the output programmatically, your invariants aren't precise enough.
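As a concrete sketch, here is what such a predicate can look like. The schema is hypothetical (a JSON object with a non-empty `steps` list of strings); swap in your own invariants:

```python
import json

def validate_output(raw: str) -> bool:
    """Return True iff the model output satisfies every invariant.

    Hypothetical schema for illustration: a JSON object whose
    "steps" field is a non-empty list of non-blank strings.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    steps = obj.get("steps")
    if not isinstance(steps, list) or not steps:
        return False
    return all(isinstance(s, str) and s.strip() for s in steps)
```

The same function gates training examples, held-out evaluation, and production responses, so it is worth making it strict.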
Pick a capable teacher model. I used Claude Haiku because it's fast and cheap — $0.002 per call. GPT-4o-mini works too. The teacher's only job is to demonstrate the task on diverse inputs. Quality matters more than quantity: 120 high-quality examples beat 1,000 sloppy ones. I learned this when an early draft with more but noisier examples actually underperformed.
```python
import anthropic, json, time
from pathlib import Path

client = anthropic.Anthropic(api_key="YOUR_KEY")

SYSTEM = """You are a [YOUR TASK] engine. Given any user input,
produce output in this exact format:
[YOUR SCHEMA HERE]
Output ONLY valid JSON/text. No explanation."""

prompts = [
    # short inputs
    "hello", "fix it", "help",
    # medium inputs
    "analyze the performance issue in our API",
    "write unit tests for the auth module",
    # long inputs
    "We need to migrate our PostgreSQL database from version 12 to 15...",
    # edge cases
    "???", "1234", "",
    # ... 120 total
]

output_path = Path("training_data.jsonl")
with output_path.open("w") as f:
    for i, prompt in enumerate(prompts):
        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=1024,
            system=SYSTEM,
            messages=[{"role": "user", "content": prompt}],
        )
        example = {
            "prompt": prompt,
            "completion": response.content[0].text,
        }
        f.write(json.dumps(example) + "\n")
        print(f"[{i+1}/{len(prompts)}] done")
        time.sleep(0.1)  # basic rate limit respect
```
Validate every generated example immediately. If the teacher output fails your validation function, regenerate or manually fix it. Do not include invalid examples in training data — they teach the student to produce invalid output.
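One way to enforce this is a filtering pass over the generated JSONL before training. This is a minimal sketch; it assumes you pass in your own validation predicate (`str -> bool`) as described above:

```python
import json

def filter_examples(in_path: str, out_path: str, validate) -> int:
    """Keep only examples whose completion passes `validate`.

    `validate` is your task-specific predicate: str -> bool.
    Rejected prompts are printed so they can be regenerated or
    fixed by hand. Returns the number of examples kept.
    """
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            ex = json.loads(line)
            if validate(ex["completion"]):
                dst.write(json.dumps(ex) + "\n")
                kept += 1
            else:
                print("REJECTED:", ex["prompt"][:60])
    return kept
```

Run it once after data generation and train only on the cleaned file.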
Install Unsloth following the official instructions for your CUDA version. The setup takes 5-10 minutes. After that, the training script is under 80 lines.
```shell
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset
import json

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Format training data in ChatML.
SYSTEM = """..."""  # paste the same system prompt used for data generation

def format_example(ex):
    return (f"<|im_start|>system\n{SYSTEM}\n<|im_end|>\n"
            f"<|im_start|>user\n{ex['prompt']}\n<|im_end|>\n"
            f"<|im_start|>assistant\n{ex['completion']}\n<|im_end|>")

data = [json.loads(line) for line in open("training_data.jsonl")]
dataset = Dataset.from_list([{"text": format_example(ex)} for ex in data])

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        output_dir="./output",
        fp16=True,
    ),
)
trainer.train()
```
On my RTX 5090, 120 examples at 3 epochs takes about 107 seconds. On an RTX 3080, expect 8-12 minutes. Either way, it's fast enough to iterate on your training set multiple times in a single afternoon. That iteration speed changes how you think about fine-tuning — it stops feeling like a research project and starts feeling like debugging.
```python
# Export to GGUF Q4_K_M
model.save_pretrained_gguf(
    "my-model-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)
# Produces: my-model-gguf/my-model-q4_k_m.gguf (~4.7GB for 7B)
```
Create a Modelfile and register with Ollama:
```
FROM ./my-model-q4_k_m.gguf

SYSTEM """
[YOUR SYSTEM PROMPT]
"""

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 2048
```

```shell
ollama create my-model -f Modelfile
ollama run my-model "test input"
```
Test the model against your validation function on all training examples plus 10-20 held-out prompts. If the pass rate is below 80%, check your training data format: most failures at this stage are ChatML formatting issues, not model quality issues.
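A small harness for this check might look as follows. It assumes the model was registered with Ollama as `my-model` (per the Modelfile step) and that `validate` is your own predicate; the `generate` helper hits Ollama's local HTTP API directly:

```python
import json
import urllib.request

def generate(prompt: str, model: str = "my-model",
             url: str = "http://127.0.0.1:11434/api/generate") -> str:
    """Fetch one non-streaming completion from the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]

def pass_rate(prompts, validate, generate_fn=generate) -> float:
    """Fraction of prompts whose generated output passes `validate`."""
    passed = sum(1 for p in prompts if validate(generate_fn(p)))
    return passed / len(prompts)
```

Feed it the training prompts first, then the held-out set; a gap between the two numbers is the clearest sign of overfitting to the training examples.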
On your local machine, create an SSH key pair and add the public key to your VPS's ~/.ssh/authorized_keys. Then run the tunnel:
```shell
ssh -N -R 11434:localhost:11434 user@YOUR_VPS_IP \
    -i ~/.ssh/vps_key \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes \
    -o StrictHostKeyChecking=no
```
On the VPS, confirm /etc/ssh/sshd_config has:
```
AllowTcpForwarding yes
GatewayPorts no
```
Test the tunnel: from the VPS, run curl http://127.0.0.1:11434/api/tags. You should see Ollama's model list from your local machine.
Run the tunnel as a persistent service. On Linux (local machine), create a systemd unit. On Windows, use Task Scheduler or NSSM to wrap the SSH command as a service that starts on boot and restarts on failure.
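On Linux, a unit along these lines does the job. This is a sketch, not a drop-in file: the unit name, `User=`, key path, and VPS address are placeholders for your own values.

```ini
# /etc/systemd/system/ollama-tunnel.service  (on the local machine)
[Unit]
Description=Reverse SSH tunnel exposing Ollama to the VPS
After=network-online.target
Wants=network-online.target

[Service]
User=youruser
ExecStart=/usr/bin/ssh -N -R 11434:localhost:11434 user@YOUR_VPS_IP \
    -i /home/youruser/.ssh/vps_key \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now ollama-tunnel`; `Restart=always` plus `ExitOnForwardFailure=yes` makes systemd rebuild the tunnel whenever the forward drops.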
```shell
pip install fastapi uvicorn httpx
```
```python
# api.py
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import httpx, time, json

app = FastAPI()

@app.post("/ai-feature")
async def ai_feature(request: dict):
    prompt = request.get("prompt", "")
    t0 = time.time()
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            "http://127.0.0.1:11434/api/generate",
            json={
                "model": "my-model",
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.3},
            },
        )
    elapsed = time.time() - t0
    result = resp.json()
    output = result.get("response", "")
    return JSONResponse(
        # json.loads raises if the model emits invalid JSON; FastAPI
        # turns that into a 500, which the frontend treats as a failure.
        content={"result": json.loads(output)},
        headers={"X-Gen-Time": str(round(elapsed, 3))},
    )

# Run: uvicorn api:app --host 127.0.0.1 --port 8461
```
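If you want the endpoint to tolerate an occasional malformed completion instead of failing the request, one option is a small retry wrapper. This is a synchronous sketch of the idea (adapt it to the async call as needed); `generate_fn` stands in for whatever produces raw model text:

```python
import json

def parse_or_retry(generate_fn, prompt: str, max_attempts: int = 3):
    """Call `generate_fn(prompt)` until the output parses as JSON.

    Raises ValueError after `max_attempts` failures so the caller
    can return an explicit error instead of malformed output.
    """
    last = ""
    for _ in range(max_attempts):
        last = generate_fn(prompt)
        try:
            return json.loads(last)
        except json.JSONDecodeError:
            continue
    raise ValueError(f"model never produced valid JSON: {last[:120]!r}")
```

With low temperature and a well-trained model, retries should be rare; if they are not, that is a signal to revisit the training data rather than mask it here.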
Run uvicorn as a systemd service on the VPS so it restarts automatically.
Add a location block to your nginx config:
```nginx
location /api/ai-feature {
    proxy_pass http://127.0.0.1:8461/ai-feature;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 30s;
    proxy_connect_timeout 5s;

    # CORS if needed
    add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Methods "POST, OPTIONS";
    add_header Access-Control-Allow-Headers "Content-Type";
}
```

```shell
nginx -t && nginx -s reload
```
In your frontend, add the button and call:
```javascript
async function runAIFeature(prompt) {
  const btn = document.getElementById('ai-btn')
  btn.disabled = true
  btn.textContent = 'Running...'
  try {
    const res = await fetch('/api/ai-feature', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({prompt})
    })
    // fetch does not reject on HTTP errors, so check the status explicitly
    if (!res.ok) throw new Error(`HTTP ${res.status}`)
    const data = await res.json()
    renderResult(data.result)
  } catch (e) {
    // model offline — show fallback or error
    showFallback()
  } finally {
    btn.disabled = false
    btn.textContent = 'AI Feature'
  }
}
```
The ~$0.24 of teacher API calls is the entire investment; after that, every call to the AI feature is free. At Haiku's $0.002 per call, break-even is 120 calls, and I hit that the first day I deployed.
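The arithmetic, spelled out from the figures above:

```python
teacher_calls = 120     # training examples generated
price_per_call = 0.002  # Haiku cost per call, in dollars

data_cost = teacher_calls * price_per_call       # one-time spend, ~$0.24
break_even_calls = data_cost / price_per_call    # ~120 calls to recoup it
```

Everything past that point is pure savings, scaled by however many calls the feature serves.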
For high-volume features — anything running more than a few hundred calls per month — this architecture pays for itself within days. The only constraint is that your local GPU must stay on and connected. For features that can tolerate a fallback when the model is offline, this is a fully viable production architecture. I run it this way in production right now.
The pattern generalizes completely. I applied it to prompt decomposition, but the same seven steps work for text classification, named entity extraction, code review, sentiment scoring, or any other narrow structured-output task you currently pay an API to run. The key insight: if you can write a validation function for the output, you can distill a model to produce it.
AI Transform is the live implementation of this exact architecture. Try it on your own prompts.
Try AI Transform →