Here is the idea. Find an AI feature you pay a cloud API to run. Train a small local model to do the same job just as well. Then serve that model from your own computer through an SSH tunnel to your web server. The cost of each API call drops to zero.
I did this for sinc-LLM prompt decomposition. I replaced Claude Haiku API calls at $0.002 each with a local Qwen2.5-7B model. Training took 107 seconds. The model now runs at 290 tokens/second on my RTX 5090 and powers the AI Transform feature on sincllm.com. I wanted to own my inference stack. This is how I got there.
This guide works for any task where the output has a clear shape: text classification, entity extraction, code formatting, sentiment analysis, schema validation, or anything else where the output is narrow enough to learn from examples.
A local GPU with at least 8 GB VRAM. An RTX 3080 works. An RTX 5090 runs at 290 tok/s. CPU-only is possible but slow.
Python 3.10 or newer, the CUDA toolkit, Ollama, and SSH access to a VPS running nginx.
An account with any capable LLM API to generate teacher data. We used Anthropic (Haiku). OpenAI works too.
About 2 hours total: 30 min for data, 5 min for training, 30 min for the tunnel, 30 min for nginx and the frontend.
Before you generate any training data, write down three things.
Write the validation function first. It is the pass/fail test for every training example and every inference result. If you cannot validate the output with code, your invariants are not precise enough.
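As a minimal sketch of what such a validation function can look like, assume a hypothetical task whose output is a JSON object with exactly two keys. The key names `label` and `confidence` and the allowed labels are illustrative only and are not the sinc-LLM schema; replace them with your own invariants.

```python
import json

# Hypothetical schema for illustration: replace with your own task's invariants.
ALLOWED_LABELS = {"bug", "feature", "question"}

def validate(output: str) -> bool:
    """Return True only if `output` satisfies every invariant of the task."""
    try:
        data = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False  # not parseable JSON at all
    if not isinstance(data, dict):
        return False  # valid JSON, but not an object
    if set(data) != {"label", "confidence"}:
        return False  # missing or extra keys
    label = data["label"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return False
    conf = data["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return False
    return 0.0 <= conf <= 1.0
```

Because the function returns a plain boolean, the same check can gate both teacher outputs at data-generation time and student outputs at inference time.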
Pick a capable teacher model. I used Claude Haiku because it is fast and cheap at $0.002 per call. GPT-4o-mini works too. The teacher's only job is to show the task on many different inputs. Quality beats quantity: 120 good examples beat 1,000 messy ones. I learned this when an early draft with more but noisier examples actually did worse.
import anthropic, json, time
from pathlib import Path
# Placeholder key; omit api_key to read ANTHROPIC_API_KEY from the environment instead
client = anthropic.Anthropic(api_key="YOUR_KEY")
SYSTEM = """You are a [YOUR TASK] engine. Given any user input,
produce output in this exact format:
[YOUR SCHEMA HERE]
Output ONLY valid JSON/text. No explanation."""
prompts = [
    # short inputs
    "hello", "fix it", "help",
    # medium inputs
    "analyze the performance issue in our API",
    "write unit tests for the auth module",
    # long inputs
    "We need to migrate our PostgreSQL database from version 12 to 15...",
    # edge cases
    "???", "1234", "",
    # ... 120 total
]
output_path = Path("training_data.jsonl")
with output_path.open("w", encoding="utf-8") as f:
    for i, prompt in enumerate(prompts):
        try:
            response = client.messages.create(
                model="claude-haiku-4-5",
                max_tokens=1024,
                system=SYSTEM,
                messages=[{"role": "user", "content": prompt}],
            )
        except anthropic.APIError as err:
            # Some edge cases (such as the empty string) may be rejected by the API;
            # log them here and write those examples by hand
            print(f"[{i+1}/{len(prompts)}] skipped: {err}")
            continue
        example = {
            "prompt": prompt,
            "completion": response.content[0].text,
        }
        f.write(json.dumps(example) + "\n")
        print(f"[{i+1}/{len(prompts)}] done")
        time.sleep(0.1)  # basic rate limit respect
Check every example right after you generate it. If the teacher output fails your validation function, regenerate it or fix it by hand. Never put invalid examples in your training data. Invalid examples teach the student model to produce invalid output.
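One way to enforce this rule is to validate inside the generation loop and retry a few times before setting a prompt aside for manual repair. The sketch below assumes two callables that you supply: `generate(prompt)`, which wraps the teacher API call, and `validate(output)`, your validation function. The names and the retry limit of 3 are assumptions made for illustration.

```python
from typing import Callable

def collect_valid_examples(
    prompts: list[str],
    generate: Callable[[str], str],
    validate: Callable[[str], bool],
    max_attempts: int = 3,
) -> tuple[list[dict], list[str]]:
    """Keep only examples whose completion passes validation.

    Returns (valid_examples, prompts_needing_manual_review).
    """
    valid: list[dict] = []
    rejected: list[str] = []
    for prompt in prompts:
        for _ in range(max_attempts):
            completion = generate(prompt)
            if validate(completion):
                valid.append({"prompt": prompt, "completion": completion})
                break
        else:
            # for/else: this branch runs only when no attempt passed validation
            rejected.append(prompt)
    return valid, rejected
```

Only the `valid` list should be written to the training file. The `rejected` list is the set of prompts to fix by hand or drop.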
Install Unsloth using the official instructions for your CUDA version. Setup takes 5 to 10 minutes. After that, the training script is under 80 lines.
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset
import json
# Must be the same system prompt that was used to generate the training data
SYSTEM = """[YOUR SYSTEM PROMPT]"""
# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)
# Format training data in ChatML.
# Qwen's chat template puts <|im_end|> directly after the content, with no newline before it.
def format_example(ex):
    return (f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
            f"<|im_start|>user\n{ex['prompt']}<|im_end|>\n"
            f"<|im_start|>assistant\n{ex['completion']}<|im_end|>")
with open("training_data.jsonl", encoding="utf-8") as f:
    data = [json.loads(line) for line in f if line.strip()]
texts = [{"text": format_example(ex)} for ex in data]
dataset = Dataset.from_list(texts)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        output_dir="./output",
        # Use bf16 on GPUs that support it (Ampere and newer), fp16 otherwise
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
    ),
)
trainer.train()
On my RTX 5090, 120 examples at 3 epochs takes about 107 seconds. On an RTX 3080, expect 8 to 12 minutes. Either way, it is fast enough to try your training set many times in one afternoon. That speed changes how fine-tuning feels. It stops being a research project and starts being more like debugging.
# Export to GGUF Q4_K_M
model.save_pretrained_gguf(
    "my-model-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)
# Produces: my-model-gguf/my-model-q4_k_m.gguf (~4.7GB for 7B)
# The exact filename can differ between Unsloth versions; check the output directory.
Create a Modelfile and register it with Ollama:
FROM ./my-model-q4_k_m.gguf
SYSTEM """
[YOUR SYSTEM PROMPT]
"""
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 2048
ollama create my-model -f Modelfile
ollama run my-model "test input"
Test the model with your validation function on all training examples plus 10 to 20 inputs it has never seen. If the pass rate is below 80%, check your training data format. Most failures at this stage are ChatML formatting problems, not model quality problems.
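The pass-rate check described above can be sketched as a small harness. It assumes Ollama is listening on its default local port and that you pass in your own `validate` function. The helper names `ollama_generate` and `pass_rate` are illustrative, not part of any library.

```python
import json
import urllib.request
from typing import Callable

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # Ollama's default local port

def ollama_generate(prompt: str, model: str = "my-model") -> str:
    """Send one non-streaming generation request to the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]

def pass_rate(
    inputs: list[str],
    generate: Callable[[str], str],
    validate: Callable[[str], bool],
) -> tuple[float, list[str]]:
    """Return (fraction of inputs whose output validates, failing inputs)."""
    if not inputs:
        return 0.0, []
    failures = [p for p in inputs if not validate(generate(p))]
    return (len(inputs) - len(failures)) / len(inputs), failures
```

A call such as `rate, failures = pass_rate(held_out_prompts, ollama_generate, validate)` gives both the number to compare against the 80% threshold and the specific inputs to inspect when it falls short.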
On your local machine, create an SSH key pair. Add the public key to your VPS's ~/.ssh/authorized_keys. Then run the tunnel:
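# accept-new stores the VPS host key on first connect and refuses to connect if it later changes
ssh -N -R 11434:localhost:11434 user@YOUR_VPS_IP \
    -i ~/.ssh/vps_key \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes \
    -o StrictHostKeyChecking=accept-new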
On the VPS, make sure /etc/ssh/sshd_config contains:
AllowTcpForwarding yes
GatewayPorts no
Test the tunnel. From the VPS, run curl http://127.0.0.1:11434/api/tags. You should see Ollama's model list from your local machine.
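The same check can be scripted so a monitoring job on the VPS can run it. This is a minimal sketch using only the standard library. It assumes the `/api/tags` response is a JSON object with a `models` list whose entries carry a `name` field, and the function names are hypothetical.

```python
import json
import urllib.request

def parse_model_names(tags_body: bytes) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [m["name"] for m in json.loads(tags_body).get("models", [])]

def has_model(names: list[str], model: str) -> bool:
    """Match with or without a tag suffix, e.g. 'my-model' vs 'my-model:latest'."""
    return any(n == model or n.split(":", 1)[0] == model for n in names)

def tunnel_has_model(model: str, base_url: str = "http://127.0.0.1:11434") -> bool:
    """True if Ollama answers through the tunnel and lists `model`."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            names = parse_model_names(resp.read())
    except (OSError, ValueError, KeyError):
        # Tunnel down, Ollama stopped, timeout, or an unexpected response body
        return False
    return has_model(names, model)
```

Running `tunnel_has_model("my-model")` on the VPS returns `False` whenever the tunnel or the local machine is down, which makes it usable as a health probe.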
Run the tunnel as a persistent service. On Linux (local machine), create a systemd unit. On Windows, use Task Scheduler or NSSM to wrap the SSH command as a service. Set it to start on boot and restart on failure.
pip install fastapi uvicorn httpx
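# api.py
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import httpx, time, json
app = FastAPI()
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # reached through the SSH tunnel
@app.post("/ai-feature")
async def ai_feature(request: dict):
    prompt = request.get("prompt", "")
    t0 = time.perf_counter()
    try:
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.post(
                OLLAMA_URL,
                json={
                    "model": "my-model",
                    "prompt": prompt,
                    "stream": False,
                    "options": {"temperature": 0.3},
                },
            )
            resp.raise_for_status()
    except httpx.HTTPError:
        # Tunnel down, local machine offline, timeout, or Ollama returned an error status
        return JSONResponse(status_code=503, content={"error": "model unavailable"})
    elapsed = time.perf_counter() - t0
    output = resp.json().get("response", "")
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        # The model produced text that is not valid JSON
        return JSONResponse(status_code=502, content={"error": "invalid model output"})
    return JSONResponse(
        content={"result": parsed},
        headers={"X-Gen-Time": str(round(elapsed, 3))},
    )
# Run: uvicorn api:app --host 127.0.0.1 --port 8461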
Run uvicorn as a systemd service on the VPS so it restarts on its own.
Add a location block to your nginx config file:
location /api/ai-feature {
    proxy_pass http://127.0.0.1:8461/ai-feature;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 30s;
    proxy_connect_timeout 5s;
    # CORS: only needed if the frontend is served from a different origin.
    # Prefer a specific origin over * in production.
    add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Methods "POST, OPTIONS";
    add_header Access-Control-Allow-Headers "Content-Type";
}
nginx -t && nginx -s reload
In your frontend, add the button and the fetch call:
async function runAIFeature(prompt) {
  const btn = document.getElementById('ai-btn')
  btn.disabled = true
  btn.textContent = 'Running...'
  try {
    const res = await fetch('/api/ai-feature', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({prompt})
    })
    // fetch only rejects on network errors, so check the HTTP status explicitly
    if (!res.ok) throw new Error(`HTTP ${res.status}`)
    const data = await res.json()
    renderResult(data.result)
  } catch (e) {
    // model offline or request failed: show fallback or error
    showFallback()
  } finally {
    btn.disabled = false
    btn.textContent = 'AI Feature'
  }
}
The $0.24 spent on teacher data (120 examples at $0.002 each) is the whole API investment. After that, every call to the AI feature adds no API cost. At $0.002 per call (Haiku pricing), you break even after 120 calls. Every call after that is pure savings. I hit break-even on the first day I deployed.
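The break-even arithmetic can be reproduced with a few lines. This sketch counts API fees only and does not cost out electricity or hardware. The function names are illustrative, and amounts are passed as strings so that `Fraction` keeps the division exact.

```python
from fractions import Fraction
import math

def break_even_calls(upfront_cost: str, api_cost_per_call: str) -> int:
    """Number of calls after which avoided API fees cover the upfront cost."""
    return math.ceil(Fraction(upfront_cost) / Fraction(api_cost_per_call))

def net_savings(calls: int, upfront_cost: str, api_cost_per_call: str) -> float:
    """API fees avoided after `calls` calls, minus the upfront cost."""
    return float(calls * Fraction(api_cost_per_call) - Fraction(upfront_cost))

# Figures from this guide: $0.24 of teacher data, $0.002 per API call
print(break_even_calls("0.24", "0.002"))
print(net_savings(1000, "0.24", "0.002"))
```

With the figures above, the first call returns 120 and the second shows the net saving after 1,000 calls.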
For high-volume features (anything running more than a few hundred calls per month), this setup pays for itself within days. The only constraint is that your local GPU must stay on and connected. If your feature can show a fallback when the model is offline, this is a fully workable production setup. I run it this way in production right now.
The pattern works for any task. I applied it to prompt decomposition, but the same seven steps work for text classification, named entity extraction, code review, sentiment scoring, or any other narrow structured-output task you currently pay an API to run. The key insight is simple: if you can write a validation function for the output, you can train a model to produce it.