The simple answer is cost. Running Ollama on local hardware costs zero per inference. The model lives on disk. Cloud APIs charge per token. This feature runs on every prompt. The cost adds up fast. I was paying $0.002 per call. That number kept growing, and I could not justify it.
But cost is not the only reason. The local setup has three properties that matter for this feature:
Latency floor. A cloud API call needs DNS, a TLS handshake, queue time at the provider, and response packaging. A local model call needs none of that. The RTX 5090 is on the same local network as the SSH tunnel endpoint. Round-trip time from the VPS to the model is 10-30ms, not 200-800ms.
Control. The local model changes only when I change it. Temperature 0.3 with a fixed seed gives the same sinc JSON for the same input. Cloud APIs change when the provider updates the model. I do not want my scatter engine to silently break when Anthropic ships a new Haiku version.
No rate limits. The only ceiling on my sinc-scatter model is what the RTX 5090's VRAM and compute allow. No 429 errors, no tier limits, no burst pricing.
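One way to check the repeatability claim is to send the same prompt twice with identical sampling options and compare the outputs. The following is a minimal sketch, assuming Ollama's `/api/generate` endpoint on its default port and its `seed` and `temperature` options. The seed value 42, the helper names, and the injectable `send` parameter are illustrative and not part of the original setup.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # default Ollama port


def build_payload(prompt, model="sinc-scatter", seed=42, temperature=0.3):
    """Request body with sampling pinned; seed=42 is an arbitrary example value."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "seed": seed},
    }


def ollama_generate(payload, url=OLLAMA_URL, timeout=30):
    """POST the payload to Ollama and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def is_repeatable(prompt, send=ollama_generate, runs=2):
    """True if every run returns identical text for the same payload."""
    outputs = {send(build_payload(prompt)) for _ in range(runs)}
    return len(outputs) == 1
```

Calling `is_repeatable("some prompt")` against a running Ollama instance gives a quick regression check after a driver, Ollama, or model update. Passing a stub as `send` lets the comparison logic be exercised without a GPU.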
Here is every step a request takes from the browser to the model and back:
Six steps. The slow ones are the SSH tunnel (adds about 15ms compared to local loopback) and Ollama first-token time (the model is already loaded in VRAM, so this is fast, about 50ms). Everything else runs on local loopback, which is microsecond-scale.
The local machine (RTX 5090 workstation) opens an SSH connection to the VPS. It uses -R (remote port forwarding) to bind port 11434 on the VPS loopback to port 11434 on the local machine loopback.
ssh -N -R 11434:localhost:11434 \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes \
    user@vps-ip
The VPS sshd needs GatewayPorts no and AllowTcpForwarding yes. With GatewayPorts no, the forwarded port only binds on the VPS loopback. Outside clients cannot reach it directly. Only the FastAPI process on the VPS loopback can connect to 127.0.0.1:11434. That connection then forwards to Ollama on the local machine.
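To confirm from the VPS side that the forwarded port is actually bound on loopback, a plain TCP connect is enough. This is a small sketch using only the standard library; `tunnel_is_up` is a hypothetical helper name, and the one-second timeout is an arbitrary choice.

```python
import socket
import time


def tunnel_is_up(host="127.0.0.1", port=11434, timeout=1.0):
    """Return (reachable, connect_ms) for a TCP connect to the forwarded port."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
        reachable = True
    except OSError:
        # Connection refused or timed out: nothing is listening on the port
        reachable = False
    return reachable, (time.perf_counter() - start) * 1000.0
```

A successful connect only shows that sshd is listening on the VPS loopback. It does not prove that Ollama is answering at the far end of the tunnel, so an HTTP request to Ollama remains the stronger check.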
I run this as a persistent process that reconnects automatically:
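import subprocess
import time


def maintain_tunnel():
    """Keep the reverse tunnel alive, restarting ssh whenever it exits."""
    while True:
        proc = subprocess.Popen([
            "ssh", "-N", "-R", "11434:localhost:11434",
            "-o", "ServerAliveInterval=30",
            "-o", "ServerAliveCountMax=3",
            # Exit if the remote port cannot be bound, so this loop retries
            "-o", "ExitOnForwardFailure=yes",
            # Skips host key verification; pin the VPS host key where possible
            "-o", "StrictHostKeyChecking=no",
            "-i", "/path/to/key",
            "user@vps-ip"
        ])
        proc.wait()  # blocks until ssh exits
        print(f"Tunnel died (exit {proc.returncode}), reconnecting in 5s")
        time.sleep(5)


if __name__ == "__main__":
    maintain_tunnel()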
The tunnel runs under a system service (Windows Task Scheduler on the local machine, or systemd on Linux). If the local machine reboots, the service restarts the tunnel on login. The VPS FastAPI returns 503 while the tunnel is down. The frontend shows an error message in that case.
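The reconnect loop above sleeps a fixed five seconds between attempts. If the VPS is unreachable for a long stretch, exponential backoff is a common alternative that avoids hammering it. The sketch below is a swapped-in technique, not the original method; `backoff_delays` and its default values are assumptions.

```python
def backoff_delays(base=5.0, factor=2.0, cap=60.0):
    """Yield reconnect delays: base, base*factor, ... never above `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)
```

In the loop, `delays = backoff_delays()` would be created before each connection attempt series and `time.sleep(next(delays))` would replace `time.sleep(5)`. Creating a fresh generator after a connection that stayed up for a while resets the delay to the base value.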
The FastAPI server is small. It gets a POST with the raw prompt, sends it to Ollama, returns the complete response in one piece, and adds performance headers. Ollama and the model do all the heavy work.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import httpx, time, json

app = FastAPI()
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "sinc-scatter"


@app.post("/scatter")
async def scatter(request: Request):
    body = await request.json()
    prompt = body.get("prompt", "")
    t0 = time.time()
    try:
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.post(OLLAMA_URL, json={
                "model": MODEL,
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.3}
            })
        resp.raise_for_status()
    except httpx.HTTPError:
        # Tunnel down or Ollama unreachable: report the model as offline
        return JSONResponse(status_code=503, content={"error": "model offline"})
    elapsed = time.time() - t0
    result = resp.json()
    text = result.get("response", "")
    tokens = result.get("eval_count", 0)
    tok_per_s = round(tokens / elapsed, 1) if elapsed > 0 else 0
    # Log to JSONL
    with open("/var/log/scatter-api.jsonl", "a") as f:
        f.write(json.dumps({
            "ts": t0, "prompt_len": len(prompt),
            "tokens": tokens, "elapsed": round(elapsed, 3),
            "tok_per_s": tok_per_s
        }) + "\n")
    try:
        sinc = json.loads(text)
    except json.JSONDecodeError:
        # The model returned text that is not valid JSON
        return JSONResponse(status_code=502, content={"error": "invalid model output"})
    return JSONResponse(
        content={"sinc": sinc},
        headers={
            "X-Tok-Per-S": str(tok_per_s),
            "X-Gen-Time": str(round(elapsed, 3)),
            "X-Tokens": str(tokens)
        }
    )
Response headers carry X-Tok-Per-S, X-Gen-Time, and X-Tokens. The frontend can show these as diagnostic data. The JSONL log on the VPS is the permanent record of every call.
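A client that wants to display these diagnostics has to turn the header strings back into numbers. This is a small sketch, assuming the three header names set by the handler above; `parse_perf_headers` is a hypothetical helper. The lookup is case-insensitive because HTTP header names are.

```python
def parse_perf_headers(headers):
    """Convert the X-Tok-Per-S, X-Gen-Time and X-Tokens headers into numbers."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return {
        "tok_per_s": float(lowered.get("x-tok-per-s", 0)),
        "gen_time_s": float(lowered.get("x-gen-time", 0)),
        "tokens": int(lowered.get("x-tokens", 0)),
    }
```

Missing headers fall back to zero, so the function also works on an error response that carries no performance data.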
I log every request to /var/log/scatter-api.jsonl with timestamp, prompt length, token count, time taken, and tokens per second. This gives a real picture of performance, not just a single benchmark number.
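Because each log line is a standalone JSON object, summarizing the file takes only the standard library. The sketch below assumes the `elapsed` and `tok_per_s` fields written by the handler; `summarize_log` is a hypothetical helper, and the nearest-rank method for the 95th percentile is one choice among several.

```python
import json
import statistics


def summarize_log(lines):
    """Summarize JSONL records that carry 'elapsed' and 'tok_per_s' fields."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return {"count": 0}
    elapsed = sorted(r["elapsed"] for r in records)
    # Nearest-rank p95: rank = ceil(0.95 * n), computed with integer arithmetic
    p95_rank = (95 * len(elapsed) + 99) // 100
    rates = [r["tok_per_s"] for r in records]
    return {
        "count": len(records),
        "median_elapsed": statistics.median(elapsed),
        "p95_elapsed": elapsed[p95_rank - 1],
        "mean_tok_per_s": round(statistics.fmean(rates), 1),
    }
```

Usage would look like `with open("/var/log/scatter-api.jsonl") as f: print(summarize_log(f))`. Reporting the median alongside the 95th percentile shows both the typical request and the slow tail.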
From a sample of test requests:
For a task that fills in structured output, 1.5 seconds is fast enough. Users see a loading state. Then the sinc JSON fills all 6 bands at once.
This setup has one hard requirement: the local machine must be on and connected. If my RTX 5090 workstation is off, the SSH tunnel goes down. AI Transform then returns 503. I handle this in the frontend. The button becomes inactive with a "model offline" tooltip. The standard client-side Transform still works.
There is also a concurrency limit. In my setup, Ollama handles one generation at a time. On a site with heavy traffic, this would be a bottleneck. At current traffic levels it is not. sinc JSON generation is fast enough that queue time is small. If load grows, the fix is llama.cpp server with --parallel or a batched inference setup. I will address that when needed.
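Until then, the proxy itself can make the queue explicit so that extra requests wait on the VPS instead of piling up at the model. This is a minimal sketch using `asyncio.Semaphore`; the `GenerationGate` class and its counters are hypothetical and not part of the original server.

```python
import asyncio


class GenerationGate:
    """Allow at most `limit` generations in flight; other callers wait their turn."""

    def __init__(self, limit=1):
        self._sem = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0  # highest number of simultaneous generations seen

    async def run(self, make_call):
        """Await `make_call()` once a slot is free and return its result."""
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await make_call()
            finally:
                self.in_flight -= 1
```

In the handler, the Ollama request could be wrapped as `resp = await gate.run(lambda: client.post(OLLAMA_URL, json=payload))`, where `gate` and `payload` are assumed names. Raising `limit` later matches a backend that can serve several generations in parallel.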
The VPS is close to stateless. It is just a proxy. If the FastAPI process goes down, nginx returns 502 for the AI Transform endpoint. The VPS can be replaced or restarted in under a minute. The model weights live on my local machine. The only thing the VPS holds is the request log at /var/log/scatter-api.jsonl.
What struck me while building this: the setup for zero-cost AI features is surprisingly simple. One capable local GPU, one cheap VPS as the public endpoint, SSH as the bridge. The model runs where the hardware is. The VPS is just the front door. I did not need a cloud inference provider at all.