The obvious answer is cost. Running Ollama on local hardware with a model that lives on disk costs zero per inference. Cloud APIs charge per token. For a feature that runs on every prompt decomposition, the per-call cost compounds fast. I was paying $0.002 per call and it was adding up to something I couldn't justify long-term.
But cost isn't the only reason. I realized there are three properties of the local setup that matter specifically for this feature:
Latency floor. A cloud API call involves DNS, TLS handshake, queue time at the provider's infrastructure, and response serialization. A local model call involves none of that. The RTX 5090 is on the same LAN as the SSH tunnel endpoint. Round-trip latency from the VPS to the model is 10-30ms, not 200-800ms.
Control. A fine-tuned local model produces deterministic output. Temperature 0.3 with a fixed seed gives the same sinc JSON for the same input every time. Cloud APIs drift with model updates. I don't want my scatter engine's output format to silently change when Anthropic ships a new Haiku version.
No rate limits. My sinc-scatter model can serve as many concurrent requests as the RTX 5090's VRAM and compute allow. No 429s, no tier limits, no burst pricing.
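Both of those knobs live in the request options Ollama accepts. A minimal sketch of a reproducibility-pinned request payload (the helper name, example prompt, and seed value are mine, not from the production setup):

```python
def build_scatter_request(prompt: str) -> dict:
    """Build a reproducible Ollama /api/generate payload.

    The model name matches the article; the seed value (42) is an
    illustrative choice -- any fixed integer pins the sampler.
    """
    return {
        "model": "sinc-scatter",
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.3,  # low but nonzero; the seed makes it repeatable
            "seed": 42,          # fixed seed -> same tokens for the same input
        },
    }

payload = build_scatter_request("decompose: pulse the hero title")
```

The same payload against the same model version yields the same output, which is exactly the stability a structured-output pipeline needs.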
Here is every hop a request takes from browser to model and back:

browser → nginx (VPS) → FastAPI (VPS) → 127.0.0.1:11434 (VPS sshd) → SSH tunnel → Ollama (local 127.0.0.1:11434) → model in VRAM
Six hops. The slow ones are the SSH tunnel (adds ~15ms of latency relative to local loopback) and Ollama's first-token latency (~50ms, fast because the model stays pre-loaded in VRAM). Everything else is local loopback, which is microsecond-scale.
The local machine (RTX 5090 workstation) initiates an SSH connection to the VPS. It uses -R (remote port forwarding) to bind port 11434 on the VPS's loopback interface to port 11434 on the local machine's loopback interface.
ssh -N -R 11434:localhost:11434 user@vps-ip \
-o ServerAliveInterval=30 \
-o ServerAliveCountMax=3 \
-o ExitOnForwardFailure=yes
The VPS's sshd must have GatewayPorts no and AllowTcpForwarding yes. With GatewayPorts no, the forwarded port only binds on the VPS's loopback — external clients can't reach it directly. Only the FastAPI process running on the VPS loopback can connect to 127.0.0.1:11434, which then forwards to Ollama on the local machine.
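The relevant sshd_config lines on the VPS, as a minimal sketch (a dedicated low-privilege tunnel user or a Match block would be reasonable hardening on top, and ClientAliveInterval is a suggested addition, not from the original setup):

```
# /etc/ssh/sshd_config (VPS)
GatewayPorts no          # forwarded ports bind to loopback only
AllowTcpForwarding yes   # permit -R remote forwards
ClientAliveInterval 30   # let the server drop dead tunnels, too
```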
I run this as a persistent process with automatic reconnect:
import subprocess
import time

def maintain_tunnel():
    """Keep the reverse tunnel alive, reconnecting whenever ssh exits."""
    while True:
        proc = subprocess.Popen([
            "ssh", "-N", "-R", "11434:localhost:11434",
            "-o", "ServerAliveInterval=30",
            "-o", "ServerAliveCountMax=3",
            "-o", "ExitOnForwardFailure=yes",  # die (and retry) if the bind fails
            "-o", "StrictHostKeyChecking=no",
            "-i", "/path/to/key",
            "user@vps-ip",
        ])
        proc.wait()  # blocks until the tunnel process exits
        print(f"Tunnel died (exit {proc.returncode}), reconnecting in 5s")
        time.sleep(5)
The tunnel process runs under a system service (Windows Task Scheduler on the local machine, or systemd on Linux). If the local machine reboots, the service restarts the tunnel on login. The VPS's FastAPI returns 503 while the tunnel is down — the frontend shows an appropriate error message.
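On the Linux side, the reconnect loop can be delegated to systemd entirely, since Restart=always does the same job as the Python wrapper. A minimal unit sketch (the unit name, key path, and user@vps-ip are placeholders):

```
# /etc/systemd/system/scatter-tunnel.service (sketch)
[Unit]
Description=Reverse SSH tunnel to VPS for Ollama
After=network-online.target

[Service]
ExecStart=/usr/bin/ssh -N -R 11434:localhost:11434 \
    -o ServerAliveInterval=30 -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes -i /path/to/key user@vps-ip
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```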
The FastAPI server is thin. It receives a POST with the raw prompt, forwards it to Ollama, returns the generated JSON, and attaches performance headers. The heavy lifting happens in Ollama and the model.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import httpx, time, json

app = FastAPI()

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "sinc-scatter"

@app.post("/scatter")
async def scatter(request: Request):
    body = await request.json()
    prompt = body.get("prompt", "")

    # Forward the prompt to Ollama and time the full generation
    t0 = time.time()
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(OLLAMA_URL, json={
            "model": MODEL,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0.3},
        })
    elapsed = time.time() - t0

    result = resp.json()
    text = result.get("response", "")
    tokens = result.get("eval_count", 0)
    tok_per_s = round(tokens / elapsed, 1) if elapsed > 0 else 0

    # Log to JSONL -- the persistent telemetry record
    with open("/var/log/scatter-api.jsonl", "a") as f:
        f.write(json.dumps({
            "ts": t0, "prompt_len": len(prompt),
            "tokens": tokens, "elapsed": round(elapsed, 3),
            "tok_per_s": tok_per_s,
        }) + "\n")

    return JSONResponse(
        content={"sinc": json.loads(text)},
        headers={
            "X-Tok-Per-S": str(tok_per_s),
            "X-Gen-Time": str(round(elapsed, 3)),
            "X-Tokens": str(tokens),
        },
    )
Response headers expose X-Tok-Per-S, X-Gen-Time, and X-Tokens. The frontend can display these as diagnostic metadata. The JSONL log on the VPS is the persistent telemetry record.
I log every request to /var/log/scatter-api.jsonl with timestamp, prompt length, token count, elapsed time, and tokens/second. This gives me a real performance distribution rather than a benchmark number.
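Reading that distribution back out of the log is a few lines. A sketch using nearest-rank percentiles (the sample records below are made up to show the shape of the data, not real measurements):

```python
import json
import math

def latency_percentiles(lines, pcts=(50, 95)):
    """Nearest-rank percentiles of elapsed time from scatter-api JSONL lines."""
    times = sorted(json.loads(line)["elapsed"] for line in lines if line.strip())
    result = {}
    for p in pcts:
        rank = math.ceil(p / 100 * len(times))     # nearest-rank definition
        result[f"p{p}"] = times[max(0, rank - 1)]  # 1-based rank -> 0-based index
    return result

# Illustrative records only -- the real source is /var/log/scatter-api.jsonl
sample = [
    json.dumps({"ts": 0, "prompt_len": 40, "tokens": 120,
                "elapsed": e, "tok_per_s": 80.0})
    for e in (1.2, 1.4, 1.5, 1.6, 2.9)
]
stats = latency_percentiles(sample)
```

The gap between p50 and p95 is the interesting number: it separates the typical path from cold-ish outliers.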
Across a sample of test requests, end-to-end generation time lands around 1.5 seconds. For a structured output task that replaces a typing action, that is fast enough: users see a loading state, then the sinc JSON populates all 6 bands simultaneously.
This architecture has one hard dependency: the local machine must be on and connected. If my RTX 5090 workstation is off, the SSH tunnel is down, and AI Transform returns 503. I handle this gracefully in the frontend — the button becomes inactive with a "model offline" tooltip, and the standard client-side Transform still works.
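Detecting "tunnel down" on the VPS reduces to a loopback connect check, since nothing listens on the forwarded port while the tunnel is gone. A sketch (the function name and timeout are my choices, not from the original code):

```python
import socket

def tunnel_up(host="127.0.0.1", port=11434, timeout=0.25):
    """Return True if something is listening on the forwarded loopback port.

    While the SSH tunnel is down, nothing listens on the VPS's 11434,
    so the connect is refused and the API can answer 503 immediately
    instead of waiting for an upstream timeout.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

In the FastAPI handler this becomes a guard before forwarding: if `tunnel_up()` is false, raise an `HTTPException` with status 503 and a "model offline" detail for the frontend to display.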
There's also a single-user concurrency limitation. Ollama's default configuration handles one generation at a time. For a production site with high concurrent load, this would be a bottleneck. At current traffic levels it isn't — sinc JSON generation is fast enough that queue time is negligible. If concurrent load grows, the right solution is llama.cpp's server with --parallel or a batched inference setup. I'll cross that bridge when I get there.
The VPS itself is stateless — it's just a proxy. If the VPS goes down, nginx returns 502 for the AI Transform endpoint. The VPS can be replaced or restarted in under a minute without any state loss, because all state (model weights, logs) is on my local machine.
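The nginx side of that proxy is a single location block. A sketch assuming FastAPI listens on 127.0.0.1:8000 (the port and path here are my assumptions, not from the original config):

```
location /scatter {
    proxy_pass http://127.0.0.1:8000;
    proxy_read_timeout 30s;  # match the 30s httpx timeout downstream
}
```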
What struck me while building this is how simple the architecture for zero-cost AI features turns out to be. One capable local GPU, one cheap VPS as the public endpoint, SSH as the bridge. The model runs where the hardware is; the VPS is just the front door. I didn't need a cloud inference provider at all.
Try the AI Transform feature and see what runs behind it.
Try AI Transform →