A Local LLM Is Powering My Production Website at 290 Tokens/Second

March 25, 2025 · 7 min read · local-llm ollama sinc-llm ai-transform

Contents

  1. Why run locally instead of cloud
  2. The full request chain
  3. SSH reverse tunnel mechanics
  4. The FastAPI layer on the VPS
  5. Performance metrics and logging
  6. Tradeoffs and when this breaks

x(t) = Σ x(nT) · sinc((t − nT) / T)

The sinc-LLM scatter engine samples your prompt at 6 frequency bands. Every AI Transform call runs this decomposition locally at zero marginal cost.

Why run locally instead of cloud

The obvious answer is cost. Running Ollama on local hardware with a model that lives on disk costs zero per inference. Cloud APIs charge per token. For a feature that runs on every prompt decomposition, the per-call cost compounds fast. I was paying $0.002 per call and it was adding up to something I couldn't justify long-term.

But cost isn't the only reason. I realized there are three properties of the local setup that matter specifically for this feature:

Latency floor. A cloud API call involves DNS, a TLS handshake, queue time at the provider's infrastructure, and response serialization. A local model call involves none of that: the RTX 5090 sits in the machine that terminates the SSH tunnel, and round-trip latency from the VPS to the model is 10-30ms, not 200-800ms.

Control. A fine-tuned local model gives reproducible output: temperature 0.3 with a fixed seed produces the same sinc JSON for the same input every time. Cloud APIs drift with model updates. I don't want my scatter engine's output format to silently change when Anthropic ships a new Haiku version.

No rate limits. My sinc-scatter model can serve as many concurrent requests as the RTX 5090's VRAM and compute allow. No 429s, no tier limits, no burst pricing.
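To make the reproducibility point concrete, here is a minimal sketch of the request body I mean. The model name matches the post; the seed value 42 is an arbitrary placeholder, and the exact options I run in production may differ:

```python
# Sketch: pinning Ollama's sampling options so repeated calls on the
# same prompt return the same sinc JSON. Seed value is arbitrary.
def build_generate_payload(prompt: str) -> dict:
    """Build an /api/generate request body with pinned sampling options."""
    return {
        "model": "sinc-scatter",
        "prompt": prompt,
        "stream": False,
        # temperature + seed together make decoding reproducible
        "options": {"temperature": 0.3, "seed": 42},
    }
```

Because the payload is pure data, two calls with the same prompt are byte-identical, which is exactly the property cloud APIs can't promise across silent model updates.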

- 290 tokens/sec (RTX 5090)
- 4.7 GB model size (Q4_K_M GGUF)
- ~1.5 s median response time
- $0 marginal cost per call

The full request chain

Here is every hop a request takes from browser to model and back:

Browser (sincllm.com)
  │  POST /api/ai-transform (HTTPS)
  ▼
nginx on VPS (public IP)
  │  proxy_pass http://127.0.0.1:8461 (HTTP, loopback)
  ▼
FastAPI server on VPS (port 8461)
  │  POST http://127.0.0.1:11434/api/generate (HTTP, loopback via SSH tunnel)
  ▼
SSH reverse tunnel (port 11434 → local machine)
  │  forwarded to localhost:11434 on local machine
  ▼
Ollama daemon (local machine, port 11434)
  │  loads sinc-scatter model → CUDA inference
  ▼
RTX 5090 (VRAM: 24GB, model loaded at startup)
  ▲  token stream
  └─ response travels back up the chain

Six hops. The slow ones are the SSH tunnel, which adds ~15ms relative to local loopback, and Ollama's first-token latency, which stays around ~50ms because the model is pre-loaded in VRAM. Everything else is local loopback, which is microsecond-scale.
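The tunnel's share of that budget is easy to measure in isolation: time a bare TCP handshake, with no HTTP and no model involved. This is a small sketch, not part of the production stack; run it on the VPS against 127.0.0.1:11434 to isolate the tunnel's connect overhead:

```python
import socket
import time

def tcp_connect_latency_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Time a bare TCP handshake to (host, port), in milliseconds.
    On the VPS, pointing this at 127.0.0.1:11434 measures the SSH
    tunnel's connect overhead without touching Ollama or the model."""
    t0 = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connect succeeded; we only wanted the handshake timing
    return (time.perf_counter() - t0) * 1000.0
```

Comparing the result against a connect to a plain local port gives the tunnel's added latency directly.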

SSH reverse tunnel mechanics

The local machine (RTX 5090 workstation) initiates an SSH connection to the VPS. It uses -R (remote port forwarding) to bind port 11434 on the VPS's loopback interface to port 11434 on the local machine's loopback interface.

ssh -N -R 11434:localhost:11434 user@vps-ip \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes

The VPS's sshd must have GatewayPorts no and AllowTcpForwarding yes. With GatewayPorts no, the forwarded port only binds on the VPS's loopback — external clients can't reach it directly. Only the FastAPI process running on the VPS loopback can connect to 127.0.0.1:11434, which then forwards to Ollama on the local machine.
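For reference, those two directives live in the VPS's sshd configuration. A minimal sketch, assuming the stock file location (confirm the path on your distro, and reload sshd after editing):

```
# /etc/ssh/sshd_config (VPS side)
GatewayPorts no          # forwarded ports bind to loopback only
AllowTcpForwarding yes   # required for -R remote forwarding
```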

I run this as a persistent process with automatic reconnect:

import subprocess
import time

def maintain_tunnel():
    """Keep the reverse tunnel alive: if ssh exits, wait 5s and redial."""
    while True:
        proc = subprocess.Popen([
            "ssh", "-N", "-R", "11434:localhost:11434",
            "-o", "ServerAliveInterval=30",
            "-o", "ServerAliveCountMax=3",
            "-o", "ExitOnForwardFailure=yes",  # exit fast if the remote bind fails
            "-o", "StrictHostKeyChecking=no",
            "-i", "/path/to/key",
            "user@vps-ip"
        ])
        proc.wait()  # blocks until ssh dies (network drop, VPS reboot, etc.)
        print(f"Tunnel died (exit {proc.returncode}), reconnecting in 5s")
        time.sleep(5)

The tunnel process runs under a system service (Windows Task Scheduler on the local machine, or systemd on Linux). If the local machine reboots, the service restarts the tunnel on login. The VPS's FastAPI returns 503 while the tunnel is down — the frontend shows an appropriate error message.
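On the Linux side, the reconnect loop above can be supervised with a unit file along these lines. This is a sketch, not my exact config; the script path and unit name are placeholders:

```
# /etc/systemd/system/ollama-tunnel.service (placeholder name)
[Unit]
Description=Reverse SSH tunnel to VPS for Ollama
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/bin/python3 /opt/tunnel/maintain_tunnel.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Restart=always is belt-and-suspenders on top of the script's own retry loop: if the Python process itself dies, systemd redials it.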

The FastAPI layer on the VPS

The FastAPI server is thin. It receives a POST with the raw prompt, forwards it to Ollama, returns the parsed response, and attaches performance headers. The heavy lifting is all in Ollama and the model.

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import httpx, time, json

app = FastAPI()

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "sinc-scatter"

@app.post("/scatter")
async def scatter(request: Request):
    body = await request.json()
    prompt = body.get("prompt", "")

    t0 = time.time()
    try:
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.post(OLLAMA_URL, json={
                "model": MODEL,
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.3}
            })
    except httpx.ConnectError:
        # Tunnel is down: nothing is listening on 127.0.0.1:11434
        return JSONResponse(content={"error": "model offline"}, status_code=503)

    elapsed = time.time() - t0
    result = resp.json()
    text = result.get("response", "")
    tokens = result.get("eval_count", 0)
    tok_per_s = round(tokens / elapsed, 1) if elapsed > 0 else 0

    # Append one JSONL record per request for offline analysis
    with open("/var/log/scatter-api.jsonl", "a") as f:
        f.write(json.dumps({
            "ts": t0, "prompt_len": len(prompt),
            "tokens": tokens, "elapsed": round(elapsed, 3),
            "tok_per_s": tok_per_s
        }) + "\n")

    return JSONResponse(
        content={"sinc": json.loads(text)},
        headers={
            "X-Tok-Per-S": str(tok_per_s),
            "X-Gen-Time": str(round(elapsed, 3)),
            "X-Tokens": str(tokens)
        }
    )

Response headers expose X-Tok-Per-S, X-Gen-Time, and X-Tokens. The frontend can display these as diagnostic metadata. The JSONL log on the VPS is the persistent telemetry record.
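On the client side, reading those headers back is a one-liner per field. A small sketch (the header names match the handler above; the fallback-to-zero behavior is my choice for this example, not part of the API):

```python
# Sketch: turn the /scatter response headers into typed metrics.
def parse_diagnostics(headers) -> dict:
    """Read the custom performance headers; missing ones fall back to zero."""
    return {
        "tok_per_s": float(headers.get("X-Tok-Per-S", 0)),
        "gen_time_s": float(headers.get("X-Gen-Time", 0)),
        "tokens": int(headers.get("X-Tokens", 0)),
    }
```

Any header-mapping object with a .get method works here, including httpx and requests response headers.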

Performance metrics and logging

I log every request to /var/log/scatter-api.jsonl with timestamp, prompt length, token count, elapsed time, and tokens/second. This gives me a real performance distribution rather than a benchmark number.
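Turning that log into a distribution is a few lines of stdlib Python. A minimal sketch against the fields the handler writes (the nearest-rank p95 index is one reasonable convention among several):

```python
import json
import statistics

def latency_stats(jsonl_lines):
    """Median and p95 of the 'elapsed' field across JSONL log records,
    matching the fields written by the /scatter handler."""
    elapsed = sorted(
        json.loads(line)["elapsed"] for line in jsonl_lines if line.strip()
    )
    p95_idx = int(round(0.95 * (len(elapsed) - 1)))  # nearest-rank p95
    return {
        "median": statistics.median(elapsed),
        "p95": elapsed[p95_idx],
    }
```

Feeding it the live file is just latency_stats(open("/var/log/scatter-api.jsonl")).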

Across a sample of test requests, median response time sits around 1.5 seconds at roughly 290 tokens/second.

For a structured output task that replaces a typing action, 1.5 seconds is fast enough. Users see a loading state, then the sinc JSON populates all 6 bands simultaneously.

Tradeoffs and when this breaks

This architecture has one hard dependency: the local machine must be on and connected. If my RTX 5090 workstation is off, the SSH tunnel is down, and AI Transform returns 503. I handle this gracefully in the frontend — the button becomes inactive with a "model offline" tooltip, and the standard client-side Transform still works.

The sinc-scatter model runs from VRAM — it's pre-loaded when Ollama starts. Cold start (loading the 4.7GB GGUF into VRAM) takes about 4 seconds. After that, first-token latency drops to ~50ms. I keep Ollama running as a persistent service so the model is always warm.

There's also a single-user concurrency limitation. Ollama's default configuration handles one generation at a time. For a production site with high concurrent load, this would be a bottleneck. At current traffic levels it isn't — sinc JSON generation is fast enough that queue time is negligible. If concurrent load grows, the right solution is llama.cpp's server with --parallel or a batched inference setup. I'll cross that bridge when I get there.
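For the record, the llama.cpp route would look roughly like this. A sketch only: the flag names come from llama.cpp's llama-server, while the model path and slot count are placeholders:

```
# serve the GGUF with 4 parallel decode slots
llama-server -m /models/sinc-scatter-Q4_K_M.gguf \
    --port 11434 --parallel 4
```

Binding it to the same port Ollama uses today would let the tunnel and FastAPI layer stay untouched.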

The VPS itself is stateless — it's just a proxy. If the VPS goes down, nginx returns 502 for the AI Transform endpoint. The VPS can be replaced or restarted in under a minute without any state loss, because all state (model weights, logs) is on my local machine.

What struck me while building this: the architecture for zero-cost AI features is surprisingly simple. One capable local GPU, one cheap VPS as the public endpoint, SSH as the bridge. The model runs where the hardware is. The VPS is just the front door. I didn't need a cloud inference provider at all.

290 tokens per second, zero per call

Try the AI Transform feature and see what runs behind it.

Try AI Transform →