A Local LLM Is Powering My Production Website at 290 Tokens/Second

March 25, 2025 · 7 min read · local-llm ollama sinc-llm ai-transform

Contents

  1. Why run locally instead of cloud
  2. The full request chain
  3. SSH reverse tunnel mechanics
  4. The FastAPI layer on the VPS
  5. Performance metrics and logging
  6. Tradeoffs and when this breaks
x(t) = Σ x(nT) · sinc((t − nT) / T)
The sinc-LLM scatter engine samples your prompt at 6 frequency bands. Every AI Transform call runs this decomposition locally at zero marginal cost.

Why run locally instead of cloud

The simple answer is cost. Running Ollama on local hardware has zero marginal cost per inference. The model lives on disk. Cloud APIs charge per token, and this feature runs on every prompt. I was paying $0.002 per call. The total kept growing with usage, and I could not justify it.
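To put the per-call price in perspective, here is a back-of-the-envelope sketch. The $0.002 figure comes from the paragraph above; the daily call volumes are hypothetical and only show how the bill scales.

```python
PRICE_PER_CALL = 0.002  # USD per call, the cloud price quoted above


def monthly_cloud_cost(calls_per_day: int, days: int = 30) -> float:
    """Cloud API spend for one month at a flat per-call price."""
    return round(calls_per_day * days * PRICE_PER_CALL, 2)


# Hypothetical volumes, not measured traffic.
for calls in (1_000, 10_000, 100_000):
    print(f"{calls:>7} calls/day -> ${monthly_cloud_cost(calls):,.2f}/month")
```

The local setup replaces this variable cost with fixed ones (hardware and electricity), which is what "zero marginal cost" means here.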

But cost is not the only reason. The local setup has three properties that matter for this feature:

Latency floor. A cloud API call needs DNS, a TLS handshake, queue time at the provider, and response packaging. A local model call needs none of that. The RTX 5090 is on the same local network as the SSH tunnel endpoint. Round-trip time from the VPS to the model is 10-30ms, not 200-800ms.

Control. A fine-tuned local model whose weights never change behind my back gives repeatable output. Temperature 0.3 with a fixed seed gives the same sinc JSON for the same input. Cloud APIs change when the provider updates the model. I do not want my scatter engine to silently break when Anthropic ships a new Haiku version.
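As a minimal sketch of what pinning the sampling looks like, the request body below sets both values through the `options` field of Ollama's /api/generate. The helper name `build_generate_payload` and the seed value 42 are assumptions for illustration; the seed could equally be set once in the Modelfile with `PARAMETER seed`.

```python
def build_generate_payload(prompt: str, seed: int = 42) -> dict:
    """Request body for Ollama's /api/generate with sampling pinned down."""
    return {
        "model": "sinc-scatter",
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.3,  # value from the text above
            "seed": seed,        # assumed value; any fixed integer works
        },
    }


# The same input always produces the same request, which is the precondition
# for getting the same output back from an unchanged model.
assert build_generate_payload("hello") == build_generate_payload("hello")
```

Repeatability still assumes the same model file, Ollama version, and hardware; changing any of those can shift the output.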

No rate limits. The only ceiling on my sinc-scatter model is what the RTX 5090's VRAM and compute allow. No 429 errors, no tier limits, no burst pricing.

290 Tokens/sec (RTX 5090)
4.7GB Model size (Q4_K_M GGUF)
~1.5s Median response time
$0 Marginal cost per call

The full request chain

Here is every step a request takes from the browser to the model and back:

Browser (sincllm.com)
    │
    │  POST /api/ai-transform (HTTPS)
    ▼
nginx on VPS (public IP)
    │
    │  proxy_pass http://127.0.0.1:8461 (HTTP, loopback)
    ▼
FastAPI server on VPS (port 8461)
    │
    │  POST http://127.0.0.1:11434/api/generate (HTTP, loopback via SSH tunnel)
    ▼
SSH reverse tunnel (port 11434 → local machine)
    │
    │  forwarded to localhost:11434 on local machine
    ▼
Ollama daemon (local machine, port 11434)
    │
    │  loads sinc-scatter model → CUDA inference
    ▼
RTX 5090 (VRAM: 32GB, model loaded at startup)
    │
    ▲  token stream
    │
    └─ response travels back up the chain

Six steps. The measurable costs are the SSH tunnel (about 15ms more than local loopback) and Ollama's first-token time (about 50ms, because the model is already loaded in VRAM). The hops inside each machine run over local loopback, which is microsecond-scale.
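The numbers above can be combined into a rough latency budget. This is a sketch under simplifying assumptions: it treats the tunnel and first-token costs as fixed, ignores prompt processing time, and uses a hypothetical 400-token reply.

```python
TUNNEL_OVERHEAD_S = 0.015  # SSH tunnel vs. local loopback, from the text
FIRST_TOKEN_S = 0.050      # Ollama first-token time with a warm model
TOKENS_PER_S = 290         # generation speed on the RTX 5090


def estimate_response_time(output_tokens: int) -> float:
    """Fixed overheads plus generation time, in seconds."""
    return TUNNEL_OVERHEAD_S + FIRST_TOKEN_S + output_tokens / TOKENS_PER_S


# A hypothetical 400-token reply: generation dominates, overhead is ~65ms.
print(f"{estimate_response_time(400):.2f}s")
```

Under these assumptions almost all of the response time is token generation, so the tunnel is not where optimization effort would pay off.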

SSH reverse tunnel mechanics

The local machine (RTX 5090 workstation) opens an SSH connection to the VPS. It uses -R (remote port forwarding) to bind port 11434 on the VPS loopback to port 11434 on the local machine loopback.

ssh -N \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes \
    -R 11434:localhost:11434 \
    user@vps-ip

The VPS sshd needs GatewayPorts no and AllowTcpForwarding yes. With GatewayPorts no, the forwarded port only binds on the VPS loopback. Outside clients cannot reach it directly. Only the FastAPI process on the VPS loopback can connect to 127.0.0.1:11434. That connection then forwards to Ollama on the local machine.

I run this as a persistent process that reconnects automatically:

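import subprocess
import time

def maintain_tunnel():
    """Keep the reverse tunnel up, restarting ssh whenever it exits."""
    while True:
        proc = subprocess.Popen([
            "ssh", "-N", "-R", "11434:localhost:11434",
            "-o", "ServerAliveInterval=30",
            "-o", "ServerAliveCountMax=3",
            # Exit if the remote port cannot be bound, so the loop retries
            # instead of holding a connection that forwards nothing.
            "-o", "ExitOnForwardFailure=yes",
            # Trust the host key on first connect, refuse if it later changes
            # (needs OpenSSH 7.6+; "no" would skip verification entirely).
            "-o", "StrictHostKeyChecking=accept-new",
            "-i", "/path/to/key",
            "user@vps-ip"
        ])
        proc.wait()
        print(f"Tunnel died (exit {proc.returncode}), reconnecting in 5s")
        time.sleep(5)

if __name__ == "__main__":
    maintain_tunnel()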

The tunnel runs under a system service (Windows Task Scheduler on the local machine, or systemd on Linux). If the local machine reboots, the service restarts the tunnel on login. The VPS FastAPI returns 503 while the tunnel is down. The frontend shows an error message in that case.
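One cheap way to detect a dead tunnel from the VPS side is a TCP connect probe against the forwarded port. This is a sketch, not part of the setup described above; the function name `tunnel_is_up` is hypothetical. It only confirms that something is listening on the port, so it cannot tell whether Ollama itself is healthy at the far end.

```python
import socket


def tunnel_is_up(host: str = "127.0.0.1", port: int = 11434,
                 timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

A probe like this could back a status endpoint that the frontend polls before enabling the button.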

The FastAPI layer on the VPS

The FastAPI server is small. It gets a POST with the raw prompt, sends it to Ollama, returns the parsed sinc JSON, and adds performance headers. Ollama and the model do all the heavy work.

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import httpx, time, json

app = FastAPI()

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "sinc-scatter"
LOG_PATH = "/var/log/scatter-api.jsonl"

@app.post("/scatter")
async def scatter(request: Request):
    body = await request.json()
    prompt = body.get("prompt", "")

    ts = time.time()          # wall-clock timestamp for the log
    t0 = time.perf_counter()  # monotonic clock for measuring elapsed time
    try:
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.post(OLLAMA_URL, json={
                "model": MODEL,
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.3}
            })
            resp.raise_for_status()
    except httpx.HTTPError:
        # Tunnel down, Ollama unreachable, timeout, or an error status.
        return JSONResponse(status_code=503, content={"error": "model offline"})

    elapsed = time.perf_counter() - t0
    result = resp.json()
    text = result.get("response", "")
    tokens = result.get("eval_count", 0)
    tok_per_s = round(tokens / elapsed, 1) if elapsed > 0 else 0

    # Log to JSONL
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({
            "ts": ts, "prompt_len": len(prompt),
            "tokens": tokens, "elapsed": round(elapsed, 3),
            "tok_per_s": tok_per_s
        }) + "\n")

    try:
        sinc = json.loads(text)
    except json.JSONDecodeError:
        # The model produced text that is not valid JSON.
        return JSONResponse(status_code=502, content={"error": "invalid model output"})

    return JSONResponse(
        content={"sinc": sinc},
        headers={
            "X-Tok-Per-S": str(tok_per_s),
            "X-Gen-Time": str(round(elapsed, 3)),
            "X-Tokens": str(tokens)
        }
    )

Response headers carry X-Tok-Per-S, X-Gen-Time, and X-Tokens. The frontend can show these as diagnostic data. The JSONL log on the VPS is the permanent record of every call.

Performance metrics and logging

I log every request to /var/log/scatter-api.jsonl with timestamp, prompt length, token count, time taken, and tokens per second. This gives a real picture of performance, not just a single benchmark number.
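A log in this shape is easy to summarize. The sketch below assumes the field names written by the FastAPI handler (`elapsed`, `tok_per_s`); the function name `summarize_log` is hypothetical.

```python
import json
from statistics import median


def summarize_log(lines):
    """Median latency and throughput from scatter-api JSONL records."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return None
    return {
        "requests": len(records),
        "median_elapsed": median(r["elapsed"] for r in records),
        "median_tok_per_s": median(r["tok_per_s"] for r in records),
    }


# Usage on the VPS (path from the text):
#   with open("/var/log/scatter-api.jsonl") as f:
#       print(summarize_log(f))
```

Medians are used rather than means so that a single slow request, such as one that hits a cold model, does not skew the picture.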

From a sample of test requests, the median response time is about 1.5 seconds at roughly 290 tokens/second on the RTX 5090.

For a task that fills in structured output, 1.5 seconds is fast enough. Users see a loading state. Then the sinc JSON fills all 6 bands at once.

Tradeoffs and when this breaks

This setup has one hard requirement: the local machine must be on and connected. If my RTX 5090 workstation is off, the SSH tunnel goes down. AI Transform then returns 503. I handle this in the frontend. The button becomes inactive with a "model offline" tooltip. The standard client-side Transform still works.

The sinc-scatter model runs from VRAM; it is pre-loaded when Ollama starts. Cold start (loading the 4.7GB GGUF into VRAM) takes about 4 seconds. After that, first-token latency drops to ~50ms. I keep Ollama running as a persistent service so the model is always warm.
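Ollama normally unloads an idle model after a few minutes, so keeping it warm usually takes an explicit step. One option, sketched here as an assumption about how this could be done rather than a description of this setup, is a request with no prompt and `keep_alive` set to -1, which asks Ollama to load the model and keep it resident. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def warmup_payload(model: str = "sinc-scatter") -> dict:
    # No prompt: Ollama loads the model without generating anything.
    # keep_alive=-1 asks it to keep the model loaded indefinitely.
    return {"model": model, "keep_alive": -1}


def warm_model(url: str = OLLAMA_URL) -> bool:
    """Send the warm-up request; True if Ollama answered with HTTP 200."""
    req = urllib.request.Request(
        url,
        data=json.dumps(warmup_payload()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.status == 200
```

Calling `warm_model()` once after the Ollama service starts would pay the roughly 4-second load up front instead of on the first user request.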

There is also a concurrency limit. Ollama's default setup handles one generation at a time. On a site with heavy traffic, this would be a bottleneck. At current traffic levels it is not, because sinc JSON generation is fast enough that queue time stays small. If load grows, the fix is llama.cpp server with --parallel or a batched inference setup. I will address that when needed.
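If queueing ever needs to be explicit on the proxy side, one option is an asyncio semaphore around the model call. This is a sketch of that idea and not something the setup above does; `run_serialized` and the fake generation are hypothetical stand-ins.

```python
import asyncio


async def run_serialized(sem: asyncio.Semaphore, job):
    """Await job() while holding the semaphore, so extra callers queue up."""
    async with sem:
        return await job()


async def demo():
    sem = asyncio.Semaphore(1)  # one generation at a time
    active = 0
    peak = 0

    async def fake_generation():
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for the Ollama call
        active -= 1
        return "ok"

    results = await asyncio.gather(
        *(run_serialized(sem, fake_generation) for _ in range(5))
    )
    return results, peak
```

With the semaphore set to 1, five simultaneous requests run one after another; raising the limit would match a backend that accepts parallel generations.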

The VPS is nearly stateless. It is just a proxy. If the FastAPI process goes down, nginx returns 502 for the AI Transform endpoint. The VPS can be replaced or restarted in under a minute without touching the model. The model weights live on my local machine; the only state on the VPS is the JSONL request log.

What struck me while building this: the setup for zero-cost AI features is surprisingly simple. One capable local GPU, one cheap VPS as the public endpoint, SSH as the bridge. The model runs where the hardware is. The VPS is just the front door. I did not need a cloud inference provider at all.
