How to Add Zero-Cost AI Features to Any Website Using Local Models

March 25, 2025 · 9 min read · tutorial local-llm fine-tuning ollama sinc-llm

Contents

  1. Overview and prerequisites
  2. Step 1: Define the task and output format
  3. Step 2: Generate training data with a teacher model
  4. Step 3: Fine-tune with Unsloth + LoRA
  5. Step 4: Export to GGUF and register with Ollama
  6. Step 5: Set up the SSH reverse tunnel
  7. Step 6: Add the FastAPI proxy on your VPS
  8. Step 7: Wire nginx and connect the frontend
  9. Full cost breakdown
I applied this approach to sinc-LLM prompt decomposition. The pattern is general — it works for any narrow, structured-output AI task you currently pay an API to run.

Overview and prerequisites

The pattern is: identify an AI feature you're currently running via a cloud API, fine-tune a small local model to perform that feature with equal quality, and serve the model from local hardware through an SSH tunnel to your production VPS. The marginal cost per API call drops to zero.

I did this for sinc-LLM prompt decomposition, replacing Claude Haiku API calls at $0.002 each with a local Qwen2.5-7B model. Training took 107 seconds. The model now runs at 290 tokens/second on my RTX 5090 and serves the AI Transform feature on sincllm.com. I wanted to own my inference stack — this is how I got there.

This guide is general — you can apply it to any structured output task: text classification, entity extraction, code formatting, sentiment analysis, schema validation, or anything else where the output space is narrow enough to learn from examples.

Hardware

A local GPU with at least 8GB VRAM. RTX 3080 works. RTX 5090 runs at 290 tok/s. CPU-only is possible but slow.

Software

Python 3.10+, CUDA toolkit, Ollama, SSH access to a VPS running nginx.

API access

An account with any capable LLM API for teacher data generation. I used Anthropic (Haiku); OpenAI works too.

Time budget

~2 hours total: 30min data gen, 5min training, 30min tunnel setup, 30min nginx + frontend.

01

Define the task and output format

Before generating any training data, write down three things:

  1. The invariants. What must always be true about the output, regardless of input? For sinc JSON: always 6 bands, always in order, CONSTRAINTS always longest. Write these down as a validation function before you write a single line of training code.
  2. The schema. Exactly what JSON structure (or text format) should the model output? Be precise — field names, types, nesting. This becomes the system prompt for your teacher model and the validation logic for your fine-tuned model.
  3. The input distribution. What are the shortest, longest, strangest inputs the model will receive in production? Design your training set to cover these, not just the happy path.

Write the validation function first. This is the acceptance criterion for every training example and every inference result. If you can't validate the output programmatically, your invariants aren't precise enough.
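As a concrete sketch, here is what such a validation function might look like for a hypothetical six-field JSON schema. The field names below are illustrative placeholders, not the actual sinc-LLM schema; the three checks mirror the three invariants above (fixed fields, fixed order, one field always longest):

```python
import json

# Hypothetical schema: the model must emit a JSON object with exactly
# these keys, in this order, each mapping to a non-empty string.
REQUIRED_KEYS = ["intent", "entities", "constraints", "format", "tone", "context"]

def validate(output: str) -> bool:
    """Return True iff `output` satisfies every invariant."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    # Invariant 1 + 2: exactly the required keys, in the required order
    if list(obj.keys()) != REQUIRED_KEYS:
        return False
    # Every value must be a non-empty string
    if not all(isinstance(v, str) and v.strip() for v in obj.values()):
        return False
    # Invariant 3: "constraints" is always the longest field
    if max(obj, key=lambda k: len(obj[k])) != "constraints":
        return False
    return True
```

Because this function is pure, you can run it against teacher outputs, held-out evaluations, and production responses with zero changes.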

02

Generate training data with a teacher model

Pick a capable teacher model. I used Claude Haiku because it's fast and cheap — $0.002 per call. GPT-4o-mini works too. The teacher's only job is to demonstrate the task on diverse inputs. Quality matters more than quantity: 120 high-quality examples beat 1,000 sloppy ones. I learned this when an early draft with more but noisier examples actually underperformed.

import anthropic, json, time
from pathlib import Path

client = anthropic.Anthropic(api_key="YOUR_KEY")

SYSTEM = """You are a [YOUR TASK] engine. Given any user input,
produce output in this exact format:
[YOUR SCHEMA HERE]
Output ONLY valid JSON/text. No explanation."""

prompts = [
    # short inputs
    "hello", "fix it", "help",
    # medium inputs
    "analyze the performance issue in our API",
    "write unit tests for the auth module",
    # long inputs
    "We need to migrate our PostgreSQL database from version 12 to 15...",
    # edge cases
    "???", "1234", "",
    # ... 120 total
]

output_path = Path("training_data.jsonl")
with output_path.open("w") as f:
    for i, prompt in enumerate(prompts):
        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=1024,
            system=SYSTEM,
            messages=[{"role": "user", "content": prompt}]
        )
        example = {
            "prompt": prompt,
            "completion": response.content[0].text
        }
        f.write(json.dumps(example) + "\n")
        print(f"[{i+1}/{len(prompts)}] done")
        time.sleep(0.1)  # basic rate limit respect

Validate every generated example immediately. If the teacher output fails your validation function, regenerate or manually fix it. Do not include invalid examples in training data — they teach the student to produce invalid output.
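One way to enforce this is a filter pass over the generated JSONL before training. The sketch below uses a minimal JSON-parse check as a stand-in; swap in your real validation function from Step 1:

```python
import json
from pathlib import Path

def completion_ok(completion: str) -> bool:
    # Stand-in check: replace with your real validation function.
    # Here we only require that the completion parses as JSON.
    try:
        json.loads(completion)
        return True
    except json.JSONDecodeError:
        return False

def filter_examples(src: str, dst: str) -> tuple[int, int]:
    """Copy valid examples from src to dst; return (kept, dropped)."""
    kept = dropped = 0
    with Path(src).open() as fin, Path(dst).open("w") as fout:
        for line in fin:
            ex = json.loads(line)
            if completion_ok(ex["completion"]):
                fout.write(json.dumps(ex) + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Anything dropped here should be regenerated or hand-fixed, then re-run through the filter until the kept count matches your target set size.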

03

Fine-tune with Unsloth + LoRA

Install Unsloth following the official instructions for your CUDA version. The setup takes 5-10 minutes. After that, the training script is under 80 lines.

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# train.py
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import json

# SYSTEM must be the exact system prompt string used in Step 2;
# a mismatch here silently degrades the fine-tune.
SYSTEM = "[YOUR SYSTEM PROMPT FROM STEP 2]"

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                     "gate_proj","up_proj","down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Format training data in ChatML
def format_example(ex):
    return (f"<|im_start|>system\n{SYSTEM}\n<|im_end|>\n"
            f"<|im_start|>user\n{ex['prompt']}\n<|im_end|>\n"
            f"<|im_start|>assistant\n{ex['completion']}\n<|im_end|>")

with open("training_data.jsonl") as f:
    data = [json.loads(line) for line in f]
texts = [{"text": format_example(ex)} for ex in data]

from datasets import Dataset
dataset = Dataset.from_list(texts)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        output_dir="./output",
        fp16=True,
    ),
)
trainer.train()

On my RTX 5090, training on 120 examples for 3 epochs takes about 107 seconds. On an RTX 3080, expect 8-12 minutes. Either way, it's fast enough to iterate on your training set several times in a single afternoon. That iteration speed changes how you think about fine-tuning: it stops feeling like a research project and starts feeling like debugging.

04

Export to GGUF and register with Ollama

# Export to GGUF Q4_K_M
model.save_pretrained_gguf(
    "my-model-gguf",
    tokenizer,
    quantization_method="q4_k_m"
)
# Produces: my-model-gguf/my-model-q4_k_m.gguf (~4.7GB for 7B)

Create a Modelfile and register with Ollama:

FROM ./my-model-q4_k_m.gguf

SYSTEM """
[YOUR SYSTEM PROMPT]
"""

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 2048

Then build the model and smoke-test it:

ollama create my-model -f Modelfile
ollama run my-model "test input"

Test the model against your validation function on all training examples plus 10-20 held-out prompts. If pass rate is below 80%, check your training data format. Most failures at this stage are ChatML formatting issues, not model quality issues.
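This check is easy to automate. The harness below is a sketch: it uses a minimal JSON-parse stand-in for your real validation function from Step 1, calls Ollama's /api/generate endpoint, and asserts the 80% threshold. The held-out prompts are placeholders:

```python
import json
import urllib.request

def validate(output: str) -> bool:
    # Stand-in: replace with your real validation function from Step 1.
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def pass_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that pass validation."""
    if not outputs:
        return 0.0
    return sum(validate(o) for o in outputs) / len(outputs)

def generate(prompt: str, model: str = "my-model") -> str:
    """Call Ollama's /api/generate endpoint (non-streaming)."""
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp).get("response", "")

if __name__ == "__main__":
    held_out = ["hello", "fix the login bug", "???"]  # your held-out prompts
    outputs = [generate(p) for p in held_out]
    rate = pass_rate(outputs)
    print(f"pass rate: {rate:.0%}")
    assert rate >= 0.8, "below 80% pass rate: check your ChatML formatting"
```

Run it after every retrain; a sudden drop in pass rate almost always points at a formatting change in the training data, not the model.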

05

Set up the SSH reverse tunnel

On your local machine, create an SSH key pair and add the public key to your VPS's ~/.ssh/authorized_keys. Then run the tunnel:

ssh -N -R 11434:localhost:11434 user@YOUR_VPS_IP \
    -i ~/.ssh/vps_key \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes \
    -o StrictHostKeyChecking=no

On the VPS, confirm /etc/ssh/sshd_config has:

AllowTcpForwarding yes
GatewayPorts no

Test the tunnel: from the VPS, run curl http://127.0.0.1:11434/api/tags. You should see Ollama's model list from your local machine.

Run the tunnel as a persistent service. On Linux (local machine), create a systemd unit. On Windows, use Task Scheduler or NSSM to wrap the SSH command as a service that starts on boot and restarts on failure.
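On Linux, a minimal systemd unit for the tunnel might look like this. The unit name, user, and key path are placeholders; the ssh flags mirror the command above:

```ini
# /etc/systemd/system/ollama-tunnel.service
[Unit]
Description=SSH reverse tunnel to VPS for Ollama
After=network-online.target
Wants=network-online.target

[Service]
User=youruser
ExecStart=/usr/bin/ssh -N -R 11434:localhost:11434 user@YOUR_VPS_IP \
    -i /home/youruser/.ssh/vps_key \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable --now ollama-tunnel; Restart=always plus ExitOnForwardFailure=yes means a dead forward kills the process and systemd brings the tunnel back up.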

06

Add the FastAPI proxy on your VPS

pip install fastapi uvicorn httpx

# api.py
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import httpx, time, json

app = FastAPI()

@app.post("/ai-feature")
async def ai_feature(request: dict):
    prompt = request.get("prompt", "")
    t0 = time.time()

    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            "http://127.0.0.1:11434/api/generate",
            json={
                "model": "my-model",
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.3}
            }
        )

    elapsed = time.time() - t0
    output = resp.json().get("response", "")

    # The model can emit invalid JSON; return a 502 instead of crashing
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return JSONResponse(
            content={"error": "model returned invalid JSON"},
            status_code=502
        )

    return JSONResponse(
        content={"result": parsed},
        headers={"X-Gen-Time": str(round(elapsed, 3))}
    )

# Run: uvicorn api:app --host 127.0.0.1 --port 8461

Run uvicorn as a systemd service on the VPS so it restarts automatically.

07

Wire nginx and connect the frontend

Add a location block to your nginx config:

location /api/ai-feature {
    proxy_pass http://127.0.0.1:8461/ai-feature;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 30s;
    proxy_connect_timeout 5s;

    # CORS if needed
    add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Methods "POST, OPTIONS";
    add_header Access-Control-Allow-Headers "Content-Type";
}

Test the config and reload:

nginx -t && nginx -s reload

In your frontend, add the button and call:

async function runAIFeature(prompt) {
  const btn = document.getElementById('ai-btn')
  btn.disabled = true
  btn.textContent = 'Running...'

  try {
    const res = await fetch('/api/ai-feature', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({prompt})
    })
    if (!res.ok) throw new Error(`HTTP ${res.status}`)
    const data = await res.json()
    renderResult(data.result)
  } catch (e) {
    // model offline — show fallback or error
    showFallback()
  } finally {
    btn.disabled = false
    btn.textContent = 'AI Feature'
  }
}
Always implement a fallback. When the local machine is off, the SSH tunnel drops and the VPS returns 503. The frontend must handle this gracefully, either by showing an error or by falling back to a non-AI version of the feature. Never leave the user stuck.

Full cost breakdown

Item                                                     Cost
Training data generation (120 examples × $0.002)         $0.24
Unsloth LoRA fine-tuning (electricity, ~107s on 5090)    ~$0.01
GGUF export and Ollama registration                      $0.00
VPS (existing, not additional)                           $0.00
Per-call cost at inference                               $0.00
Total setup cost                                         ~$0.25

The $0.24 data generation cost is the entire investment. At $0.002 per call (Haiku pricing), break-even comes at 120 calls; every call after that is net savings. I hit break-even the first day I deployed.
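The break-even arithmetic, using the numbers from the cost table above, is a single division:

```python
# Numbers from the cost table above
setup_cost = 0.24        # 120 teacher calls × $0.002
per_call_saving = 0.002  # previous Haiku cost per call

break_even_calls = round(setup_cost / per_call_saving)
print(break_even_calls)  # 120
```

Plug in your own API's per-call price to see how quickly the setup cost amortizes for your feature.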

For high-volume features — anything running more than a few hundred calls per month — this architecture pays for itself within days. The only constraint is that your local GPU must stay on and connected. For features that can tolerate a fallback when the model is offline, this is a fully viable production architecture. I run it this way in production right now.

The pattern generalizes completely. I applied it to prompt decomposition, but the same seven steps work for text classification, named entity extraction, code review, sentiment scoring, or any other narrow structured-output task you currently pay an API to run. The key insight: if you can write a validation function for the output, you can distill a model to produce it.

See it running on sincllm.com

AI Transform is the live implementation of this exact architecture. Try it on your own prompts.

Try AI Transform →