The pattern is: identify an AI feature you're currently running via a cloud API, fine-tune a small local model to perform that feature with equal quality, and serve the model from local hardware through an SSH tunnel to your production VPS. The marginal cost per API call drops to zero.
I did this for sinc-LLM prompt decomposition, replacing Claude Haiku API calls at $0.002 each with a local Qwen2.5-7B model. Training took 107 seconds. The model now runs at 290 tokens/second on my RTX 5090 and serves the AI Transform feature on sincllm.com. I wanted to own my inference stack — this is how I got there.
This guide is general — you can apply it to any structured output task: text classification, entity extraction, code formatting, sentiment analysis, schema validation, or anything else where the output space is narrow enough to learn from examples.
- A local GPU with at least 8GB VRAM. RTX 3080 works. RTX 5090 runs at 290 tok/s. CPU-only is possible but slow.
- Python 3.10+, CUDA toolkit, Ollama, SSH access to a VPS running nginx.
- An account with any capable LLM API for teacher data generation. We used Anthropic (Haiku). OpenAI works too.
- ~2 hours total: 30min data gen, 5min training, 30min tunnel setup, 30min nginx + frontend.
Before generating any training data, write down three things: the range of inputs you expect, the exact output schema, and the invariants every output must satisfy.
Write the validation function first. This is the acceptance criterion for every training example and every inference result. If you can't validate the output programmatically, your invariants aren't precise enough.
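As a concrete sketch, here is what such a predicate can look like. The schema is hypothetical (a JSON object with a non-empty `steps` list of strings); swap in your own invariants:

```python
import json

def validate_output(raw: str) -> bool:
    """Return True iff the model output satisfies every invariant.

    Hypothetical schema for illustration: a JSON object whose
    "steps" field is a non-empty list of non-blank strings.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    steps = obj.get("steps")
    if not isinstance(steps, list) or not steps:
        return False
    return all(isinstance(s, str) and s.strip() for s in steps)
```

The same function gates training examples, held-out evaluation, and production responses, so it is worth making it strict.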
Pick a capable teacher model. I used Claude Haiku because it's fast and cheap — $0.002 per call. GPT-4o-mini works too. The teacher's only job is to demonstrate the task on diverse inputs. Quality matters more than quantity: 120 high-quality examples beat 1,000 sloppy ones. I learned this when an early draft with more but noisier examples actually underperformed.
```python
import anthropic, json, time
from pathlib import Path

client = anthropic.Anthropic(api_key="YOUR_KEY")

SYSTEM = """You are a [YOUR TASK] engine. Given any user input,
produce output in this exact format:
[YOUR SCHEMA HERE]
Output ONLY valid JSON/text. No explanation."""

prompts = [
    # short inputs
    "hello", "fix it", "help",
    # medium inputs
    "analyze the performance issue in our API",
    "write unit tests for the auth module",
    # long inputs
    "We need to migrate our PostgreSQL database from version 12 to 15...",
    # edge cases
    "???", "1234", "",
    # ... 120 total
]

output_path = Path("training_data.jsonl")
with output_path.open("w") as f:
    for i, prompt in enumerate(prompts):
        response = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=1024,
            system=SYSTEM,
            messages=[{"role": "user", "content": prompt}],
        )
        example = {
            "prompt": prompt,
            "completion": response.content[0].text,
        }
        f.write(json.dumps(example) + "\n")
        print(f"[{i+1}/{len(prompts)}] done")
        time.sleep(0.1)  # basic rate limit respect
```
Validate every generated example immediately. If the teacher output fails your validation function, regenerate or manually fix it. Do not include invalid examples in training data — they teach the student to produce invalid output.
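One way to enforce this is a filtering pass over the generated JSONL before training. This is a minimal sketch; it assumes you pass in your own validation predicate (`str -> bool`) as described above:

```python
import json

def filter_examples(in_path: str, out_path: str, validate) -> int:
    """Keep only examples whose completion passes `validate`.

    `validate` is your task-specific predicate: str -> bool.
    Rejected prompts are printed so they can be regenerated or
    fixed by hand. Returns the number of examples kept.
    """
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            ex = json.loads(line)
            if validate(ex["completion"]):
                dst.write(json.dumps(ex) + "\n")
                kept += 1
            else:
                print("REJECTED:", ex["prompt"][:60])
    return kept
```

Run it once after data generation and train only on the cleaned file.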
Install Unsloth following the official instructions for your CUDA version. The setup takes 5-10 minutes. After that, the training script is under 80 lines.
```shell
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset
import json

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Format training data in ChatML.
SYSTEM = """..."""  # paste the same system prompt used for data generation

def format_example(ex):
    return (f"<|im_start|>system\n{SYSTEM}\n<|im_end|>\n"
            f"<|im_start|>user\n{ex['prompt']}\n<|im_end|>\n"
            f"<|im_start|>assistant\n{ex['completion']}\n<|im_end|>")

data = [json.loads(line) for line in open("training_data.jsonl")]
dataset = Dataset.from_list([{"text": format_example(ex)} for ex in data])

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        output_dir="./output",
        fp16=True,
    ),
)
trainer.train()
```
On my RTX 5090, 120 examples at 3 epochs takes about 107 seconds. On an RTX 3080, expect 8-12 minutes. Either way, it's fast enough to iterate on your training set multiple times in a single afternoon. That iteration speed changes how you think about fine-tuning — it stops feeling like a research project and starts feeling like debugging.
```python
# Export to GGUF Q4_K_M
model.save_pretrained_gguf(
    "my-model-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)
# Produces: my-model-gguf/my-model-q4_k_m.gguf (~4.7GB for 7B)
```
Create a Modelfile and register with Ollama:
```
FROM ./my-model-q4_k_m.gguf

SYSTEM """
[YOUR SYSTEM PROMPT]
"""

PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 2048
```

```shell
ollama create my-model -f Modelfile
ollama run my-model "test input"
```
Test the model against your validation function on all training examples plus 10-20 held-out prompts. If the pass rate is below 80%, check your training data format: most failures at this stage are ChatML formatting issues, not model quality issues.
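A small harness for this check might look as follows. It assumes the model was registered with Ollama as `my-model` (per the Modelfile step) and that `validate` is your own predicate; the `generate` helper hits Ollama's local HTTP API directly:

```python
import json
import urllib.request

def generate(prompt: str, model: str = "my-model",
             url: str = "http://127.0.0.1:11434/api/generate") -> str:
    """Fetch one non-streaming completion from the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["response"]

def pass_rate(prompts, validate, generate_fn=generate) -> float:
    """Fraction of prompts whose generated output passes `validate`."""
    passed = sum(1 for p in prompts if validate(generate_fn(p)))
    return passed / len(prompts)
```

Feed it the training prompts first, then the held-out set; a gap between the two numbers is the clearest sign of overfitting to the training examples.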
On your local machine, create an SSH key pair and add the public key to your VPS's ~/.ssh/authorized_keys. Then run the tunnel:
```shell
ssh -N -R 11434:localhost:11434 user@YOUR_VPS_IP \
    -i ~/.ssh/vps_key \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes \
    -o StrictHostKeyChecking=no
```
On the VPS, confirm /etc/ssh/sshd_config has:
```
AllowTcpForwarding yes
GatewayPorts no
```
Test the tunnel: from the VPS, run curl http://127.0.0.1:11434/api/tags. You should see Ollama's model list from your local machine.
Run the tunnel as a persistent service. On Linux (local machine), create a systemd unit. On Windows, use Task Scheduler or NSSM to wrap the SSH command as a service that starts on boot and restarts on failure.
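On Linux, a unit along these lines does the job. This is a sketch, not a drop-in file: the unit name, `User=`, key path, and VPS address are placeholders for your own values.

```ini
# /etc/systemd/system/ollama-tunnel.service  (on the local machine)
[Unit]
Description=Reverse SSH tunnel exposing Ollama to the VPS
After=network-online.target
Wants=network-online.target

[Service]
User=youruser
ExecStart=/usr/bin/ssh -N -R 11434:localhost:11434 user@YOUR_VPS_IP \
    -i /home/youruser/.ssh/vps_key \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now ollama-tunnel`; `Restart=always` plus `ExitOnForwardFailure=yes` makes systemd rebuild the tunnel whenever the forward drops.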
```shell
pip install fastapi uvicorn httpx
```
```python
# api.py
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import httpx, time, json

app = FastAPI()

@app.post("/ai-feature")
async def ai_feature(request: dict):
    prompt = request.get("prompt", "")
    t0 = time.time()
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            "http://127.0.0.1:11434/api/generate",
            json={
                "model": "my-model",
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.3},
            },
        )
    elapsed = time.time() - t0
    result = resp.json()
    output = result.get("response", "")
    return JSONResponse(
        # json.loads raises if the model emits invalid JSON; FastAPI
        # turns that into a 500, which the frontend treats as a failure.
        content={"result": json.loads(output)},
        headers={"X-Gen-Time": str(round(elapsed, 3))},
    )

# Run: uvicorn api:app --host 127.0.0.1 --port 8461
```
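If you want the endpoint to tolerate an occasional malformed completion instead of failing the request, one option is a small retry wrapper. This is a synchronous sketch of the idea (adapt it to the async call as needed); `generate_fn` stands in for whatever produces raw model text:

```python
import json

def parse_or_retry(generate_fn, prompt: str, max_attempts: int = 3):
    """Call `generate_fn(prompt)` until the output parses as JSON.

    Raises ValueError after `max_attempts` failures so the caller
    can return an explicit error instead of malformed output.
    """
    last = ""
    for _ in range(max_attempts):
        last = generate_fn(prompt)
        try:
            return json.loads(last)
        except json.JSONDecodeError:
            continue
    raise ValueError(f"model never produced valid JSON: {last[:120]!r}")
```

With low temperature and a well-trained model, retries should be rare; if they are not, that is a signal to revisit the training data rather than mask it here.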
Run uvicorn as a systemd service on the VPS so it restarts automatically.
Add a location block to your nginx config:
```nginx
location /api/ai-feature {
    proxy_pass http://127.0.0.1:8461/ai-feature;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 30s;
    proxy_connect_timeout 5s;

    # CORS if needed
    add_header Access-Control-Allow-Origin *;
    add_header Access-Control-Allow-Methods "POST, OPTIONS";
    add_header Access-Control-Allow-Headers "Content-Type";
}
```

```shell
nginx -t && nginx -s reload
```
In your frontend, add the button and call:
```javascript
async function runAIFeature(prompt) {
  const btn = document.getElementById('ai-btn')
  btn.disabled = true
  btn.textContent = 'Running...'
  try {
    const res = await fetch('/api/ai-feature', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({prompt})
    })
    // fetch does not reject on HTTP errors, so check the status explicitly
    if (!res.ok) throw new Error(`HTTP ${res.status}`)
    const data = await res.json()
    renderResult(data.result)
  } catch (e) {
    // model offline — show fallback or error
    showFallback()
  } finally {
    btn.disabled = false
    btn.textContent = 'AI Feature'
  }
}
```
The ~$0.24 of teacher API calls is the entire investment; after that, every call to the AI feature is free. At Haiku's $0.002 per call, break-even is 120 calls, and I hit that the first day I deployed.
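The arithmetic, spelled out from the figures above:

```python
teacher_calls = 120     # training examples generated
price_per_call = 0.002  # Haiku cost per call, in dollars

data_cost = teacher_calls * price_per_call       # one-time spend, ~$0.24
break_even_calls = data_cost / price_per_call    # ~120 calls to recoup it
```

Everything past that point is pure savings, scaled by however many calls the feature serves.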
For high-volume features — anything running more than a few hundred calls per month — this architecture pays for itself within days. The only constraint is that your local GPU must stay on and connected. For features that can tolerate a fallback when the model is offline, this is a fully viable production architecture. I run it this way in production right now.
The pattern generalizes completely. I applied it to prompt decomposition, but the same seven steps work for text classification, named entity extraction, code review, sentiment scoring, or any other narrow structured-output task you currently pay an API to run. The key insight: if you can write a validation function for the output, you can distill a model to produce it.
AI Transform is the live implementation of this exact architecture. Try it on your own prompts.
Try AI Transform →