Most tools cost money to run. This one earns money. I built a small Python server. It catches every prompt I send to Claude. It rewrites the prompt right away. Then it sends the new version to Claude before Claude reads the original. The server costs $0.002 each time it runs. Each run saves me about $0.08 in wasted replies. In one week it saved me $1,588.56. It cost $42.39 to run. That is about 37.5 times the cost.
Here is how I built it.
The interceptor is a Python FastAPI server. It runs on your own machine. Claude Code uses a PreToolUse hook. That hook fires every time you send a message. The hook sends the raw prompt text to localhost:8461/scatter via HTTP POST. The server reads the prompt and returns structured sinc JSON. That JSON goes into the system message before Claude reads your prompt.
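```python
# Simplified core of scatter_server.py
import json

import anthropic
from fastapi import FastAPI

app = FastAPI()
# Async client so the Haiku call does not block FastAPI's event loop
client = anthropic.AsyncAnthropic()

SCATTER_SYSTEM = """
Decompose the user's prompt into sinc JSON with 6 fragments:
n=0 PERSONA, n=1 CONTEXT, n=2 DATA, n=3 CONSTRAINTS (longest),
n=4 FORMAT, n=5 TASK. Return ONLY valid JSON.
"""

def is_sinc_json(text: str) -> bool:
    # Placeholder (assumption): the real helper is not shown in this excerpt.
    # The "fragments" key name is illustrative, not confirmed.
    try:
        data = json.loads(text)
    except ValueError:
        return False
    return isinstance(data, dict) and "fragments" in data

@app.post("/scatter")
async def scatter(payload: dict):
    prompt = payload.get("prompt", "")
    # Pass-through if already sinc JSON
    if is_sinc_json(prompt):
        return {"scattered": prompt, "passthrough": True}
    response = await client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=SCATTER_SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )
    # json.loads raises if the model returns anything other than valid JSON
    sinc_json = json.loads(response.content[0].text)
    return {"scattered": json.dumps(sinc_json), "passthrough": False}
```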
The structured output follows the sinc-LLM format. There are six fragments. Each fragment has an index (n), a type (t), and the content (x). The formula is stored as a header field. It is the math foundation of the whole system.
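To make that structure concrete, here is a rough sketch that builds one sinc JSON document and checks its shape. The field names `n`, `t`, and `x` and the six fragment types come from this article. The top-level keys `formula` and `fragments`, the sample text, and the `validate_sinc` helper are assumptions made for the example. The formula string is left as a placeholder because it is not given here.

```python
import json

# Fragment types in index order, as listed in the scatter system prompt.
FRAGMENT_TYPES = ["PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK"]


def validate_sinc(doc: dict) -> bool:
    """Return True if doc has six fragments with n, t, x in the expected order."""
    fragments = doc.get("fragments")  # "fragments" key is an assumption
    if not isinstance(fragments, list) or len(fragments) != len(FRAGMENT_TYPES):
        return False
    for index, (fragment, expected_type) in enumerate(zip(fragments, FRAGMENT_TYPES)):
        if not isinstance(fragment, dict):
            return False
        if fragment.get("n") != index or fragment.get("t") != expected_type:
            return False
        if not isinstance(fragment.get("x"), str):
            return False
    return True


# Hypothetical sample document; the text in each "x" is invented for illustration.
example = {
    "formula": "<formula string not shown in this article>",  # placeholder
    "fragments": [
        {"n": 0, "t": "PERSONA", "x": "Senior Python reviewer."},
        {"n": 1, "t": "CONTEXT", "x": "A FastAPI service that rewrites prompts."},
        {"n": 2, "t": "DATA", "x": "The handler is in scatter_server.py."},
        {"n": 3, "t": "CONSTRAINTS", "x": "Do not change the public route. Do not add "
                                          "new dependencies. Keep the handler async. "
                                          "Do not touch unrelated files."},
        {"n": 4, "t": "FORMAT", "x": "A unified diff."},
        {"n": 5, "t": "TASK", "x": "Add input validation to the handler."},
    ],
}

print(json.dumps(example, indent=2))
print(validate_sinc(example))
```

In this sample the CONSTRAINTS fragment is deliberately the longest, matching the instruction in the system prompt.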
Each fragment tells the model one thing it needs to give a good first reply. CONSTRAINTS (n=3) gets the most content. It carries 42.7% of the quality weight. When the model knows what it cannot do, it stops wasting a reply doing the wrong thing. You do not have to ask twice.
My first version blocked every prompt while it waited for the Haiku API to reply. If the API was slow or down, everything stopped. I watched Claude Code freeze while the scatter call timed out.
The fix had two parts. First, I added a 25-second timeout on the scatter call. Second, I added a fast fallback that runs if the API fails. The fallback uses templates and keyword guessing. It is not as good as Haiku, but it never blocks your prompt. The prompt always goes through.
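The two-part fix can be sketched roughly like this. The 25-second timeout is the value from this article. The keyword table, the template text, the `fragments` key, and the function names are hypothetical, since the real templates are not shown here. `post` stands for any callable that performs the HTTP POST and raises `OSError` on a timeout or connection failure.

```python
import json

# Hypothetical keyword table: the article does not list its real templates.
KEYWORD_PERSONAS = {
    "bug": "Debugging engineer",
    "test": "Test engineer",
    "refactor": "Refactoring specialist",
}
DEFAULT_PERSONA = "General software assistant"
FRAGMENT_TYPES = ["PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK"]


def template_fallback(prompt: str) -> dict:
    """Build six fragments from fixed templates plus a keyword guess."""
    lowered = prompt.lower()
    persona = next(
        (name for keyword, name in KEYWORD_PERSONAS.items() if keyword in lowered),
        DEFAULT_PERSONA,
    )
    texts = [
        persona,
        "No extra context available (fallback path).",
        "No extra data available (fallback path).",
        "Answer only what was asked. Do not invent files, APIs, or results.",
        "Plain text, with code blocks where needed.",
        prompt,  # the original prompt is kept verbatim as the TASK
    ]
    fragments = [
        {"n": n, "t": t, "x": x}
        for n, (t, x) in enumerate(zip(FRAGMENT_TYPES, texts))
    ]
    return {"fragments": fragments}


def scatter_with_fallback(prompt, post, timeout=25.0):
    """Try the scatter call; on timeout or failure, use the template fallback."""
    try:
        return post(prompt, timeout)
    except (OSError, ValueError):
        # OSError covers timeouts and connection errors; ValueError covers bad JSON.
        return {
            "scattered": json.dumps(template_fallback(prompt)),
            "passthrough": False,
            "fallback": True,
        }


def unreachable(prompt, timeout):
    """Stand-in for a scatter call that times out."""
    raise TimeoutError("simulated slow API")


result = scatter_with_fallback("Fix the bug in parser.py", unreachable)
print(result["fallback"])
```

The point of the design is that every failure path still returns a usable result, so the prompt is never held up by the scatter step.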
I also added pass-through rules. If the prompt starts with @agent (going to a sub-agent), or /slash (a slash command), or it is already valid sinc JSON, the server skips scatter. There is no point rewriting something that is already structured or already routed.
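Those rules fit in one small function. This is a sketch under a few assumptions: `@agent` is read as "any prompt that starts with `@`", `/slash` as "any prompt that starts with `/`", and sinc JSON is detected by a `fragments` key, which is an illustrative name. The function names are hypothetical.

```python
import json


def is_sinc_json(text: str) -> bool:
    """Assumed check: text parses to a JSON object with a "fragments" key."""
    try:
        data = json.loads(text)
    except ValueError:
        return False
    return isinstance(data, dict) and "fragments" in data


def should_pass_through(prompt: str) -> bool:
    """Return True when the prompt should skip the scatter step."""
    stripped = prompt.lstrip()
    if stripped.startswith("@"):  # routed to a sub-agent
        return True
    if stripped.startswith("/"):  # slash command
        return True
    return is_sinc_json(stripped)  # already structured


for sample in ["@reviewer check this", "/compact", "Fix the login bug"]:
    print(sample, "->", should_pass_through(sample))
```

One caveat with the prefix test: a prompt that happens to begin with a file path such as `/usr/local/...` would also skip scatter, so a stricter match on known command names may be worth it.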
In Claude Code, hooks are set up in settings.json. Here is the hook entry that connects the scatter server to every user message:
```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": ".*",
        "hooks": [{
          "type": "command",
          "command": "py -X utf8 C:/Users/Mario/scatter_hook.py"
        }]
      }
    ]
  }
}
```
The scatter_hook.py script reads the prompt from stdin. It POSTs to localhost:8461/scatter. Then it writes the sinc JSON to the system message injection point. The whole round trip adds 300-800ms of wait time per prompt. That is the Haiku API call time. For me that wait is fine given the return on investment.
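A minimal sketch of what such a hook script could look like, using only the standard library. It is not the actual scatter_hook.py. It assumes the hook input arrives on stdin either as a JSON object with a `prompt` field or as raw text, and that whatever the script writes to stdout is what gets injected. The function names are hypothetical; the URL and the 25-second timeout are the values from this article.

```python
import json
import sys
import urllib.request

SCATTER_URL = "http://localhost:8461/scatter"  # endpoint named in the article


def extract_prompt(stdin_text: str) -> str:
    """Pull the prompt out of the hook input (JSON with "prompt", else raw text)."""
    try:
        data = json.loads(stdin_text)
    except ValueError:
        return stdin_text.strip()
    if isinstance(data, dict) and isinstance(data.get("prompt"), str):
        return data["prompt"]
    return stdin_text.strip()


def post_scatter(prompt: str, timeout: float = 25.0) -> dict:
    """POST the prompt to the local scatter server and decode the JSON reply."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    request = urllib.request.Request(
        SCATTER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_hook(stdin_text: str, post=post_scatter) -> str:
    """Return the text the hook should emit for a given stdin payload."""
    prompt = extract_prompt(stdin_text)
    result = post(prompt)
    return result["scattered"]


def main() -> None:
    # Read everything from stdin, scatter it, and write the result to stdout.
    sys.stdout.write(run_hook(sys.stdin.read()))
```

In a real script, `main()` would be called under the usual `if __name__ == "__main__":` guard. Passing `post` as a parameter makes it possible to exercise `run_hook` without the server running.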
21,194 prompts intercepted. The exchange rate dropped from 4.2 to 1.6. I saved $1,588.56. The screenshot in my other article shows the real numbers from my dashboard. This is not hypothetical. It worked exactly as designed. It keeps working every day I use it.
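The figures in this article can be cross-checked with a few lines of arithmetic. The inputs below are the numbers reported here; the variable names are only for illustration.

```python
# Figures reported in the article.
prompts = 21_194
cost_per_run = 0.002        # dollars per scatter call
weekly_savings = 1588.56    # dollars saved in one week
weekly_cost = 42.39         # dollars spent in one week

estimated_cost = prompts * cost_per_run          # about 42.39, matches the reported cost
savings_per_prompt = weekly_savings / prompts    # about 0.075 dollars per prompt
return_multiple = weekly_savings / weekly_cost   # about 37.5 times the cost
net_savings = weekly_savings - weekly_cost       # about 1546.17 dollars

print(f"estimated cost:     ${estimated_cost:.2f}")
print(f"savings per prompt: ${savings_per_prompt:.3f}")
print(f"return multiple:    {return_multiple:.1f}x")
print(f"net savings:        ${net_savings:.2f}")
```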
The code is open source on GitHub. Leave a comment and I will drop the link. I want to make sure people find it useful before I share it widely.