I ran 100 coding tasks through both ChatGPT (GPT-4o) and Claude (Sonnet 4), each task with both raw prompts and sinc-LLM structured prompts. The results surprised me — not which model is "better" (that depends on the task) but how differently each model responds to prompt structure.
100 coding tasks across 5 categories: (1) Algorithm implementation (20 tasks), (2) API development (20 tasks), (3) Bug fixing (20 tasks), (4) Code refactoring (20 tasks), (5) Database operations (20 tasks). Each task was run with a raw prompt and a sinc-LLM 6-band structured prompt on both models. Total: 400 runs.
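Each metric in the table below is a binary pass/fail judgment per run, and the percentages are simple pass rates over the 100 runs in each column. A minimal sketch of the tallying (the metric names and run records here are illustrative, not the actual harness):

```python
def pass_rates(runs: list[dict[str, bool]]) -> dict[str, float]:
    """Fraction of runs passing each binary metric.

    Each run is a dict mapping metric name -> pass/fail,
    e.g. {"runs_correctly": True, "includes_tests": False}.
    """
    metrics = {m for run in runs for m in run}
    return {
        m: sum(run.get(m, False) for run in runs) / len(runs)
        for m in sorted(metrics)
    }
```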
| Metric | GPT-4o Raw | GPT-4o Structured | Claude Raw | Claude Structured |
|---|---|---|---|---|
| Code runs correctly | 47% | 78% | 54% | 86% |
| Type-safe / no lint errors | 38% | 85% | 62% | 91% |
| Follows constraints | 22% | 81% | 35% | 93% |
| Has error handling | 31% | 88% | 44% | 94% |
| Includes tests | 8% | 72% | 12% | 82% |
Claude's gains from structure are similar in size to GPT-4o's: code correctness jumps 32 percentage points for Claude (54% → 86%) versus 31 for GPT-4o (47% → 78%), and constraint following jumps 58 points for Claude (35% → 93%) versus 59 for GPT-4o (22% → 81%). The real difference is not the size of the jump but where each model lands.
Both models improve dramatically with structure, but Claude achieves higher absolute scores on structured prompts. This is likely because Claude's training emphasizes following detailed instructions — the more detailed your instructions, the more Claude has to work with.
On raw prompts, GPT-4o and Claude are closer in performance than you might expect. But GPT-4o is better at inferring what you probably want from a vague prompt. It seems to have stronger priors for common coding patterns, which means it produces something useful more often from underspecified input.
The irony: this "better guessing" disappears when you use structured prompts. When all 6 bands are specified, there is nothing left to guess, and Claude's superior constraint-following takes over.
Both models produce acceptable code from raw prompts about 50% of the time. The gap opens when you add constraints. "Must handle null inputs," "Must use async/await," "Must be compatible with Python 3.11+," "Maximum 50 lines" — these constraints dramatically change the output, and Claude follows them more reliably than GPT-4o.
This aligns with sinc-LLM's measurement that CONSTRAINTS carries 42.7% of reconstruction quality. Both models confirm it empirically.
GPT-4o edges ahead on algorithms with raw prompts (52% vs 48% correctness). With structured prompts, Claude takes the lead (90% vs 82%). Claude's structured algorithm implementations tend to include edge case handling that GPT-4o omits unless explicitly constrained.
Claude dominates API development in structured mode (92% vs 76%). Claude produces more complete API code — authentication, error handling, input validation, and OpenAPI documentation are more often included. GPT-4o tends to produce the happy path only.
Both models perform similarly on bug fixing with structured prompts (84% vs 82%). The PERSONA band is critical here — specifying "senior developer with debugging expertise, 10 years experience with production systems" produces dramatically better bug analysis than "fix this bug."
Claude excels at refactoring with structured prompts (88% vs 72%). Claude is more conservative with refactoring — it preserves behavior while improving structure. GPT-4o sometimes introduces subtle behavior changes during refactoring, especially with complex conditional logic.
GPT-4o slightly leads on database tasks with structured prompts (80% vs 78%). Both models benefit enormously from the DATA band — providing the exact schema and sample data eliminates most database-related hallucinations.
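For example, a DATA band for a database task can carry the schema and a sample row inline, in the same fragment shape used below. The schema here is invented for illustration:

```python
# Hypothetical DATA-band fragment for a database task; the schema is invented.
data_band = {
    "n": 2,
    "t": "DATA",
    "x": (
        "Schema: users(id INTEGER PRIMARY KEY, email TEXT UNIQUE, "
        "created_at TIMESTAMP); "
        "orders(id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id), "
        "total NUMERIC); "
        "Sample row: users(1, 'a@example.com', '2024-01-01T00:00:00Z')"
    ),
}
```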
For coding tasks with sinc-LLM structured prompts:
```json
{
  "formula": "x(t) = Σ x(nT) · sinc((t - nT) / T)",
  "T": "specification-axis",
  "fragments": [
    {"n": 0, "t": "PERSONA", "x": "Expert data scientist with 10 years ML experience"},
    {"n": 1, "t": "CONTEXT", "x": "Building a recommendation engine for an e-commerce platform"},
    {"n": 2, "t": "DATA", "x": "Dataset: 2M user interactions, 50K products, sparse matrix"},
    {"n": 3, "t": "CONSTRAINTS", "x": "Must use collaborative filtering. Latency under 100ms. No PII in logs. Python 3.11+. Must handle cold-start users with content-based fallback"},
    {"n": 4, "t": "FORMAT", "x": "Python module with type hints, docstrings, and pytest tests"},
    {"n": 5, "t": "TASK", "x": "Implement the recommendation engine with train/predict/evaluate methods"}
  ]
}
```
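Read literally, "reconstruction" here just means emitting the bands in order of `n`. A minimal sketch of that assembly (the labeled-section format is my assumption, not sinc-LLM's actual output format):

```python
def reconstruct_prompt(fragments: list[dict]) -> str:
    """Join sinc-LLM-style fragments into one prompt, ordered by n.

    Each fragment has a sample index "n", a band name "t", and
    content "x", matching the JSON structure above.
    """
    return "\n\n".join(
        f'## {frag["t"]}\n{frag["x"]}'
        for frag in sorted(fragments, key=lambda f: f["n"])
    )
```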
The model matters less than the prompt structure. A structured prompt on either model beats a raw prompt on the other. Start structuring at sincllm.com.