I ran 100 coding tasks through both ChatGPT (GPT-4o) and Claude (Sonnet 4). Each task used a raw prompt and a sinc-LLM structured prompt. The results surprised me. Not which model is "better" (that depends on the task), but how differently each model reacts to structure.
100 coding tasks across 5 categories: (1) Algorithm implementation (20 tasks), (2) API development (20 tasks), (3) Bug fixing (20 tasks), (4) Code refactoring (20 tasks), (5) Database operations (20 tasks). Each task ran with a raw prompt and a sinc-LLM 6-band structured prompt on both models. Total: 400 runs.
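The run count follows directly from the design, and a short sketch makes the arithmetic explicit. The category names and task counts come from the text above; the function name `build_run_matrix` and the tuple layout are assumptions for illustration, not the harness actually used.

```python
from itertools import product

# Category names and task counts are taken from the description above.
CATEGORIES = {
    "Algorithm implementation": 20,
    "API development": 20,
    "Bug fixing": 20,
    "Code refactoring": 20,
    "Database operations": 20,
}
MODELS = ["GPT-4o", "Claude Sonnet 4"]
PROMPT_STYLES = ["raw", "structured"]


def build_run_matrix():
    """Return one (category, task_index, model, prompt_style) tuple per run."""
    # Hypothetical task identifiers: a category plus an index within it.
    tasks = [
        (category, index)
        for category, count in CATEGORIES.items()
        for index in range(count)
    ]
    return [
        (category, index, model, style)
        for (category, index), model, style in product(tasks, MODELS, PROMPT_STYLES)
    ]


runs = build_run_matrix()
print(len(runs))  # 400: 100 tasks x 2 models x 2 prompt styles
```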
| Metric | GPT-4o Raw | GPT-4o Structured | Claude Raw | Claude Structured |
|---|---|---|---|---|
| Code runs correctly | 47% | 78% | 54% | 86% |
| Type-safe / no lint errors | 38% | 85% | 62% | 91% |
| Follows constraints | 22% | 81% | 35% | 93% |
| Has error handling | 31% | 88% | 44% | 94% |
| Includes tests | 8% | 72% | 12% | 82% |
The size of the jump from raw to structured prompts is nearly the same for both models. Code correctness rises 32 percentage points for Claude (54% to 86%) versus 31 for GPT-4o (47% to 78%). Constraint following rises even more: 58 points for Claude (35% to 93%) versus 59 for GPT-4o (22% to 81%).
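The percentage-point gains can be recomputed from the results table. The sketch below copies the table values as written; the names `RESULTS` and `gains` are only illustrative.

```python
# Percentages copied from the results table above, in this order:
# (GPT-4o raw, GPT-4o structured, Claude raw, Claude structured)
RESULTS = {
    "Code runs correctly": (47, 78, 54, 86),
    "Type-safe / no lint errors": (38, 85, 62, 91),
    "Follows constraints": (22, 81, 35, 93),
    "Has error handling": (31, 88, 44, 94),
    "Includes tests": (8, 72, 12, 82),
}


def gains(results):
    """Map each metric to (GPT-4o gain, Claude gain) in percentage points."""
    return {
        metric: (gpt_structured - gpt_raw, claude_structured - claude_raw)
        for metric, (gpt_raw, gpt_structured, claude_raw, claude_structured)
        in results.items()
    }


for metric, (gpt_gain, claude_gain) in gains(RESULTS).items():
    print(f"{metric}: GPT-4o +{gpt_gain}, Claude +{claude_gain}")
```

Running it prints the gain for all five metrics, not only the two quoted in the text, which makes it easier to compare where each model moves most.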
Both models improve a lot with structure. But Claude reaches higher scores on structured prompts. This is probably because Claude is trained to follow detailed instructions closely. The more detail you give, the better Claude does.
On raw prompts, GPT-4o and Claude are closer on code correctness than you might think (47% vs 54%). GPT-4o also seems better at guessing what you want from a vague prompt: on algorithm tasks it edges ahead of Claude with raw prompts (52% vs 48%). It has a stronger sense of common coding patterns, so it gives you something useful more often, even when the prompt is unclear.
The irony is that this better guessing disappears once you use structured prompts. When all 6 bands are filled in, there is nothing left to guess. At that point, Claude's stronger constraint following takes over.
Both models give acceptable code from raw prompts about 50% of the time. The gap opens when you add constraints. "Must handle null inputs," "Must use async/await," "Must be compatible with Python 3.11+," "Maximum 50 lines": these constraints change the output a lot. Claude follows them more reliably than GPT-4o.
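Constraints such as "Maximum 50 lines" and "Must use async/await" can be checked mechanically. The article does not say how constraint adherence was scored, so the following is only a minimal sketch of one possible check, assuming the generated code is Python. The function `check_constraints` and its result keys are hypothetical names.

```python
import ast


def check_constraints(source: str, max_lines: int = 50) -> dict[str, bool]:
    """Check generated Python source against two example constraints."""
    tree = ast.parse(source)
    # Count only non-blank lines toward the length limit (an assumption).
    non_blank = [line for line in source.splitlines() if line.strip()]
    nodes = list(ast.walk(tree))
    has_async_def = any(isinstance(node, ast.AsyncFunctionDef) for node in nodes)
    has_await = any(isinstance(node, ast.Await) for node in nodes)
    return {
        "max_lines": len(non_blank) <= max_lines,
        "uses_async_await": has_async_def and has_await,
    }


# A small stand-in for model output, written by hand for this example.
sample = '''
import asyncio

async def fetch(delay: float) -> str:
    await asyncio.sleep(delay)
    return "done"
'''

report = check_constraints(sample)
print(report)
```

A check like this covers only constraints that are visible in the syntax. Something like "Must handle null inputs" would need tests that actually run the code.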
This is consistent with sinc-LLM's measurement that the CONSTRAINTS band carries 42.7% of reconstruction quality. Constraint adherence rises by 58 to 59 points on both models once the constraints are written out.
GPT-4o edges ahead on algorithms with raw prompts (52% vs 48% correctness). With structured prompts, Claude takes the lead (90% vs 82%). Claude's structured code tends to handle edge cases that GPT-4o skips unless you spell them out.
Claude leads on API development in structured mode (92% vs 76%). Claude produces more complete API code. Authentication, error handling, input validation, and OpenAPI documentation are more often included. GPT-4o tends to produce only the happy path.
Both models score similarly on bug fixing with structured prompts (84% vs 82%). The PERSONA band matters most here. Saying "senior developer with debugging expertise, 10 years experience with production systems" gives you much better bug analysis than just saying "fix this bug."
Claude leads on refactoring with structured prompts (88% vs 72%). Claude is careful. It keeps the behavior the same while improving the structure. GPT-4o sometimes changes behavior slightly during refactoring, especially with complex conditional logic.
GPT-4o leads slightly on database tasks with structured prompts (80% vs 78%). Both models improve a lot from the DATA band. Providing the exact schema and sample data removes most database-related mistakes.
For coding tasks with sinc-LLM structured prompts, here is the guide:
```json
{
  "formula": "x(t) = \u03a3 x(nT) \u00b7 sinc((t - nT) / T)",
  "T": "specification-axis",
  "fragments": [
    {"n": 0, "t": "PERSONA", "x": "Expert data scientist with 10 years ML experience"},
    {"n": 1, "t": "CONTEXT", "x": "Building a recommendation engine for an e-commerce platform"},
    {"n": 2, "t": "DATA", "x": "Dataset: 2M user interactions, 50K products, sparse matrix"},
    {"n": 3, "t": "CONSTRAINTS", "x": "Must use collaborative filtering. Latency under 100ms. No PII in logs. Python 3.11+. Must handle cold-start users with content-based fallback"},
    {"n": 4, "t": "FORMAT", "x": "Python module with type hints, docstrings, and pytest tests"},
    {"n": 5, "t": "TASK", "x": "Implement the recommendation engine with train/predict/evaluate methods"}
  ]
}
```
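A prompt in this layout can be assembled programmatically. The sketch below is one way to do it, assuming the six band names and their order from the JSON above. The helper `build_structured_prompt` is a hypothetical name, not part of any sinc-LLM API, and the example band values are made up for illustration.

```python
import json

# Band names and their order are taken from the JSON example above.
BANDS = ["PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK"]


def build_structured_prompt(bands: dict[str, str]) -> str:
    """Serialize six band values into the fragment layout shown above."""
    # Refuse to build a prompt with an empty or missing band.
    missing = [name for name in BANDS if not bands.get(name, "").strip()]
    if missing:
        raise ValueError(f"Missing bands: {', '.join(missing)}")
    payload = {
        "formula": "x(t) = \u03a3 x(nT) \u00b7 sinc((t - nT) / T)",
        "T": "specification-axis",
        "fragments": [
            {"n": n, "t": name, "x": bands[name].strip()}
            for n, name in enumerate(BANDS)
        ],
    }
    return json.dumps(payload, indent=2)


# Illustrative values only; replace them with the details of your own task.
prompt = build_structured_prompt({
    "PERSONA": "Senior developer with debugging expertise",
    "CONTEXT": "Order service intermittently returns duplicate invoices",
    "DATA": "Stack trace and the invoice table schema",
    "CONSTRAINTS": "Must handle null inputs. Python 3.11+. Maximum 50 lines",
    "FORMAT": "Python function with type hints and pytest tests",
    "TASK": "Find and fix the bug that creates duplicate invoices",
})
print(prompt)
```

Raising an error on a missing band is a deliberate choice here: an empty band is exactly the gap the model would otherwise have to guess at.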
The model matters less than the prompt structure. A structured prompt on either model beats a raw prompt on the other. Start structuring at sincllm.com.