I tested 7 AI models on 60 writing tasks. The tasks covered blog posts, marketing emails, technical documentation, creative fiction, research reports, and social media content. I scored each output on 8 dimensions. The results are not what the marketing materials say. The "best" AI for writing depends on the writing type. It also depends, more than anything, on how you structure your prompt.
Each model got the same 60 writing tasks. I ran each task twice: once as a plain prompt and once as a sinc-LLM 6-band structured prompt. I scored every output on 8 dimensions, each from 0 to 10: accuracy, tone compliance, length compliance, style consistency, originality, readability, format adherence, and constraint following.
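The article does not say how the eight dimension scores roll up into one number per model. As a minimal sketch, assuming an unweighted mean (my assumption, not a documented method), the aggregation could look like this. The dimension keys and the example scores are made up for illustration.

```python
from statistics import mean

# The 8 scoring dimensions named above, as hypothetical dictionary keys.
DIMENSIONS = (
    "accuracy", "tone_compliance", "length_compliance", "style_consistency",
    "originality", "readability", "format_adherence", "constraint_following",
)

def overall_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the 8 dimensions (an assumed aggregation rule)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= scores[d] <= 10:
            raise ValueError(f"{d} must be between 0 and 10")
    return round(mean(scores[d] for d in DIMENSIONS), 1)

# Illustrative scores for a single output; not taken from the test data.
example_scores = {
    "accuracy": 8.0, "tone_compliance": 9.0, "length_compliance": 7.0,
    "style_consistency": 9.0, "originality": 6.0, "readability": 8.0,
    "format_adherence": 9.0, "constraint_following": 8.0,
}
print(overall_score(example_scores))
```

A weighted mean would be the obvious alternative if some dimensions matter more for a given writing type.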
| Rank | Model | Structured score (0-10) | Raw score (0-10) | Improvement |
|---|---|---|---|---|
| 1 | Claude Sonnet 4 | 8.7 | 6.2 | +40% |
| 2 | GPT-4o | 8.2 | 6.5 | +26% |
| 3 | Gemini 2.5 Pro | 7.9 | 5.8 | +36% |
| 4 | Llama 3.1 405B | 7.4 | 5.1 | +45% |
| 5 | Mistral Large | 7.2 | 5.4 | +33% |
| 6 | GPT-4o mini | 6.8 | 5.0 | +36% |
| 7 | Claude Haiku 3.5 | 6.5 | 4.7 | +38% |
Claude Sonnet 4 scored highest on three key writing dimensions: tone compliance (9.2), style consistency (9.0), and constraint following (9.4). When the PERSONA band says "conversational but authoritative, first person, no jargon," Claude delivers exactly that. GPT-4o tends to drift toward its own recognizable style. Gemini sometimes ignores style constraints entirely.
The CONSTRAINTS band accounts for 42.7% of output quality in sinc-LLM measurements, and this is where Claude's edge shows most. Tell Claude "no introductory paragraph, no concluding summary, maximum 800 words, no rhetorical questions" and it follows every rule. GPT-4o follows most constraints but usually adds a short introduction anyway.
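Some of these constraints can be checked mechanically. The sketch below is a hypothetical checker, not the scoring method used in this test. It covers only the word limit and the "no rhetorical questions" rule, and it approximates the second one by flagging any question mark at all. Checks for an introduction or a concluding summary would need human or model judgment.

```python
def check_constraints(text: str, max_words: int = 800) -> dict[str, bool]:
    """Return pass/fail for the two mechanically checkable constraints."""
    word_count = len(text.split())
    return {
        "max_words": word_count <= max_words,
        # Crude proxy: any question mark counts as a rhetorical question.
        "no_questions": "?" not in text,
    }

def constraint_score(checks: dict[str, bool]) -> float:
    """Share of passed checks, mapped onto the 0-10 scale used in this article."""
    return 10 * sum(checks.values()) / len(checks)

draft = "Structured prompts work. Want proof? Read on."
clean = "Structured prompts work. Here is the proof."

print(check_constraints(draft), constraint_score(check_constraints(draft)))
print(check_constraints(clean), constraint_score(check_constraints(clean)))
```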
Winner for blog posts: Claude Sonnet 4 (8.9/10). Claude writes blog posts that read like a human expert wrote them. The PERSONA band matters a lot here. "Experienced SaaS marketer who writes in first person with specific examples" gives you very different output than just "write a blog post."
Winner for marketing emails: GPT-4o (8.5/10). GPT-4o wins because it naturally writes short, punchy, action-driven copy. Claude's writing can be too careful for marketing. It adds caveats where a marketer would just say the thing.
Winner for technical documentation: Claude Sonnet 4 (9.1/10). Claude's precision and rule-following make it well suited to documentation. When the FORMAT band says "API reference with parameters table, example request/response, error codes," Claude produces docs you could ship right away.
Winner for creative fiction: Claude Sonnet 4 (8.8/10). Claude's fiction is more varied and less predictable. GPT-4o's fiction follows a recognizable pattern. Claude's is harder to spot as AI-generated.
Winner for research reports: Gemini 2.5 Pro (8.4/10). Gemini's large context window and web access give it an edge for research writing. The DATA band is key here. Give it your research sources and Gemini combines them better than the other models.
Winner for social media content: GPT-4o (8.3/10). GPT-4o writes social media content that feels natural on each platform. You must name the platform in the FORMAT band. "Twitter/X thread, 5 posts, each under 280 characters" gives very different output than "LinkedIn post, 200-300 words, professional tone."
Across all 7 models and all 60 tasks, structured prompts beat raw prompts by 36% on average. The largest gain went to Llama 3.1 405B, which improved 45% with structure. The top-ranked Claude Sonnet 4 improved 40%, so the benefit is not limited to weaker models. Because every model gains, a cheap model with a structured prompt closes much of the gap with an expensive model running a raw prompt.
A $0.001/query model with a structured prompt can beat a $0.015/query model with a raw prompt. Prompt structure is the cheapest upgrade you can make.
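The improvement column and the 36% average can be reproduced from the ranking table. The sketch below copies the scores from the table and assumes improvement is the relative gain of the structured score over the raw score, which matches every published figure after rounding.

```python
from statistics import mean

# (model, structured score, raw score), copied from the ranking table.
RESULTS = [
    ("Claude Sonnet 4", 8.7, 6.2),
    ("GPT-4o", 8.2, 6.5),
    ("Gemini 2.5 Pro", 7.9, 5.8),
    ("Llama 3.1 405B", 7.4, 5.1),
    ("Mistral Large", 7.2, 5.4),
    ("GPT-4o mini", 6.8, 5.0),
    ("Claude Haiku 3.5", 6.5, 4.7),
]

def improvement_pct(structured: float, raw: float) -> float:
    """Relative gain of the structured score over the raw score, in percent."""
    return (structured - raw) / raw * 100

improvements = {model: improvement_pct(s, r) for model, s, r in RESULTS}
average_improvement = mean(improvements.values())

for model, pct in improvements.items():
    print(f"{model}: +{round(pct)}%")
print(f"Average: +{round(average_improvement)}%")

# The lowest-ranked model with a structured prompt (6.5) still scores above
# the top-ranked model with a raw prompt (6.2).
lowest_structured = RESULTS[-1][1]
top_raw = RESULTS[0][2]
print(lowest_structured > top_raw)
```

The last comparison is the table's version of the cost argument: structure alone moves the weakest model in the test past the strongest model's raw result.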
```json
{
  "formula": "x(t) = \u03a3 x(nT) \u00b7 sinc((t - nT) / T)",
  "T": "specification-axis",
  "fragments": [
    {"n": 0, "t": "PERSONA", "x": "Expert data scientist with 10 years ML experience"},
    {"n": 1, "t": "CONTEXT", "x": "Building a recommendation engine for an e-commerce platform"},
    {"n": 2, "t": "DATA", "x": "Dataset: 2M user interactions, 50K products, sparse matrix"},
    {"n": 3, "t": "CONSTRAINTS", "x": "Must use collaborative filtering. Latency under 100ms. No PII in logs. Python 3.11+. Must handle cold-start users with content-based fallback"},
    {"n": 4, "t": "FORMAT", "x": "Python module with type hints, docstrings, and pytest tests"},
    {"n": 5, "t": "TASK", "x": "Implement the recommendation engine with train/predict/evaluate methods"}
  ]
}
```
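The same six-band layout works for a writing task. The helper below is a hypothetical sketch that assembles the JSON shape shown above from a dictionary of band values. `build_prompt` is my own illustrative name, not a sinc-LLM API. The PERSONA and CONSTRAINTS values are quoted from this article, while the CONTEXT, DATA, FORMAT, and TASK values are invented for the example.

```python
import json

# Band order as it appears in the example above.
BANDS = ("PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK")

def build_prompt(bands: dict[str, str]) -> str:
    """Assemble a 6-band prompt in the JSON shape shown above (illustrative helper)."""
    missing = [b for b in BANDS if not bands.get(b)]
    if missing:
        raise ValueError(f"missing bands: {missing}")
    payload = {
        "formula": "x(t) = \u03a3 x(nT) \u00b7 sinc((t - nT) / T)",
        "T": "specification-axis",
        "fragments": [
            {"n": n, "t": band, "x": bands[band]} for n, band in enumerate(BANDS)
        ],
    }
    return json.dumps(payload, indent=2)

example_bands = {
    "PERSONA": "Experienced SaaS marketer who writes in first person with specific examples",
    "CONTEXT": "Company blog for a project-management tool",  # invented
    "DATA": "Three customer quotes and last quarter's churn figures",  # invented
    "CONSTRAINTS": "No introductory paragraph, no concluding summary, maximum 800 words, no rhetorical questions",
    "FORMAT": "Blog post with short paragraphs",  # invented
    "TASK": "Write a post on reducing churn during onboarding",  # invented
}

prompt = build_prompt(example_bands)
print(prompt)
```

Raising on a missing band is a deliberate choice in this sketch: an empty band is easy to overlook, and the comparison above suggests the structure is where most of the gain comes from.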
Pick the right model for your writing type. But always structure your prompt first with sinc-LLM. The structure matters more than the model.