I tested 7 AI models on 60 writing tasks — blog posts, marketing emails, technical documentation, creative fiction, research reports, and social media content. I scored each output on 8 dimensions. The results are not what the marketing materials suggest. The "best" AI for writing depends entirely on the writing type and, more importantly, on how you structure your prompt.
Each model received the same 60 writing tasks, both as raw prompts and as sinc-LLM 6-band structured prompts. Scoring dimensions: accuracy, tone compliance, length compliance, style consistency, originality, readability, format adherence, and constraint following. Each dimension scored 0-10.
| Rank | Model | Structured (0-10) | Raw (0-10) | Improvement |
|---|---|---|---|---|
| 1 | Claude Sonnet 4 | 8.7 | 6.2 | +40% |
| 2 | GPT-4o | 8.2 | 6.5 | +26% |
| 3 | Gemini 2.5 Pro | 7.9 | 5.8 | +36% |
| 4 | Llama 3.1 405B | 7.4 | 5.1 | +45% |
| 5 | Mistral Large | 7.2 | 5.4 | +33% |
| 6 | GPT-4o mini | 6.8 | 5.0 | +36% |
| 7 | Claude Haiku 3.5 | 6.5 | 4.7 | +38% |
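The per-model scores in the table are composites of the eight dimension scores. A minimal sketch of that aggregation, assuming equal weighting (the article does not state how the dimensions are combined):

```python
# Assumed composite: the mean of the eight 0-10 dimension scores.
# Dimension names are taken from the methodology section above.
DIMENSIONS = (
    "accuracy", "tone compliance", "length compliance", "style consistency",
    "originality", "readability", "format adherence", "constraint following",
)

def composite_score(scores: dict[str, float]) -> float:
    """Average the eight 0-10 dimension scores into one 0-10 composite."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

uniform = {d: 8.0 for d in DIMENSIONS}
print(composite_score(uniform))  # 8.0
```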
Claude Sonnet 4 scored highest on three critical writing dimensions: tone compliance (9.2), style consistency (9.0), and constraint following (9.4). When the PERSONA band specifies "conversational but authoritative, first person, no jargon," Claude delivers exactly that. GPT-4o tends to drift toward its recognizable house style. Gemini sometimes ignores style constraints entirely.
The CONSTRAINTS band (responsible for 42.7% of output quality in sinc-LLM's measurements) is where Claude's advantage is most visible. Tell Claude "no introductory paragraph, no concluding summary, maximum 800 words, no rhetorical questions" and it follows every constraint. GPT-4o follows most constraints but often adds a brief introduction anyway.
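Some of these constraints can be verified mechanically after generation. A sketch of a post-hoc checker for the example constraints above; the rules (word count, question marks, a stock-phrase heuristic for concluding summaries) are illustrative, not part of sinc-LLM:

```python
import re

def check_constraints(text: str, max_words: int = 800) -> dict[str, bool]:
    """Return pass/fail for each mechanically checkable constraint."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return {
        # "maximum 800 words"
        "max_words": len(text.split()) <= max_words,
        # "no rhetorical questions": crude proxy, any question mark fails
        "no_rhetorical_questions": "?" not in text,
        # "no concluding summary": flag a final sentence that opens with
        # a stock summary phrase (a heuristic, easy to fool)
        "no_concluding_summary": not sentences[-1].lower().startswith(
            ("in conclusion", "in summary", "to sum up")
        ),
    }

draft = "Structured prompts win. In summary, use the six bands."
print(check_constraints(draft))
```

Running the checker on the sample draft flags the "In summary" close while the other two constraints pass.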
Blog posts. Winner: Claude Sonnet 4 (8.9/10). Claude produces blog posts that read like they were written by a human with domain expertise. The PERSONA band is critical: "experienced SaaS marketer who writes in first person with specific examples" produces dramatically different output than "write a blog post."
Marketing emails. Winner: GPT-4o (8.5/10). GPT-4o edges ahead for marketing emails because it naturally produces punchy, action-oriented copy. Claude's writing is sometimes too nuanced for marketing: it adds caveats where a marketer would not.
Technical documentation. Winner: Claude Sonnet 4 (9.1/10). Claude's precision and constraint following make it ideal for documentation. When the FORMAT band specifies "API reference with parameters table, example request/response, error codes," Claude produces documentation that could ship as-is.
Creative fiction. Winner: Claude Sonnet 4 (8.8/10). Claude produces more varied, less predictable creative writing. GPT-4o's creative fiction has a recognizable pattern; Claude's is harder to identify as AI-generated.
Research reports. Winner: Gemini 2.5 Pro (8.4/10). Gemini's large context window and web connectivity give it an edge for research-heavy writing. The DATA band is critical: provide the research sources and Gemini synthesizes them better than the other models.
Social media content. Winner: GPT-4o (8.3/10). GPT-4o produces social media content that feels native to each platform. The FORMAT band must specify the platform: "Twitter/X thread, 5 posts, each under 280 characters" produces different output than "LinkedIn post, 200-300 words, professional tone."
Across all 7 models and all 60 tasks, structured prompts outperformed raw prompts by an average of 36%. The improvement was largest for the weaker models: Llama 3.1 405B improved 45% with structure, compared to Claude Sonnet 4's 40%. Structured prompts therefore partially close the gap between expensive and cheap models.
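The improvement column and the 36% average follow directly from the structured and raw composites in the benchmark table, as a quick recomputation shows:

```python
# (structured, raw) composite scores from the benchmark table above.
scores = {
    "Claude Sonnet 4":  (8.7, 6.2),
    "GPT-4o":           (8.2, 6.5),
    "Gemini 2.5 Pro":   (7.9, 5.8),
    "Llama 3.1 405B":   (7.4, 5.1),
    "Mistral Large":    (7.2, 5.4),
    "GPT-4o mini":      (6.8, 5.0),
    "Claude Haiku 3.5": (6.5, 4.7),
}

# Relative improvement of structured over raw, in percent.
improvements = {
    model: (structured - raw) / raw * 100
    for model, (structured, raw) in scores.items()
}
average = sum(improvements.values()) / len(improvements)
print(f"average improvement: {average:.0f}%")  # average improvement: 36%
```

The largest single gain in `improvements` is Llama 3.1 405B's, consistent with the weaker-models-gain-more pattern described above.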
A $0.001/query model with structured prompts produces better writing than a $0.015/query model with raw prompts. Prompt structure is the cheapest upgrade available.
For reference, here is an example sinc-LLM 6-band prompt, serialized as JSON (this particular example is a coding task):

```json
{
  "formula": "x(t) = Σ x(nT) · sinc((t - nT) / T)",
  "T": "specification-axis",
  "fragments": [
    {"n": 0, "t": "PERSONA", "x": "Expert data scientist with 10 years ML experience"},
    {"n": 1, "t": "CONTEXT", "x": "Building a recommendation engine for an e-commerce platform"},
    {"n": 2, "t": "DATA", "x": "Dataset: 2M user interactions, 50K products, sparse matrix"},
    {"n": 3, "t": "CONSTRAINTS", "x": "Must use collaborative filtering. Latency under 100ms. No PII in logs. Python 3.11+. Must handle cold-start users with content-based fallback"},
    {"n": 4, "t": "FORMAT", "x": "Python module with type hints, docstrings, and pytest tests"},
    {"n": 5, "t": "TASK", "x": "Implement the recommendation engine with train/predict/evaluate methods"}
  ]
}
```
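A fragment list like the one above has to be flattened into a single prompt string before it is sent to a model. A minimal sketch of that step; the "BAND: text" layout and blank-line separator are assumptions, since the article does not specify sinc-LLM's actual serialization:

```python
# Abbreviated fragment list in the same shape as the JSON example,
# deliberately out of order to show the sort by band index "n".
fragments = [
    {"n": 3, "t": "CONSTRAINTS", "x": "Must use collaborative filtering. Latency under 100ms."},
    {"n": 0, "t": "PERSONA", "x": "Expert data scientist with 10 years ML experience"},
    {"n": 5, "t": "TASK", "x": "Implement the recommendation engine with train/predict/evaluate methods"},
]

def assemble_prompt(fragments: list[dict]) -> str:
    """Join band fragments in index (n) order as 'BAND: text' blocks."""
    ordered = sorted(fragments, key=lambda f: f["n"])
    return "\n\n".join(f"{f['t']}: {f['x']}" for f in ordered)

print(assemble_prompt(fragments))
```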
Choose the right model for your writing type. But always structure the prompt first with sinc-LLM. The structure matters more than the model.