Most CFOs see the AI line item double in 12 months and cannot name what doubled. This is the 9-question audit your finance team can run on Friday morning, your engineering team can fix in six weeks, and your board chair will not push back on. Each question has a healthy state, a watch state, a bleeding state, and the lever that recovers the dollars.
I learned cost discipline running electrical systems in Luanda for seven years, where every kilowatt-hour you waste shows up on a real invoice that a real CFO has to defend at a real board meeting. The grid does not forgive sloppy load profiles. Neither does production AI infrastructure. The difference is that AI vendors do not send you a load profile. They send you a total. And the total grows.
I have audited AI bills for production teams shipping live workloads. Every one of them had recoverable spend. The cheapest finding is usually idle GPU time on reserved capacity. The highest-leverage finding is usually model-tier mismatch (calls running on a $15-per-million model that would have been just as good on a $0.50-per-million model). The most embarrassing finding is usually shadow AI tools nobody in finance has ever seen.
The pattern is not that AI is expensive. The pattern is that most companies have never run an audit specifically on the AI line item, because AI feels new and inscrutable, and so the cost is treated as a tax instead of a measurable system. This audit treats it as a measurable system. The 9 questions below cover where the money goes, what healthy looks like, what bleeding looks like, and what to do about each one. Run it in one Friday morning. Hand it to your engineering team. Watch the next quarterly close.
The three states for each question are listed below. The lever that recovers the spend on each is in the full PDF, along with a one-page scorecard you can take into your next finance review.
Take total monthly AI spend and divide it by the number of successful AI-touched outcomes (not API calls). If you cannot compute the metric, that is the diagnosis. AI delivers value per outcome, not per call. Most teams track the numerator (spend arrives on an invoice) and not the denominator (outcomes have to be instrumented), so the ratio is invisible. A sketch of the computation follows the states below.
Healthy: ratio is computable, trending flat or down vs revenue per outcome
Watch: cost per outcome rising 10%+ MoM with stagnant outcome quality
Bleeding: outcomes are not instrumented, ratio is uncomputable
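A minimal sketch of the computation, assuming you can pull the monthly total from your invoices and count successful outcomes from your own logs. The numbers and function name are hypothetical, not a standard:

```python
def cost_per_outcome(monthly_ai_spend_usd: float, successful_outcomes: int) -> float:
    """Dollars of AI spend per successful AI-touched outcome (not per API call)."""
    if successful_outcomes == 0:
        # Outcomes not instrumented means the ratio is uncomputable:
        # that is the "bleeding" state, not a divide-by-zero bug.
        raise ValueError("outcomes not instrumented")
    return monthly_ai_spend_usd / successful_outcomes

# Hypothetical example: $42,000/month across 14,000 resolved tickets.
print(cost_per_outcome(42_000, 14_000))  # 3.0 dollars per outcome
```

Track the ratio month over month against revenue per outcome; the trend is the signal, not the absolute number.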
If you run dedicated AI infrastructure (reserved GPU instances, fine-tuned model deployments, vector databases), how many hours per month is the infrastructure billed and idle? "Reserved capacity" is a polite phrase for "paid whether used or not". The serverless alternative bills only on use.
Healthy: below 5% idle, or fully serverless
Watch: 5–20% idle, especially overnight or weekend
Bleeding: above 20% idle, reserved instances on unused capacity
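A hedged sketch of the idle-capacity check, assuming you can read billed hours from the invoice and busy hours from your GPU utilization metrics; the thresholds mirror the states above:

```python
def idle_pct(billed_hours: float, busy_hours: float) -> float:
    """Percent of billed infrastructure hours that did no work."""
    return 100.0 * (billed_hours - busy_hours) / billed_hours

def state(pct: float) -> str:
    # Thresholds from the healthy / watch / bleeding states above.
    if pct < 5:
        return "healthy"
    return "watch" if pct <= 20 else "bleeding"

# Hypothetical example: a reserved instance billed for a 720-hour month,
# busy for 510 of those hours.
pct = idle_pct(720, 510)
print(f"{pct:.0f}% idle -> {state(pct)}")  # 29% idle -> bleeding
```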
What percent of your premium-tier model calls (Opus, GPT-5, Gemini-Ultra) are doing work the mid-tier (Sonnet, GPT-4o, Gemini-Pro) or low-tier (Haiku, Flash) would handle just as well? The cost ratio between tiers is 5x to 30x. The work-quality ratio on bounded tasks is rarely more than 1.1x. The math nearly always favors downshifting.
Healthy: below 20% premium-tier, tier-routing rule documented
Watch: 20–50% premium-tier, no routing rule in place
Bleeding: above 50% premium-tier, "we just use the best model for everything"
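What a documented tier-routing rule can look like, as a sketch. The task classes and model identifiers are placeholders for whatever your stack actually uses:

```python
# Route by task class, defaulting down rather than up. Bounded tasks rarely
# justify the 5x-30x premium-tier cost ratio cited above.
TIER_FOR_TASK = {
    "classification": "low",
    "extraction": "low",
    "summarization": "mid",
    "drafting": "mid",
    "open_ended_reasoning": "premium",
}

MODEL_FOR_TIER = {
    "low": "low-tier-model",        # placeholder identifiers: substitute your
    "mid": "mid-tier-model",        # vendor's actual low / mid / premium models
    "premium": "premium-tier-model",
}

def route(task_class: str) -> str:
    """Pick a model for a call; unknown tasks land on mid, never premium."""
    return MODEL_FOR_TIER[TIER_FOR_TASK.get(task_class, "mid")]

print(route("classification"))  # low-tier-model, not the premium default
```

The rule matters more than the exact table: once it is written down, the percent of premium-tier calls becomes a number someone can be asked about.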
The major LLM vendors offer prompt caching, typically with a TTL window of around five minutes. The cache key is the prompt prefix: the system prompt and the early turns of the conversation. Make 10 calls in 4 minutes with a stable prefix and you can hit 90% cache. Space the same 10 calls more than five minutes apart and you hit 0%. Most teams do not measure the difference. Their bill does.
Healthy: above 80% cache hit on repeat workflows
Watch: 30–80% cache hit, opportunity unaddressed
Bleeding: below 30% cache hit, system prompts rebuilt every call
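A sketch for measuring the hit rate, assuming your vendor's usage metadata distinguishes cached from uncached input tokens per call. The field names below are placeholders, since each vendor reports this differently:

```python
def cache_hit_rate(calls: list[dict]) -> float:
    """Percent of input tokens served from the prompt cache across a workflow."""
    cached = sum(c["cached_input_tokens"] for c in calls)
    total = cached + sum(c["uncached_input_tokens"] for c in calls)
    return 100.0 * cached / total if total else 0.0

# Hypothetical workflow: ten calls inside the TTL window with a stable
# system prompt, 9,000 of every 10,000 input tokens served from cache.
calls = [{"cached_input_tokens": 9_000, "uncached_input_tokens": 1_000}] * 10
print(f"{cache_hit_rate(calls):.0f}% cache hit")  # 90% cache hit
```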
What percent of your total AI spend goes to your largest single vendor? A high concentration is not a sin. A high concentration with no priced-out alternative is. The vendor sets your pricing if you have not credibly tested a substitute in the last 12 months.
Healthy: below 60% to top vendor, alternatives tested in last year
Watch: 60–80% to top vendor, alternatives known but untested
Bleeding: above 80% to single vendor, no documented swap path
List every AI vendor contract. Mark the dollar value and renewal date of each. How much spend renews in the next 90 days without an explicit re-up decision? Auto-renewals are how vendor pricing escalates without anyone defending the increase. The pre-renewal window is the only leverage you have.
Healthy: zero auto-renewals, or all reviewed and intentionally re-upped
Watch: below 25% of AI spend on un-reviewed auto-renewals
Bleeding: above 50% on auto-renewals not priced against alternatives
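The renewal scan is a spreadsheet exercise, but a sketch makes the computation concrete. The contract register below is hypothetical and hand-maintained, not pulled from any procurement API:

```python
from datetime import date, timedelta

today = date.today()

# Hypothetical register: vendor, annual dollars, renewal date, auto-renew flag.
contracts = [
    {"vendor": "vendor-a", "annual_usd": 120_000, "renews": today + timedelta(days=40),  "auto_renew": True},
    {"vendor": "vendor-b", "annual_usd": 36_000,  "renews": today + timedelta(days=200), "auto_renew": True},
    {"vendor": "vendor-c", "annual_usd": 60_000,  "renews": today + timedelta(days=20),  "auto_renew": False},
]

# Spend that renews inside 90 days with no explicit re-up decision.
horizon = today + timedelta(days=90)
at_risk = sum(c["annual_usd"] for c in contracts if c["auto_renew"] and c["renews"] <= horizon)
print(f"${at_risk:,} renews in the next 90 days without a decision")  # $120,000
```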
How many AI tools are billed to corporate cards, department budgets, or individual subscriptions that do not appear in your central vendor consolidation? Shadow AI is the gap between what finance sees and what the company spends. It is usually 10 to 30 percent of the real number.
Healthy: zero shadow tools, SSO-required AI policy, central procurement
Watch: below 10% of spend on shadow tools
Bleeding: above 20% shadow, CISO has no visibility, data routes to unknown processors
How many hours per month does your team spend redoing AI-generated work that turned out to be wrong? Rework cost is the dark cost of AI. It does not show up on the AI invoice. It shows up as engineer time, marketing time, customer-success time, legal time. It is real money the spreadsheet does not see.
Healthy: below 5% of AI-touched work needs rework
Watch: 5–15% rework rate
Bleeding: above 15%, AI outputs require full re-verification by a human
How many engineer-hours per month go to fixing AI tool errors that no AI vendor will fix for you? When AI fails in production, somebody pays the debug bill. Usually it is your engineering team, in hours that nobody attributes back to the AI line item. If an engineer is effectively a full-time AI-fixer, that is real headcount cost on the wrong P&L row.
Healthy: below 10 engineer-hours/month on AI debugging
Watch: 10–40 hours/month, undetected drift
Bleeding: above 40 hours/month, an engineer is effectively an AI fixer
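Questions 8 and 9 share one back-of-envelope formula: hidden hours times a loaded hourly rate. A sketch, with the rate as a placeholder you should replace with your own number:

```python
LOADED_HOURLY_RATE_USD = 120  # placeholder: salary + benefits + overhead per hour

def dark_cost(rework_hours: float, debug_hours: float) -> float:
    """Monthly labor cost of redoing and debugging AI output, in dollars."""
    return (rework_hours + debug_hours) * LOADED_HOURLY_RATE_USD

# Hypothetical month: 30 hours of rework plus 40 hours of debugging.
print(f"${dark_cost(30, 40):,.0f}/month off-invoice")  # $8,400/month off-invoice
```

None of this appears on the vendor invoice, which is exactly why it belongs in the audit.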
The first deliverable is a shared spreadsheet: finance and engineering looking at the same dollar exposure per question. The CFO knows, for the first time, where the AI bill is going. Arguments about "what the AI bill bought" stop, because the document now exists.
The quick wins come next: idle infrastructure (question 2) and tier mismatch (question 3). No architecture changes required, and typically 10 to 20 percent of spend recovered. That confirms the audit was directionally right, and budget for the harder questions follows.
Then the structural fixes: caching discipline (question 4), vendor concentration (question 5), and auto-renewal renegotiation (question 6), worth another 15 to 30 percent. These are the ones that need engineering effort and finance buy-in. They are also the ones that compound year over year.
Quarterly P&L shows AI cost flat or declining despite usage growth. The CFO has a defensible narrative for the board chair. The engineering team has a recurring metric they can be measured against. Next quarter the audit gets re-run, and the bleeding questions become watch questions, and the watch questions become healthy.
One email. The PDF, the editable scorecard, and the lever-by-question recovery playbook. No drip sequence, no nurture funnel, no tactics.
Get the cost check