Most AI agencies ship demos. Few engineer for production. This is the 10-point checklist that tells you which is which, in 15 minutes, before you commit a six-figure budget. Built on BSEE-grade engineering fundamentals: redundancy, fault tolerance, monitoring, ownership.
For a while, I assumed the AI agency landscape was uniform. They all ship working demos. They all promise integrations. They all show charts at the kickoff. Picking one was a coin flip on price and personality.
Then I watched a $50K AI project break in week 3 because nobody had built monitoring. The agency had wired a Claude prompt to a Stripe webhook and called it shipped. When the Stripe API rate-limited under real volume, the project went silent. No alarm. No fallback. No log of what failed. Just a stack of refund tickets and a confused operations team.
That's when I realized the agencies-are-the-same story is wrong, and the fix isn't shopping harder. It's learning to spot the engineering layer the agency doesn't talk about. Production AI has a structural shape. Demos don't. The 10 points below make that shape visible. Run them on any vendor in 15 minutes. The ones who pass are the ones engineered for production.
Each criterion is a yes/no with a single specific question to ask. If your vendor can't answer it in plain English, that's the fail signal. You don't have to evaluate the answer's merit; you only have to notice whether they have an answer at all.
Production AI fails silently more often than it fails loudly. Without monitoring, the first signal is angry customers. Every external API call, every model invocation, every webhook trigger needs an observability hook.
Ask: "Show me the dashboard for the last 24 hours of production traffic. If you can't, that's a fail."
Engineering teams that work in production write down what "acceptable failure" looks like. Without a stated availability target, you can't tell whether an outage is normal or not, and you can't decide when to roll back.
Ask: "What is the system's stated availability target, and what happens operationally when it's breached?"
If you can't see the code that runs your workload, you don't own it. Period. No "we'll send a copy on request". No "it's hosted on our platform". Real ownership means git access today, not a promise.
Ask: "Walk me through the git history of the production deploy. Who committed what, when?"
An AI system that worked on a demo dataset can quietly degrade as real input distributions shift. Without drift detection, you discover the degradation through customer complaints. Through quality-of-service tickets. Through churn.
Ask: "How does the system detect that the distribution of inputs has shifted from training time?"
Every LLM API will return a 5xx eventually. Every model you depend on will eventually be deprecated. Production code knows what to do when the primary call fails. Demo code throws an exception and stops.
Ask: "Trace the code path that runs when OpenAI or Anthropic returns a 5xx. Show me the line."
A misconfigured retry loop or a bug in a prompt template can multiply your monthly bill 10x in an afternoon. You need a hard ceiling, an alert on unusual spend, and a kill switch. Not on next month's invoice. Right now.
Ask: "What dollar threshold of unexpected spend triggers an alert, and to whom?"
Anthropic ships a new Claude every few months. OpenAI ships a new GPT every few months. Each is a behavioral change. Without a documented evaluation procedure, model updates either ship blind or never ship at all.
Ask: "When the next major model ships, what's the documented procedure for evaluating, deploying, and rolling back?"
If your AI system goes down at 2 AM on a Saturday, somebody has to wake up. If nobody is named, nobody is responsible. The agency that says "we monitor business hours" is telling you exactly what happens at 2 AM: nothing.
Ask: "Who gets paged when the system goes down at 2 AM, and what's the documented response time?"
Your customer data flows through the AI vendor's stack into model APIs and back. Each hop is a privacy decision. Without a clear, documented map of those hops, your compliance posture is a guess.
Ask: "What customer data leaves your infrastructure, where does it go, and how is it handled at each hop?"
Every consulting engagement ends. The good ones leave you operating, with full source, deployment scripts, runbooks, credentials. The bad ones leave you locked out of a dashboard you can't access without the vendor's login.
Ask: "If I terminate our contract tomorrow, exactly what do I receive, and what do I keep operating?"
Not "I have a bad feeling about this". Specific gaps, in writing, that you brought to a vendor meeting and they couldn't fill. That's a different conversation than "is this working".
The audit shifts the conversation from "tell me about your AI" to "show me your monitoring dashboard". Vendors who can answer the second question want different clients than vendors who only handle the first.
If something breaks 6 months from now, the audit notes are the difference between "we never asked" and "we asked, and they said it was handled". That difference matters legally, contractually, and politically inside your organization.
Each failed criterion is a specific gap with a specific fix. That's not a "we need to rebuild everything" panic. It's a prioritized list, sized by effort, that your engineering team can execute against. Or that you can hand to me, if you want it done.
One email. The PDF, the editable scorecard, and the list of follow-up questions for vendor meetings. No drip sequence, no nurture funnel, no tactics.
Get the audit