How to Evaluate an AI Agency's Production Track Record Before You Hire Them

By Mario Alexandre · June 21, 2026 · sinc-LLM · AI Vendor Evaluation

Table of Contents

Why Agency Demos Tell You Nothing About Production Readiness
The 10 Evaluation Criteria
How to Use These Criteria Before the First Call
Red Flags That Disqualify an Agency Regardless of Price
The Difference Between Demo-Grade and Production-Grade Work
After You Choose an Agency: What the Contract Must Say
Conclusion

Why Agency Demos Tell You Nothing About Production Readiness

An agency demo is a best-case scenario under controlled conditions. The inputs are curated, the latency is acceptable because the demo server is idle, and nobody has asked what happens at 3 AM when the primary model returns a 500 and there is no documented fallback. The demo answers one question: can this agency build something that looks right? It does not answer the question that costs you money: can they build something that stays right?

Most agency portfolios omit four things that production systems require. They omit monitoring on every critical path. They omit rollback documentation. They omit on-call procedures specific to your integration. And they omit a clear statement of who owns the source code, the model weights, and the audit trail once the engagement ends. These omissions are not accidental. Agencies benefit from vagueness on ownership and exit; the demo benefits from omitting the failure modes.

Consider a structural scenario that illustrates the pattern. A founder signs with an agency after a compelling demo. Six weeks post-launch the pipeline begins returning malformed outputs on a subset of inputs. The agency has no automated drift alert, no rollback path to the prior model version, and the prompt templates are stored in the agency's private repository under their own account. The founder's engineering team cannot diagnose the issue without access to the code. The agency's support response is five business days. The cost of that gap is not in the contract because the founder evaluated the agency on demo quality, not on production controls.

The evaluation criteria below exist to close that gap before you sign, not after.

The left column is what most buyers check. The right column is what governs whether the system survives production. The 10-Point AI Vendor Audit covers all ten items on the right with scored evaluation criteria you can use in any agency screening call.

You have seen what the agency can do on a good day. The 10-Point AI Vendor Audit tells you what happens on a bad day.

Download the 10-Point AI Vendor Audit

The 10 Evaluation Criteria

Each criterion below maps directly to a named control from the sincllm.com 10-Point AI Vendor Audit. The criterion names and descriptions are drawn from that evaluation instrument. For each criterion, the section states what to ask, what a production-grade answer looks like, and what a red-flag answer sounds like.

1. Do they monitor every critical path in production?

Audit Criterion 1: Monitoring on every critical path.

Ask: "Show me a sample of your monitoring dashboard for a production deployment. Which paths are instrumented, and what triggers an alert?" A production-grade answer names specific instrumentation points (inference latency, structured output schema validation, API error rate, model response time), the alerting threshold for each, and the on-call rotation that receives the alert. A red-flag answer: "We'll set up alerts once it's live." Monitoring defined post-launch is monitoring that does not exist when the first incident occurs.
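To make "named instrumentation points with a threshold and a recipient" concrete, here is a minimal sketch of what such a rule set could look like. The metric names, threshold values, and the `oncall-primary` recipient are illustrative assumptions, not values from the audit or from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str       # name of the instrumented signal (hypothetical)
    threshold: float  # an alert fires when the observed value exceeds this
    notify: str       # rotation or person that receives the alert


# Placeholder rules; the numbers are illustrative, not recommendations.
RULES = [
    AlertRule("inference_latency_p95_ms", 2000.0, "oncall-primary"),
    AlertRule("schema_validation_failure_rate", 0.02, "oncall-primary"),
    AlertRule("api_error_rate", 0.01, "oncall-primary"),
]


def firing_alerts(observed: dict[str, float], rules: list[AlertRule]) -> list[str]:
    """Return one message per rule that is breached or has no data.

    A metric missing from `observed` is reported too, because a critical
    path with no instrumentation is itself a monitoring gap.
    """
    messages = []
    for rule in rules:
        value = observed.get(rule.metric)
        if value is None:
            messages.append(f"{rule.metric}: no data -> {rule.notify}")
        elif value > rule.threshold:
            messages.append(
                f"{rule.metric}: {value} > {rule.threshold} -> {rule.notify}"
            )
    return messages
```

An agency with production monitoring in place should be able to show the equivalent of `RULES` for a past deployment, whatever tool actually holds it.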

2. Can they show you real SLOs from a deployed system?

Audit Criterion 2: Error budgets and SLOs.

Ask: "What SLOs did you define for your last production deployment, and what was the actual error budget consumption over the first 90 days?" A production-grade answer includes specific numeric SLOs (uptime target, latency p95, error rate ceiling) and a method for tracking error budget burn. A red-flag answer: verbal reliability claims with no numbers, or "We aim for high availability." An agency that cannot cite a past SLO cannot negotiate a meaningful one for your system. The NIST AI RMF GOVERN function (GOVERN 6) calls for policies and procedures that address AI risks arising from third-party software, data, and other supply chain issues, so supplier capability documentation is a due diligence control, not an optional trust signal.
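Error budget consumption is simple arithmetic, which is why an agency should be able to produce the number. The sketch below assumes a request-based success-rate SLO; the function name and the example figures are hypothetical.

```python
def error_budget_consumed(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent).

    slo_target is the success-rate objective, e.g. 0.99 for 99%.
    The budget is the number of failures the SLO tolerates over the
    measurement window: (1 - slo_target) * total_requests.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return failed_requests / allowed_failures


# Example with made-up numbers: a 99% SLO over 1,000,000 requests
# tolerates about 10,000 failures, so 2,500 failures is roughly a
# quarter of the budget.
consumed = error_budget_consumed(0.99, 1_000_000, 2_500)
```

A value above 1.0 means the SLO was missed for the window, which is exactly the history worth asking about.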

3. Who owns the source code, model weights, and audit trail?

Audit Criterion 3: Source-code ownership and audit trail.

Ask: "At go-live, who holds the repository, and what does handover look like? Who owns the fine-tuned model weights, prompt templates, and evaluation datasets?" A production-grade answer: your organization owns the source code, the weights, and the audit trail from day one, held in your own repository under your own account, with the agency granted access rather than holding the primary. A red-flag answer: "proprietary stack," "our platform," or a handover that is deferred to project close. ISO/IEC 42001:2023 supplier relationship controls address procurement documentation requirements that parallel this criterion. Agency exit is engineering-feasible when the contract is structured correctly, as the engineering work at we-fine-tuned-a-7b-model-to-replace-an-api demonstrates. It is not feasible when the agency retains the weights.

4. How do they detect when the model starts behaving differently?

Audit Criterion 4: Drift detection.

Ask: "What automated signal tells you that model outputs have begun to drift from the expected distribution? What is the response protocol when that signal fires?" A production-grade answer names a specific drift detection method: output distribution monitoring, schema validation failure rate, semantic similarity scoring against a reference set, or equivalent. The signal is automated, not human-reviewed on a monthly basis. A red-flag answer: no automated drift signal, or "we review outputs periodically." Drift that is not detected automatically is drift that accumulates silently until a user reports it.
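One of the methods named above, schema validation failure rate, can be sketched as a rolling-window monitor. This is a minimal illustration under stated assumptions: the class name, window size, baseline rate, and tolerance are all hypothetical, and a real deployment would feed the signal into its alerting system rather than return a boolean.

```python
from collections import deque


class SchemaFailureDriftMonitor:
    """Fires when the recent schema-failure rate exceeds baseline + tolerance."""

    def __init__(self, window_size: int, baseline_rate: float, tolerance: float):
        self.results = deque(maxlen=window_size)  # True = output passed validation
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance

    def record(self, output_valid: bool) -> bool:
        """Record one validation result; return True when the drift signal fires."""
        self.results.append(output_valid)
        if len(self.results) < self.results.maxlen:
            return False  # wait for a full window before judging
        failure_rate = self.results.count(False) / len(self.results)
        return failure_rate > self.baseline_rate + self.tolerance
```

The point of the sketch is that the check runs on every output automatically; nobody has to remember to review a sample once a month.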

5. What happens when the primary model fails?

Audit Criterion 5: Fallback paths.

Ask: "What is the documented fallback when the primary model returns a 500, a timeout, or a structurally invalid response?" A production-grade answer describes a specific fallback path: a secondary model endpoint, a deterministic rule-based handler, a cached prior response with staleness limits, or a graceful degradation that surfaces a defined error state to the user. The fallback is documented, tested, and deployable without a human intervention step at 3 AM. A red-flag answer: "we'll handle it" with no documented fallback, or "the model is reliable enough that we haven't needed one." No model is reliable enough to run without a fallback path in a production system.
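A documented fallback path can be as small as an ordered list of handlers with a validity check between them. The sketch below is one possible shape, not a prescribed design: `call_with_fallback`, the handler list, and the JSON-object validity rule are assumptions made for illustration.

```python
import json
from typing import Callable, Sequence


def is_valid_json_object(text: str) -> bool:
    """Treat a response as structurally valid only if it parses to a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except (TypeError, ValueError):
        return False


def call_with_fallback(
    prompt: str,
    handlers: Sequence[Callable[[str], str]],
    is_valid: Callable[[str], bool] = is_valid_json_object,
) -> dict:
    """Try each handler in order and return the first valid response.

    handlers might be: primary model, secondary model, rule-based handler.
    If every handler fails, return a defined degraded state instead of
    raising, so the caller can surface a known error to the user.
    """
    for index, handler in enumerate(handlers):
        try:
            response = handler(prompt)
        except Exception:
            # Broad catch is deliberate in this sketch: a 500, a timeout,
            # or any other failure all mean "move to the next handler".
            continue
        if is_valid(response):
            return {"status": "ok", "source": index, "response": response}
    return {"status": "degraded", "source": None, "response": None}
```

Because the chain is plain code, it can be exercised in tests with handlers that raise or return garbage, which is what "tested" should mean in the agency's answer.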

6. How do they detect and alert on cost anomalies?

Audit Criterion 6: Cost-anomaly alarms.

Ask: "What alerts fire before a cost anomaly shows up in the invoice? What is the threshold, and who receives the alert?" A production-grade answer names a specific cost monitoring setup: token usage per request tracked against a baseline, a daily or hourly budget ceiling with an alert threshold below the hard limit, and a named recipient for the alert (not just the agency's internal team). A red-flag answer: no cost monitoring before invoice, or "we track it in the vendor dashboard after the fact." A cost anomaly that surfaces only in an invoice has already been running for the entire billing period. If you are deciding whether to build your own cost monitoring rather than relying on the agency's, the AI Build vs Buy Framework includes cost monitoring as one of its ten evaluation criteria.
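The two checks described above, a budget ceiling with an alert threshold below the hard limit and per-request token usage against a baseline, can be sketched in a few lines. All parameter names and numbers here are hypothetical; the recipient routing is left out because it depends on the alerting tool in use.

```python
def cost_alerts(
    spend_today: float,
    daily_hard_limit: float,
    alert_fraction: float,
    request_tokens: int,
    baseline_tokens: float,
    spike_factor: float,
) -> list[str]:
    """Return the cost alerts that should fire for the current state.

    alert_fraction: share of the hard limit at which to warn (e.g. 0.8),
    so the alert arrives before the limit is hit, not after.
    spike_factor: how many times the per-request baseline counts as a spike.
    """
    alerts = []
    if spend_today >= alert_fraction * daily_hard_limit:
        alerts.append(
            f"budget: {spend_today} is at or above "
            f"{alert_fraction:.0%} of daily limit {daily_hard_limit}"
        )
    if request_tokens > spike_factor * baseline_tokens:
        alerts.append(
            f"spike: request used {request_tokens} tokens, "
            f"baseline is {baseline_tokens}"
        )
    return alerts
```

Run hourly or per request, a check like this surfaces a runaway loop within the day rather than at the end of the billing period.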

7. What is their model-update cadence, and can they roll back to a prior version?

Audit Criterion 7: Model-update cadence and rollback.

Ask: "When you update the model or the prompt, what is the deployment procedure, and how long does a rollback to the prior version take?" A production-grade answer includes: updates deployed through a versioned pipeline with staging validation before production promotion, rollback achievable in under 15 minutes by running a single command or pipeline trigger, and a documented changelog for every model or prompt version in production. A red-flag answer: no version control for the model or prompt, or "we update it as needed and haven't needed to roll back." A system with no rollback capability has no recovery path when an update degrades output quality.
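The core of a rollback capability is that every prompt or model version is recorded and the active pointer can move back in one step. The following in-memory sketch illustrates that idea only; `PromptRegistry` and its methods are hypothetical names, and a real pipeline would keep versions in version control or a deployment system rather than in process memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    version_id: str
    template: str
    change_note: str  # changelog entry for this version


class PromptRegistry:
    """Keeps every published version and a pointer to the active one."""

    def __init__(self) -> None:
        self._history: list[PromptVersion] = []
        self._active_index = -1

    def publish(self, version_id: str, template: str, change_note: str) -> PromptVersion:
        """Append a new version and make it active."""
        version = PromptVersion(version_id, template, change_note)
        self._history.append(version)
        self._active_index = len(self._history) - 1
        return version

    def active(self) -> PromptVersion:
        if self._active_index < 0:
            raise LookupError("no version has been published")
        return self._history[self._active_index]

    def rollback(self) -> PromptVersion:
        """Move the active pointer to the previous version; history is kept."""
        if self._active_index < 1:
            raise LookupError("no prior version to roll back to")
        self._active_index -= 1
        return self.active()

    def changelog(self) -> list[PromptVersion]:
        return list(self._history)
```

Note that rollback here never deletes the newer version, so the changelog remains a complete record of what ran in production.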

8. What does on-call look like for your integration specifically?

Audit Criterion 8: On-call and incident response.

Ask: "What is the on-call rotation for the specific integration you will build for us? What is the escalation path, and what is the response time SLA for a P1 incident?" A production-grade answer names the on-call engineer or rotation, states a specific response time for a P1 incident (not "business hours"), and describes the escalation path when the first responder cannot resolve the issue. A red-flag answer: generic "business hours" support for a system running 24/7, or "we'll define the support terms in the contract." Support terms defined in the contract after the scope call are terms the agency intends to minimize. The 2025 OWASP Top 10 for LLM Applications lists LLM09 (Misinformation), which covers overreliance on unverified model output, as a named production risk; an engagement without adequate incident controls leaves that risk unmanaged.

9. How is your data handled, where does it go, and who can access it?

Audit Criterion 9: Data-handling and privacy boundaries.

Ask: "Where does our production data go at inference time? Who at the agency has access to it, and under what conditions is it used for fine-tuning or logged?" A production-grade answer states the data residency region, names the third-party model provider if one is used, describes the logging policy (what is logged, for how long, and under what access controls), and states explicitly whether production data is used for training. A red-flag answer: vague data-residency answers, "it stays with our platform," or confirmation that production data is sent to a third-party API under the agency's account rather than yours.

10. What does a clean handover look like, and what happens if you part ways?

Audit Criterion 10: Documented handover, no lock-in.

Ask: "Walk me through what a complete handover looks like. What do we receive, and what dependencies remain on your tooling, accounts, or infrastructure after the engagement ends?" A production-grade answer includes: a complete repository transfer, documentation of every third-party dependency and how to replace it, the absence of license or API dependencies that require the agency's ongoing participation, and an exit clause in the contract that defines the handover timeline and deliverables. A red-flag answer: no exit clause, dependency on agency-proprietary tooling with no documented alternative, or a handover that is listed as "to be defined" in the proposal. Once this evaluation confirms that the agency passes the production controls screen, the next step is evaluating AI vendor lock-in and code ownership at the contract stage.

#	Evaluation Question	Audit Criterion	What Good Looks Like	Red Flag Answer
1	Do they monitor every critical path?	Monitoring on every critical path	Named instrumentation points, alerting thresholds, on-call rotation	"We'll set up alerts once it's live"
2	Can they show real SLOs?	Error budgets and SLOs	Numeric SLOs from a prior deployment, error budget tracking method	Verbal reliability claims, no numbers
3	Who owns source code, weights, audit trail?	Source-code ownership and audit trail	Client owns repository, weights, and audit trail from day one	"Proprietary stack" or delayed handover
4	How do they detect model drift?	Drift detection	Automated drift signal, named method, documented response protocol	No automated drift signal; periodic human review only
5	What is the fallback when the model fails?	Fallback paths	Documented fallback path, tested, deployable without manual step	"We'll handle it" with no documented fallback
6	How do they detect cost anomalies?	Cost-anomaly alarms	Automated alert before invoice, named threshold and recipient	No cost monitoring before invoice
7	What is the model-update and rollback procedure?	Model-update cadence and rollback	Versioned pipeline, staging validation, rollback under 15 minutes	No version control for model or prompt
8	What does on-call look like for this integration?	On-call and incident response	Named on-call rotation, specific P1 response time, escalation path	Generic business-hours support for a 24/7 system
9	Where does our data go and who can access it?	Data-handling and privacy boundaries	Named residency region, access controls, logging policy, no unauthorized training use	Vague data-residency answers or third-party sharing under agency account
10	What does a clean handover look like?	Documented handover, no lock-in	Complete repository transfer, documented dependencies, exit clause in contract	No exit clause, dependency on agency-proprietary tooling

// Free · 10-Point Audit

Know what you are buying before you sign.

The 10-Point AI Vendor Audit translates these questions into a repeatable production-engineering checklist: source-code ownership, audit trail, SLOs, fallback paths, and exit clause. Free 16-page PDF, 15 minutes per vendor.

→ Get the 10-Point AI Vendor Audit

How to Use These Criteria Before the First Call

The most common mistake in agency evaluation is asking these questions live during a call, where the agency can answer verbally and move on before you have time to probe. A better procedure has three steps.

Step 1: Send written questions 48 hours in advance. Email each agency on your shortlist the 10 evaluation questions before the screening call. Frame them as your standard evaluation process, not an accusation. A production-grade agency will recognize these as standard engineering procurement questions. An agency that pushes back on providing written answers is telling you something about how they handle documentation in production.

Step 2: Require written answers, not verbal explanations on the call. Verbal answers are not verifiable and not comparable across agencies. Written answers create a record, force specificity, and give your technical team time to assess the quality of each response before you reconvene. A written answer that says "we will define monitoring thresholds after go-live" is the same red flag as a verbal one, but it is in writing where you can reference it.

Step 3: Score each agency on the same rubric. Use the scorecard below to assign a score from 1 to 3 per criterion for every agency on your shortlist. Score all agencies in a single pass, criterion by criterion, rather than one agency at a time. In sequential evaluation, recency bias lets the most recent impressive demo displace the memory of a better answer from an earlier agency. A shared scorecard keeps the comparison consistent.

#	Question	Agency Response Summary	Score (1-3)
1	Monitoring on every critical path?
2	Real SLOs from a deployed system?
3	Source-code, weights, audit trail ownership?
4	Automated drift detection method?
5	Documented fallback path?
6	Cost-anomaly alert before invoice?
7	Model-update cadence and rollback procedure?
8	Named on-call for this integration?
9	Data residency, access controls, logging policy?
10	Documented handover and exit clause?

Score 3 for a specific, measurable, documentable answer. Score 2 for a partially specific answer that defers one element to post-contract. Score 1 for a vague or verbal-only answer. Any criterion scored 1 is a disqualifying flag unless the agency can provide documentation within 48 hours of the call.
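The scoring rule above is mechanical enough to express in code, which helps when comparing several agencies side by side. This is a minimal sketch; the function name and the shape of the result are assumptions, while the 1 to 3 scale and the treatment of a score of 1 come from the rubric.

```python
def summarize_scorecard(scores: dict[int, int]) -> dict:
    """Summarize one agency's scorecard.

    scores maps criterion number (1 to 10) to a score of 1, 2, or 3.
    Any criterion scored 1 is flagged: per the rubric it disqualifies the
    agency unless documentation arrives within 48 hours of the call.
    """
    if set(scores) != set(range(1, 11)):
        raise ValueError("scores must cover criteria 1 through 10 exactly")
    if any(score not in (1, 2, 3) for score in scores.values()):
        raise ValueError("each score must be 1, 2, or 3")
    flagged = sorted(c for c, score in scores.items() if score == 1)
    return {
        "total": sum(scores.values()),
        "max_total": 30,
        "flagged_criteria": flagged,
        "needs_documentation": bool(flagged),
    }
```

Keeping the flagged criteria separate from the total matters: a high total does not offset a single criterion scored 1.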

Red Flags That Disqualify an Agency Regardless of Price

Red Flag	What It Signals	Which Criterion It Violates
"We'll set up monitoring / alerts / SLOs after go-live."	Production controls will not exist when the first incident occurs. The agency does not have a production engineering discipline; they have a delivery discipline.	Criteria 1, 2, 8
"Our platform handles all of that; you don't need to worry about it."	You will have zero visibility into the controls that govern your production system. This is the architecture of lock-in.	Criteria 3, 4, 10
"We've never had a major incident with a client."	Either untrue, unverifiable, or evidence that they have never operated a system complex enough to have incidents. The claim cannot be validated and does not substitute for documented fallback paths.	Criterion 5
"The contract will cover support; let's align on scope first."	Support terms deferred to the contract stage will be minimized at negotiation time, when the agency has already won the engagement. On-call and incident response must be scoped before you sign.	Criterion 8
"We use the latest models and update automatically."	Automatic model updates with no versioning, staging validation, or rollback path mean any model provider update can degrade your production outputs without warning and without a recovery path.	Criterion 7

The Difference Between Demo-Grade and Production-Grade Work

The evaluation criteria above are not theoretical. They reflect the controls required to run a production AI pipeline reliably over time. Sincllm's own production benchmark on sr-demo-ai.com is 99% pipeline reliability across 500+ transcripts. That benchmark was not achieved by building a good demo. It required monitoring on every inference call, schema validation on every output, a fallback path for malformed responses, and structured output discipline applied consistently across the full transcript set. The 55 hours per month recovered for one client engagement came from the same foundation: not prompt quality alone, but pipeline reliability that prevented silent failures from accumulating into rework cycles.

My background is seven years of electrical engineering in Luanda and a BSEE from the University of South Florida. The evaluation framework in this article borrows from engineering procurement discipline, not agency pitch culture. In electrical engineering, a supplier's production track record is evaluated before specification, not after. The component data sheet and the failure mode analysis are required inputs to the design, not optional due diligence. The same logic applies to AI agency selection: the production controls are the data sheet. The demo is the sales brochure.

The NIST AI RMF GOVERN function (GOVERN 6) addresses third-party and supply chain risk as an organizational control for AI risk management. ISO/IEC 42001:2023 supplier relationship requirements address procurement documentation for AI management systems. Neither standard treats a demo and a portfolio as sufficient due diligence for a production AI engagement. The 2025 OWASP Top 10 for LLM Applications names LLM06 (Excessive Agency) and LLM09 (Misinformation, which covers overreliance on model output) as production risks that a poorly structured agency engagement directly introduces when controls like fallback paths, drift detection, and on-call are left undefined.

After You Choose an Agency: What the Contract Must Say

The evaluation framework above is the pre-hire screening layer. Once an agency passes the 10-criterion screen, the next step is ensuring the contract encodes the answers you received. An agency that gave good answers to the ownership and exit-clause questions must have those terms in the contract. An agency that deferred one item to "post-contract alignment" must resolve that deferral before you sign.

The two articles that cover the contract stage in detail are AI vendor contract questions (the specific questions to negotiate at the contract stage) and AI vendor lock-in and code ownership (the exit-risk dimension, including what happens to the weights and the codebase if the engagement ends early). Both cover the contract-signing stage; this article covers the pre-hire evaluation stage. They are designed to be used in sequence.

Conclusion

The question "what is your production track record?" is the right question to ask an AI agency before you hire them. The 10 evaluation criteria in this article make that question answerable before you sign a contract: monitoring on every critical path, error budgets and SLOs, source-code ownership and audit trail, drift detection, fallback paths, cost-anomaly alarms, model-update cadence and rollback, on-call and incident response, data-handling and privacy boundaries, and documented handover with no lock-in.

A production-grade agency will not be surprised by these questions. They will have specific, documented, verifiable answers to each one. An agency that cannot answer them has not built production systems. They have built demos. The difference between those two things is the cost you pay when the system fails at 3 AM and there is no rollback, no alert, and no on-call engineer who knows the specific integration you are running.

// 30-Minute Production Review

Bring your current AI setup. We will tell you what is production-ready and what is not.

A focused 30-minute audit call with a production AI engineer (7 years EE, BSEE University of South Florida, sincllm-mcp v2.0.0 in production). No pitch deck. You bring the architecture; we bring the checklist.

→ Book the 30-Minute Production Review
