How to Write an AI Incident Runbook Your On-Call Team Will Actually Use

By Mario Alexandre · June 21, 2026 · sinc-LLM · AI Incident Readiness

Why Most AI Runbooks Fail at 3 AM

There is a gap between a compliance document and an operational playbook. A compliance document proves you thought about incident response. An operational playbook tells the engineer who was just paged at 2:47 AM exactly what to do in the next 90 seconds.

Most AI runbooks are compliance documents. They satisfy an audit requirement, live in Confluence, and are written in the future tense of planning: "The team will engage the kill-switch process and coordinate with the vendor." That sentence is useless to an on-call engineer who needs to know the kill-switch location, the access credential, the activation command, and the confirmation signal, in that order, right now.

When the runbook does not answer those questions, the team improvises. Improvisation in an AI incident is expensive in a way that a standard software incident is not, because the failure mode is often invisible until the blast radius is already set. Consider a structural scenario: a tool-calling agent begins producing anomalous outputs at 2:47 AM. The on-call engineer checks the runbook. The kill-switch section reads "contact the platform team." The platform lead is on vacation. The backup is not named. The engineer finds a manual API call in an old Slack thread and uses it to disable the agent at 3:28 AM, roughly forty minutes after the page, breaking the audit trail in the process. The post-incident review then cannot reconstruct what the agent did between 2:47 AM and 3:28 AM because the log preservation step was also not documented.

This article gives you the six sections that separate a runbook your team will reach for from one that proves you thought about it.

Before going further: use the free stability auditor tool to confirm your pipeline's failure modes are observable before you document them in a runbook. A runbook section for a failure mode you cannot detect is not a safety net; it is a false one.

Does your current runbook cover all 12 production controls? The AI Incident Readiness Audit shows you exactly which ones are missing.

Download the 12-Control AI Incident Readiness Audit

What Makes an AI Incident Different From a Standard Software Incident

AI incidents have four properties that standard software incidents do not. Each one requires a runbook section that a generic SRE template does not include.

Figure: AI Incident Type to Runbook Section Decision Flow (incident detected → primary section → first action). Silent drift → Section 3: Rollback → compare to last good version. Hard error → Section 1: Kill-switch → activate kill-switch, log time. Blast radius → Section 2: Tool Boundary → enumerate tool calls in window. Injection event → Section 1 + Section 5 → preserve audit trail first. Vendor outage → Section 4: Escalation → page escalation contact.

Silent degradation. A database going offline throws an exception. An AI model serving degraded outputs does not. Output quality collapse, model behavior drift, and prompt injection events all produce semantically wrong responses that look structurally correct to monitoring systems watching for HTTP 500s. The on-call team is often paged by a downstream signal (customer complaints, SLO breach on a business metric) long after the failure began. A runbook that starts at "service is down" is missing the first 40 minutes of an AI incident.
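The gap between status-based and quality-based monitoring can be illustrated with a minimal sketch. It assumes some evaluator already produces a per-response quality score between 0 and 1; the score source, the baseline, and the `max_drop` threshold are all hypothetical and would need tuning for a real pipeline.

```python
from statistics import mean


def status_alert(status_codes):
    """Conventional check: fires only on server errors."""
    return any(code >= 500 for code in status_codes)


def quality_alert(scores, baseline, max_drop=0.15):
    """Fires when the mean quality score falls more than max_drop below baseline."""
    return mean(scores) < baseline - max_drop


# Hypothetical sample: every request succeeded at the HTTP layer,
# but an (assumed) evaluator scored the outputs well below baseline.
status_codes = [200, 200, 200, 200, 200]
scores = [0.62, 0.58, 0.60, 0.55, 0.57]

print(status_alert(status_codes))            # False
print(quality_alert(scores, baseline=0.90))  # True
```

The status check stays silent while the quality check fires, which is the situation the runbook needs an entry point for.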

Blast radius ambiguity. OWASP LLM Top 10 (2025) identifies Excessive Agency (LLM06) as the failure mode where a model or agent takes actions beyond intended scope. In a tool-calling agent, the blast radius is set by the time between the first anomalous action and the kill-switch activation, not by the time between the page and the fix. Without documented tool boundary limits and a defined blast radius per tool, the post-incident review cannot answer "what did it do?"
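The window described above can be sketched as a filter over tool call records. This is an illustration only: the `ToolCall` record shape, the tool names, and the timestamps are assumptions, and a real tool call log will have its own schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ToolCall:
    # Hypothetical log record shape; real tool call logs will differ.
    tool: str
    called_at: datetime


def blast_radius_window(calls, first_anomaly, kill_switch_at):
    """Return calls made from the first anomalous action up to kill-switch activation."""
    return [c for c in calls if first_anomaly <= c.called_at <= kill_switch_at]


calls = [
    ToolCall("send_email", datetime(2026, 6, 21, 2, 40)),
    ToolCall("update_record", datetime(2026, 6, 21, 2, 50)),
    ToolCall("send_email", datetime(2026, 6, 21, 3, 10)),
    ToolCall("update_record", datetime(2026, 6, 21, 3, 30)),
]
in_window = blast_radius_window(
    calls,
    first_anomaly=datetime(2026, 6, 21, 2, 47),
    kill_switch_at=datetime(2026, 6, 21, 3, 28),
)
print([c.tool for c in in_window])  # ['update_record', 'send_email']
```

Both bounds depend on timestamps the team must already be recording: without a logged kill-switch activation time, the upper bound of the window is a guess.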

Rollback complexity. Rolling back a code deployment reverses a deterministic change. Rolling back a model version does not: the prompt may have changed, the system context may have changed, and restoring the prior model does not by itself restore the prior behavior. The rollback section of an AI runbook must specify the last known good version of both the model and the prompt template, the rollback command, and the verification signal that confirms the rollback succeeded.

Prompt injection as an incident class. As detailed in the adversarial validation patterns guide, behavior change triggered by user input is not a bug in the traditional sense: it is an incident class with its own response procedure. The runbook must include a pre-call validation step and a documented response for injection events that is separate from the general "service degradation" response.

The 6-Section AI Incident Runbook Template

The six sections below map directly to named controls from the 12-Control AI Incident Readiness Audit. The mapping matters: the runbook is the interface your on-call team uses; the audit is the source of truth for whether the controls the runbook invokes actually exist in your production system. A runbook section that points to a control that is not implemented is a false safety net.

Section 1: Kill-Switch Location and Activation Procedure

Maps to /incident-readiness/ Control 1 (Kill-switch). This section exists in the runbook rather than a separate doc because the on-call engineer needs it in the first 60 seconds of an incident, before they have time to look for it.

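Required documentation fields: kill-switch location, credential vault path, activation steps, confirmation signal, primary owner (named), and backup owner (named).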

Failure mode when missing: The on-call engineer pages the "platform team" at 3 AM, gets no response for 20 minutes, and improvises a shutdown via a method that breaks the audit trail. The post-incident review has no record of when the agent was actually stopped.
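One way to keep this section from silently going stale is to treat it as a record and check it for empty fields. The sketch below uses the fields Table 1 lists for Section 1; the class name, the helper, and every sample value are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class KillSwitchEntry:
    # Field names follow the Section 1 row of Table 1.
    location: str
    credential_vault_path: str
    activation_steps: str
    confirmation_signal: str
    primary_owner: str
    backup_owner: str


def missing_fields(entry):
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]


# Hypothetical entry with two gaps, as might be found in a review.
entry = KillSwitchEntry(
    location="ops console, agents tab",
    credential_vault_path="",
    activation_steps="1. open console 2. disable agent 3. confirm status",
    confirmation_signal="agent status reads 'disabled'",
    primary_owner="A. Example",
    backup_owner="",
)
print(missing_fields(entry))  # ['credential_vault_path', 'backup_owner']
```

A check like this could run in CI against the runbook source, so an unnamed backup owner is caught in review rather than during an incident.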

Section 2: Tool Boundary Documentation and Blast Radius Limits

Maps to /incident-readiness/ Control 2 (Tool boundary docs). This section documents what the agent can do, not what it should do. The distinction matters during an incident when the team needs to reconstruct what it actually did.

Required documentation fields: tool inventory, blast radius per tool, tool call log location, and scope enforcement mechanism.

sincllm-mcp v2.0.0 is sincllm's own production MCP deployment with 12 scoped tools. Each tool in that deployment has a defined scope limit and a pre-call validation gate, which means the blast radius per tool call is bounded before the tool executes, not after the incident is detected. That pre-call validation pattern, documented in the adversarial validation guide, belongs in this runbook section.

Failure mode when missing: The on-call team knows the agent misbehaved but cannot reconstruct which tools it invoked or in what sequence. The blast radius is unknown. The post-incident review takes days instead of hours.
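A pre-call validation gate of the kind described above can be sketched as follows. This is not sincllm-mcp's implementation; the tool name, the target name, the limits, and the `ScopeViolation` exception are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolScope:
    max_records_per_call: int   # upper bound on what one call may touch
    allowed_targets: frozenset  # systems this tool may act on


class ScopeViolation(Exception):
    """Raised before execution when a call would exceed its documented scope."""


# Hypothetical tool inventory with one scoped tool.
SCOPES = {
    "update_record": ToolScope(max_records_per_call=10,
                               allowed_targets=frozenset({"staging_crm"})),
}


def pre_call_gate(tool, target, record_count):
    """Validate a tool call against its scope; raise instead of executing."""
    scope = SCOPES.get(tool)
    if scope is None:
        raise ScopeViolation(f"{tool}: not in tool inventory")
    if target not in scope.allowed_targets:
        raise ScopeViolation(f"{tool}: target {target!r} is out of scope")
    if record_count > scope.max_records_per_call:
        raise ScopeViolation(f"{tool}: {record_count} records exceeds limit")
    return True


print(pre_call_gate("update_record", "staging_crm", 5))  # True
```

Because the gate runs before the tool executes, the worst case per call is the documented limit, which is the number the runbook should record as that tool's blast radius.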

Section 3: Rollback Procedure for Model and Prompt Versions

Maps to /incident-readiness/ Control 9 (Rollback). Model rollback is not code rollback. This section must be tested before an incident, not discovered during one.

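Required documentation fields: last known good model version, last known good prompt SHA, rollback command, verification signal, and last test date.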

Failure mode when missing: The team attempts a rollback during the incident and discovers the rollback procedure has never been tested. The prior model version is no longer available in the deployment system. The incident extends from 40 minutes to 4 hours.
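The verification signal and the last test date can both be checked mechanically. The sketch below assumes the prompt template is identified by a SHA-256 digest of its text and that "tested recently" means within roughly one quarter; the version label, the prompt text, and the 92-day limit are hypothetical.

```python
import hashlib
from datetime import date


def prompt_sha(template: str) -> str:
    """Identify a prompt template by the SHA-256 digest of its text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


# Hypothetical last-known-good record kept in the runbook.
LAST_GOOD = {
    "model_version": "model-v1",
    "prompt_sha": prompt_sha("You are a support agent."),
    "last_tested": date(2026, 3, 1),
}


def rollback_verified(deployed_model, deployed_prompt, last_good):
    """True only if both the model and the prompt match the last good record."""
    return (deployed_model == last_good["model_version"]
            and prompt_sha(deployed_prompt) == last_good["prompt_sha"])


def drill_overdue(last_tested, today, max_age_days=92):
    """True if the rollback has not been exercised within max_age_days."""
    return (today - last_tested).days > max_age_days


print(rollback_verified("model-v1", "You are a support agent.", LAST_GOOD))  # True
print(rollback_verified("model-v1", "You are a sales agent.", LAST_GOOD))    # False
print(drill_overdue(LAST_GOOD["last_tested"], date(2026, 6, 21)))            # True
```

Checking the model and the prompt together matters because restoring one without the other leaves the system in a combination that was never the known good state.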

Section 4: Escalation Path and Communication Protocol

Maps to /incident-readiness/ Control 8 (Eval coverage, escalation sub-item) and to criterion 8 (On-call and incident response) of the 10-Point AI Vendor Audit. Escalation paths that name roles, not individuals, break at 3 AM when the role is unoccupied.

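Required documentation fields: severity thresholds, named escalation contacts with out-of-band methods, SLO breach trigger, and customer communication template.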

Failure mode when missing: The on-call engineer spends 25 minutes deciding who to wake up. The customer communication is improvised. The incident summary written the next morning reconstructs events from memory rather than from a timestamped record.
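Severity thresholds become executable once they are written down as data. The sketch below assumes severity is driven by a single failure-rate metric; the metric, the cut-offs, and the contact details are placeholders, not recommendations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    severity: str
    min_failure_rate: float  # fraction of requests failing (assumed metric)
    contact: str             # a named person, not a role
    out_of_band: str         # reachable even if chat or email is down


# Highest severity first so the first matching tier wins.
TIERS = [
    Tier("SEV1", 0.20, "Primary Person (hypothetical)", "phone: <number>"),
    Tier("SEV2", 0.05, "Backup Person (hypothetical)", "phone: <number>"),
]


def tier_for(failure_rate: float) -> Optional[Tier]:
    """Return the first tier whose threshold is met, or None if below all."""
    for tier in TIERS:
        if failure_rate >= tier.min_failure_rate:
            return tier
    return None


print(tier_for(0.25).severity)  # SEV1
print(tier_for(0.07).severity)  # SEV2
print(tier_for(0.01))           # None
```

With the table in place, the on-call engineer reads a number and gets a name, rather than deciding at 3 AM whether the situation is serious enough to wake someone.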

Section 5: Audit-Trail Check and Evidence Preservation

Maps to /incident-readiness/ Control 3 (Audit-trail completeness). Evidence preservation is the first action in most AI incidents, not the last. A rollback that erases the audit trail makes the post-incident review impossible.

Required documentation fields: log location, retention window, evidence preservation command, and state snapshot procedure.

Failure mode when missing: The on-call team executes the rollback without exporting logs. The rollback succeeds. The incident is "resolved." The post-incident review three days later discovers that the audit trail for the incident window was overwritten by the rollback. The root cause cannot be determined.
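The ordering constraint (preserve first, roll back second) can be enforced in code rather than left to memory. This is a minimal sketch assuming logs are plain `*.log` files in a directory; the paths, the file name, and the `guarded_rollback` wrapper are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def preserve_evidence(log_dir: Path, evidence_dir: Path) -> dict:
    """Copy log files aside and return a {filename: sha256} manifest."""
    evidence_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for src in sorted(log_dir.glob("*.log")):
        dst = evidence_dir / src.name
        shutil.copy2(src, dst)  # copy2 also preserves file timestamps
        manifest[src.name] = hashlib.sha256(dst.read_bytes()).hexdigest()
    return manifest


def guarded_rollback(manifest: dict, do_rollback):
    """Refuse to roll back until at least one log file has been preserved."""
    if not manifest:
        raise RuntimeError("no evidence preserved; refusing to roll back")
    return do_rollback()


# Demonstration in a temporary directory with one hypothetical log file.
with tempfile.TemporaryDirectory() as tmp:
    logs = Path(tmp) / "logs"
    logs.mkdir()
    (logs / "tool_calls.log").write_text("02:50 update_record\n")
    manifest = preserve_evidence(logs, Path(tmp) / "evidence")
    result = guarded_rollback(manifest, lambda: "rolled back")

print(sorted(manifest), result)  # ['tool_calls.log'] rolled back
```

The manifest of hashes gives the post-incident review a way to confirm the exported logs were not altered after the export.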

Section 6: Post-Incident Review and Eval Coverage Check

Maps to /incident-readiness/ Control 8 (Eval coverage). The post-incident review is not a retrospective; it is a production control update. Every AI incident that is not captured in an eval case will recur.

Required documentation fields: review template (48h), eval coverage trigger, eval case owner, and eval coverage report location.

Failure mode when missing: The incident is resolved. The postmortem is written. The same failure mode recurs 6 weeks later because no eval case was added and no runbook update was made.
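The link between incidents and eval cases can be tracked with a simple coverage check. The sketch below assumes each eval case carries the ID of the incident that motivated it; the `EvalCase` shape, the incident IDs, and the substring-based pass rule are deliberately simplistic placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    incident_id: str       # the incident this case was written for
    prompt: str            # input that reproduces the failure condition
    must_not_contain: str  # simplistic pass rule, for illustration only


def passes(case: EvalCase, model_output: str) -> bool:
    """A case passes when the forbidden string is absent from the output."""
    return case.must_not_contain not in model_output


def uncovered(incident_ids, cases):
    """Return incidents that have no eval case referencing them."""
    covered = {c.incident_id for c in cases}
    return [i for i in incident_ids if i not in covered]


cases = [EvalCase("INC-001", "Summarize this ticket.", "DROP TABLE")]

print(uncovered(["INC-001", "INC-002"], cases))             # ['INC-002']
print(passes(cases[0], "The customer asks for a refund."))  # True
```

An empty `uncovered` list is a reasonable exit criterion for closing a post-incident review.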

// Free · 12-Control Audit

Can your AI system survive a 3 AM incident?

The 12-Control AI Incident Readiness Audit covers kill-switch, tool boundary docs, audit-trail completeness, sandbox separation, prompt-injection defenses, and rollback. Free PDF, verified against production engineering practice.

→ Get the 12-Control Incident Readiness Audit

The six sections above are your runbook template. The full reference tables follow.

Table 1: 6-Section AI Incident Runbook Template

Section | Section Name | Incident Readiness Control | Required Documentation Fields | Failure Mode When Missing
1 | Kill-Switch Location and Activation Procedure | Control 1: Kill-switch | Location, credential vault path, activation steps, confirmation signal, primary owner (named), backup owner (named) | Team improvises shutdown via undocumented method; audit trail breaks; kill-switch time unknown
2 | Tool Boundary Documentation and Blast Radius Limits | Control 2: Tool boundary docs | Tool inventory, blast radius per tool, tool call log location, scope enforcement mechanism | Post-incident review cannot reconstruct what the agent did; blast radius unknown
3 | Rollback Procedure for Model and Prompt Versions | Control 9: Rollback | Last known good model version, last known good prompt SHA, rollback command, verification signal, last test date | Rollback attempted during incident for the first time; prior version unavailable; incident extends 4 hours
4 | Escalation Path and Communication Protocol | Control 8: Eval coverage (escalation sub-item) | Severity thresholds, named escalation contacts with out-of-band methods, SLO breach trigger, customer communication template | 25 minutes spent deciding who to wake up; customer communication improvised; timeline reconstructed from memory
5 | Audit-Trail Check and Evidence Preservation | Control 3: Audit-trail completeness | Log location, retention window, evidence preservation command, state snapshot procedure | Rollback executed before logs exported; audit trail overwritten; root cause undetermined
6 | Post-Incident Review and Eval Coverage Check | Control 8: Eval coverage | Review template (48h), eval coverage trigger, eval case owner, eval coverage report location | No eval case added; same failure mode recurs in 6 weeks

Table 2: AI Incident Type to Runbook Section Mapping

Incident Type | Primary Runbook Section | Secondary Section | First Action
Model behavior drift | Section 3: Rollback | Section 6: Post-incident review | Compare current output sample to last known good version output; confirm drift is measurable before rolling back
Prompt injection event | Section 1: Kill-switch | Section 5: Audit trail | Preserve audit trail before any remediation; activate kill-switch; log activation time to the second
Tool-call blast radius | Section 2: Tool boundary | Section 5: Audit trail | Pull tool call log for the incident window; enumerate every tool invocation before estimating impact
Output quality collapse | Section 3: Rollback | Section 4: Escalation | Confirm quality collapse is model-side (not data-side) before triggering rollback; page escalation contact simultaneously
Vendor API outage | Section 4: Escalation | Section 3: Rollback (to fallback model) | Verify outage via vendor status page; page escalation contact; execute fallback model procedure if SLO window is at risk
Cost spike (unexpected inference volume) | Section 1: Kill-switch | Section 4: Escalation | Activate kill-switch on the offending pipeline; page escalation contact with cost delta and affected pipeline name
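The mapping in Table 2 is small enough to encode directly, for example in a chat-ops helper or the paging tool. The sketch below copies the primary and secondary sections from the table; the dictionary keys and the `sections_for` helper are hypothetical names.

```python
# Incident type -> (primary section, secondary section), copied from Table 2.
RUNBOOK_MAP = {
    "model_behavior_drift": ("Section 3: Rollback", "Section 6: Post-incident review"),
    "prompt_injection_event": ("Section 1: Kill-switch", "Section 5: Audit trail"),
    "tool_call_blast_radius": ("Section 2: Tool boundary", "Section 5: Audit trail"),
    "output_quality_collapse": ("Section 3: Rollback", "Section 4: Escalation"),
    "vendor_api_outage": ("Section 4: Escalation", "Section 3: Rollback (to fallback model)"),
    "cost_spike": ("Section 1: Kill-switch", "Section 4: Escalation"),
}


def sections_for(incident_type):
    """Return (primary, secondary) for a known incident type, else None."""
    return RUNBOOK_MAP.get(incident_type)


primary, secondary = sections_for("prompt_injection_event")
print(primary)                        # Section 1: Kill-switch
print(secondary)                      # Section 5: Audit trail
print(sections_for("unknown_thing"))  # None
```

How to handle an unrecognized incident type is left to the caller here; the table does not define a default.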

The 55-Hour Problem: What Unstructured Incident Response Really Costs

The inverse of an efficient AI operation is instructive. In one engagement, sincllm's production work recovered 55 hours per month by systematically documenting and operationalizing AI pipeline controls. That figure comes from a single engagement on sr-demo-ai.com and is not a guarantee of outcomes for any other team or system.

The inverse framing applies directly to incident response: teams that have no usable runbook can spend comparable time improvising a response and rebuilding state after incidents. That cost is not just the duration of the incident itself. It compounds across three dimensions.

First, every improvised response produces undocumented tribal knowledge. The engineer who found the manual API call at 3 AM now carries that knowledge in their head. When they leave the team, it leaves with them.

Second, every incident without a post-incident eval coverage update leaves a gap in the test suite. The failure mode that caused the incident is now a known unknown that no automated check will catch. The next occurrence of the same failure mode will look like a new incident.

Third, every blast radius that cannot be reconstructed requires manual review of downstream systems that may have acted on the agent's anomalous output. That review is labor, and it is unbounded: the team does not know when they have reviewed enough until they have confirmed every system the agent touched.

The AI system stability framing explains why a runbook must define escalation thresholds, not just escalation paths, and it applies directly here: a threshold that is documented before the incident is executed in seconds; a threshold that must be derived during the incident requires deliberation under pressure, which takes minutes the SLO window may not have.

Six Things to Test in Your Runbook Before You Need It

A runbook that has never been tested is a plan, not a procedure. Run these six drills on a quarterly schedule. Each has a pass condition.

The functional safety engineering standard from which production AI runbooks borrow their fault-containment discipline treats testing procedures as part of the safety case, not as optional verification. A kill-switch that has never been tested under controlled conditions is not a safety control; it is a hypothesis.

What Your Vendor's Runbook Should Cover (and How to Find Out If It Exists)

Your runbook is only as effective as your vendor's incident response documentation. If your team cannot independently activate a kill-switch on your AI vendor's system, the kill-switch section of your runbook is incomplete. If your vendor does not provide tool call logs for your agent's production activity, the audit trail section of your runbook cannot be executed.

The question "does our vendor have a runbook?" is answered by asking the vendor directly for their on-call and incident response documentation. A vendor with a production-grade AI system can provide the following without a multi-week procurement request:

A vendor who cannot answer these questions in writing does not have an operational incident response plan. Their SOC 2 certification documents the existence of controls. It does not prove those controls are executable by your on-call team at 3 AM.

Criterion 8 (On-call and incident response) of the 10-Point AI Vendor Audit translates these questions into a repeatable evaluation checklist. Use it before you sign, not after the first incident.

// Free · 10-Point Audit

Know what you are buying before you sign.

The 10-Point AI Vendor Audit translates these questions into a repeatable production-engineering checklist: source-code ownership, audit trail, SLOs, fallback paths, and exit clause. Free 16-page PDF, 15 minutes per vendor.

→ Get the 10-Point AI Vendor Audit

Conclusion

The six sections above are not a compliance framework. They are the minimum documentation your on-call team needs to execute a response procedure rather than improvise one. A runbook that covers kill-switch location, tool boundary limits, rollback procedure, escalation path, audit trail preservation, and post-incident eval coverage is a runbook the team can reach for at 3 AM without a Slack thread and a vacation-mode auto-reply.

The NIST AI Risk Management Framework (AI RMF 1.0) addresses incident response and recovery under its MANAGE function (its four functions are GOVERN, MAP, MEASURE, and MANAGE; RESPOND and RECOVER belong to the NIST Cybersecurity Framework), and incident runbooks are one way to meet those documentation expectations. The EU AI Act (Regulation (EU) 2024/1689) places obligations on providers of high-risk AI systems to maintain technical documentation and post-market monitoring plans, which runbooks help operationalize. The six sections above are where those documentation obligations become executable procedures.

The 12 controls from the AI Incident Readiness Audit go further than the six runbook sections. Controls 4 through 7 and 10 through 12 cover sandbox separation, secret access scope, prompt-injection defenses, pre-tool-call gate, production data isolation, vendor breach exposure, and failure-mode visibility. These are the controls that must exist in your production system before your runbook can invoke them. The free AI incident readiness checklist is the fastest way to find out which ones are missing.
