How to Write an AI Incident Runbook Your On-Call Team Will Actually Use

By Mario Alexandre June 21, 2026 sinc-LLM AI Incident Readiness

Why Most AI Runbooks Fail at 3 AM
What Makes an AI Incident Different From a Standard Software Incident
The 6-Section AI Incident Runbook Template
The 55-Hour Problem: What Unstructured Incident Response Really Costs
Six Things to Test in Your Runbook Before You Need It
What Your Vendor's Runbook Should Cover
Conclusion

Why Most AI Runbooks Fail at 3 AM

There is a gap between a compliance document and an operational playbook. A compliance document proves you thought about incident response. An operational playbook tells the engineer who was just paged at 2:47 AM exactly what to do in the next 90 seconds.

Most AI runbooks are compliance documents. They satisfy an audit requirement, live in Confluence, and are written in the past tense of planning: "The team will engage the kill-switch process and coordinate with the vendor." That sentence is useless to an on-call engineer who needs to know the kill-switch location, the access credential, the activation command, and the confirmation signal, in that order, right now.

When the runbook does not answer those questions, the team improvises. Improvisation in an AI incident is expensive in a way that a standard software incident is not, because the failure mode is often invisible until the blast radius is already set. Consider a structural scenario: a tool-calling agent begins producing anomalous outputs at 2:47 AM. The on-call engineer checks the runbook. The kill-switch section reads "contact the platform team." The platform lead is on vacation. The backup is not named. The engineer finds a manual API call in an old Slack thread, uses it to disable the agent, and the audit trail is broken. Forty minutes later, the post-incident review cannot reconstruct what the agent did between 2:47 AM and 3:28 AM because the log preservation step was also not documented.

This article gives you the six sections that separate a runbook your team will reach for from one that proves you thought about it.

Before going further: use the free stability auditor tool to confirm your pipeline's failure modes are observable before you document them in a runbook. A runbook section for a failure mode you cannot detect is not a safety net; it is a false one.

Does your current runbook cover all 12 production controls? The AI Incident Readiness Audit shows you exactly which ones are missing.

Download the 12-Control AI Incident Readiness Audit

What Makes an AI Incident Different From a Standard Software Incident

AI incidents have four properties that standard software incidents do not. Each one requires a runbook section that a generic SRE template does not include.

Silent degradation. A database going offline throws an exception. An AI model serving degraded outputs does not. Output quality collapse, model behavior drift, and prompt injection events all produce semantically wrong responses that look structurally correct to monitoring systems watching for HTTP 500s. The on-call team is often paged by a downstream signal (customer complaints, SLO breach on a business metric) long after the failure began. A runbook that starts at "service is down" is missing the first 40 minutes of an AI incident.

Blast radius ambiguity. OWASP LLM Top 10 (2025) identifies Excessive Agency (LLM06) as the failure mode where a model or agent takes actions beyond intended scope. In a tool-calling agent, the blast radius is set by the time between the first anomalous action and the kill-switch activation, not by the time between the page and the fix. Without documented tool boundary limits and a defined blast radius per tool, the post-incident review cannot answer "what did it do?"

Rollback complexity. Rolling back a code deployment reverses a deterministic change. Rolling back a model version does not: the prompt may have changed, the system context may have changed, and the new model may have introduced behavior changes that were not present in the prior version but are also not present in the current one. The rollback section of an AI runbook must specify the last known good version of both the model and the prompt template, the rollback command, and the verification signal that confirms the rollback succeeded.

Prompt injection as an incident class. As detailed in the adversarial validation patterns guide, behavior change triggered by user input is not a bug in the traditional sense: it is an incident class with its own response procedure. The runbook must include a pre-call validation step and a documented response for injection events that is separate from the general "service degradation" response.

The 6-Section AI Incident Runbook Template

The six sections below map directly to named controls from the 12-Control AI Incident Readiness Audit. The mapping matters: the runbook is the interface your on-call team uses; the audit is the source of truth for whether the controls the runbook invokes actually exist in your production system. A runbook section that points to a control that is not implemented is a false safety net.

Section 1: Kill-Switch Location and Activation Procedure

Maps to /incident-readiness/ Control 1 (Kill-switch). This section exists in the runbook rather than a separate doc because the on-call engineer needs it in the first 60 seconds of an incident, before they have time to look for it.