Blameless Postmortems That Actually Change Behavior

January 11, 2026 · By The CTO · 5 min read

Most CTOs I talk to don’t have a “postmortem problem.” They have a behavior change problem. The doc gets written, the meeting happens, everyone agrees it was “a great discussion,” and then… the same class of incident shows up again 6–10 weeks later. The team isn’t malicious or lazy. They’re rational. If the system rewards shipping and punishes delay, the organization will treat postmortems as theater unless you wire learning into the way work gets planned, reviewed, and promoted.

A blameless postmortem is supposed to do two things at once: (1) keep psychological safety high enough that people tell the truth, and (2) produce changes that reduce risk. The first part is where most teams stop. The second part is where leadership has to get opinionated. The best framing I’ve found comes from the safety world: incidents rarely have a single “root cause”; they’re a chain of contributing factors and normal workarounds that made sense locally. That’s the heart of the “blameless” idea described in John Allspaw’s “Blameless Postmortems” and the broader systems view in Sidney Dekker’s work on the “New View” of human error.

If you want behavior to change, you have to stop treating the postmortem as a narrative and start treating it as an internal product with users: engineers, on-call, product, security, and leadership. The “feature” is fewer repeats; the “UX” is that the next engineer can actually apply what you learned.

Here’s what separates postmortems that change behavior from the ones that don’t: they produce “default changes,” not “reminders.” “Be more careful with migrations” is a reminder. “All migrations require an automated preflight check and a staged rollout template” is a default change. When Google talks about SRE, they’re explicit that toil and recurring failure modes should be engineered away, not managed through heroics (Google SRE book). In practice, that means your action items should bias toward mechanisms: guardrails in CI, safer deployment patterns, better alert routing, runbooks that match reality, and SLOs that force prioritization. A concrete example: if you had an outage because a config change bypassed review, the behavior-changing fix isn’t “tell people to be careful,” it’s “config changes go through the same PR path as code, with required reviewers and automated validation.” If you had an incident because a dependency degraded and you didn’t notice, the fix isn’t “monitor more,” it’s “define an SLO for the dependency boundary and page on error budget burn,” then make that visible to product so the trade-off is explicit.
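To make “default change, not reminder” concrete, here is a minimal sketch of the kind of automated migration preflight check described above: a CI step that fails the pipeline when a migration contains patterns that have caused incidents before. The pattern list, messages, and file layout are hypothetical, not a real tool; the point is that the lesson lives in a guardrail, not in anyone’s memory.

```python
import re

# Hypothetical CI preflight check for SQL migrations. Each entry pairs a
# risky pattern with the reason it is blocked; the list grows out of
# postmortems, so the learning is encoded as a default, not a reminder.
RISKY_PATTERNS = [
    (r"\bDROP\s+TABLE\b",
     "dropping a table requires a staged rollout plan"),
    (r"\bUPDATE\b(?!.*\bWHERE\b)",
     "unbounded UPDATE: add a WHERE clause or batch it"),
    (r"\bALTER\s+TABLE\b.*\bNOT\s+NULL\b(?!.*\bDEFAULT\b)",
     "adding NOT NULL without a DEFAULT can lock or rewrite the table"),
]

def preflight(migration_sql: str) -> list[str]:
    """Return human-readable violations; an empty list means safe to merge."""
    violations = []
    for pattern, reason in RISKY_PATTERNS:
        if re.search(pattern, migration_sql, re.IGNORECASE | re.DOTALL):
            violations.append(reason)
    return violations
```

Wired into the same required-check path as code review, this is exactly the “config changes go through the PR path with automated validation” shape: the safest path becomes the default path.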

The leadership side is where this usually breaks. Behavior doesn’t change when action items are “owned by everyone,” when they aren’t scheduled, or when they compete with roadmap work without air cover. I like to run postmortems with a simple constraint: every action item must have an owner, a due date, and a verification method. Verification is the missing piece—“add dashboard” is not verifiable; “reduce MTTR for this failure mode from 45 minutes to under 15 minutes by automating rollback and validating via game day” is. Netflix’s culture around resilience is a good mental model here: they didn’t get value from chaos engineering because they wrote thoughtful docs; they got value because they turned failure into a routine exercise that exposed gaps and forced engineering changes (Netflix TechBlog on Chaos Engineering). If you want postmortems to change behavior, you need a similar loop: learn → change defaults → rehearse → measure.
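The owner/due-date/verification constraint is easy to enforce mechanically. A minimal sketch, with hypothetical field names and heuristics, of a schema that rejects action items shaped like reminders:

```python
from dataclasses import dataclass
from datetime import date

# Phrases that usually signal a reminder rather than a mechanism.
# Illustrative list; tune it to your own postmortem history.
VAGUE_VERBS = ("be careful", "remember to", "try to", "consider")

@dataclass
class ActionItem:
    title: str
    owner: str          # a single named person, never "everyone"
    due: date
    verification: str   # how we will prove it worked, e.g. a game day

    def problems(self) -> list[str]:
        """Return reasons this item would be rejected in review."""
        issues = []
        if not self.owner or self.owner.lower() in ("everyone", "team", "tbd"):
            issues.append("needs a single named owner")
        if self.due <= date.today():
            issues.append("due date must be in the future")
        if any(v in self.title.lower() for v in VAGUE_VERBS):
            issues.append("reads like a reminder, not a mechanism")
        if not self.verification.strip():
            issues.append("missing a verification method")
        return issues
```

Running every proposed action item through a check like this in the postmortem review forces the conversation the paragraph above describes: who, by when, and how we will know it worked.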

Actionability comes down to running postmortems like you run product delivery. First, create a lightweight operational system of record—this is where a tool like Incident Postmortem paired with Command Center helps: one place to track incidents, link contributing services, record decisions, and—crucially—track action items to completion with visibility for leadership. Second, adopt a small set of metrics that make repeat failures embarrassing in the right way: repeat-incident rate (same failure mode within 90 days), action-item completion rate (done by due date), and MTTR trend by incident class. Third, enforce a “mechanisms over reminders” rubric in the review: if an action item doesn’t change a default (automation, guardrail, policy, ownership boundary, or SLO), it needs a strong justification. Fourth, reserve capacity: I’ve seen this work with a standing 10–20% reliability budget per team, or a rotating “stability sprint” every 6–8 weeks. If you don’t pre-allocate capacity, you’re betting on willpower, and willpower loses to Q4.

The broader trend is that systems are getting more distributed, more vendor-dependent, and more AI-assisted—meaning the number of plausible failure paths is exploding. That pushes us away from “find the one bug” and toward “design the organization to learn faster than the system changes.” Blameless postmortems are one of the few rituals that touch architecture, operations, and culture at the same time. But they only pay off when you treat them as a governance mechanism: they shape what gets funded, what gets automated, what gets standardized, and what gets taught to new engineers. If your postmortems aren’t changing behavior, don’t ask your teams to write better docs. Ask what your org is currently optimized for—and then rewire the incentives so the safest path is also the easiest path.
