#incident-response #engineering #sre

What 500 errors teach you about incident communication

Silence is the worst outage response. Here's the 5-minute rule, the severity vs impact split, and how to write updates that don't read like a legal document.

Andrew Leonenko · 9 min read

Most outages aren't actually that bad. Something breaks, a smart person figures it out, the fix goes in, life continues. What turns an outage into a trust crisis isn't the outage itself. It's how you talk about it while it's happening.

I've spent a long time reading postmortems for fun, and a pattern shows up every time. The companies that come out of a bad day looking better are not the ones with the shortest downtime. They're the ones whose updates were clear, frequent, and honest. That's it. The ones who come out looking worse are the ones who went quiet.

So here's a small pile of rules I've internalized after watching this play out for years, plus the famous incidents that taught them to me.

Rule 1: silence is the worst response

Imagine you're a customer. Your app is broken. You refresh the vendor's status page. It says "All systems operational" in green. What do you think?

Options:

  1. "Must be on my end."
  2. "Is the status page broken too?"
  3. "These people are lying to me."

None of those are good for the vendor. All three happen within 60 seconds of staring at a status page that doesn't match reality.

The single best thing you can do when something's broken is post something immediately, even if you don't know what's wrong. "We're seeing elevated error rates on the API, investigating" is enough. It turns the customer's brain from "am I going crazy" into "ok, they know, I can get back to my day". That's a huge emotional delta for one sentence of writing.

The textbook example of getting this right was GitLab in 2017, when an engineer accidentally ran rm -rf on the production database and they lost about six hours of data. It was as bad as an outage can get. But GitLab live-streamed the recovery, published a real-time Google Doc of what they were doing, and posted continuous updates on their status page. They came out of it more trusted than they went in. Users praised them. The lesson wasn't "don't delete your database", it was "communicate like crazy when things are bad".

Rule 2: the 5-minute rule

This one's simple. No matter what is happening, you post an update every 5 minutes during an active sev-1, even if the update is "still investigating, no new info".

Five minutes feels too fast. It isn't. When someone's app is down, five minutes of silence feels like an hour. You are not writing for the person who checks the page once. You're writing for the person hitting F5 every 30 seconds, increasingly angry.

The trick is that "still investigating" is a perfectly good update. You don't need new information to post. You just need to reaffirm that a human is still on the problem. Silence implies abandonment. A boring update implies effort.

For sev-2 and below you can stretch it. Every 15 minutes is fine. For sev-3 you can probably get away with updating once every hour. But for anything customer-facing that's actually broken, the cadence has to be tight.
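If you want the cadence to be more than a habit, it's easy to encode. Here's a minimal sketch in Python; the sev-1/2/3 labels and the intervals just mirror the rule of thumb above, and nothing here is tied to any particular tool.

  from datetime import datetime, timedelta, timezone

  # Update cadence per severity, mirroring the rule of thumb above.
  UPDATE_CADENCE = {
      "sev-1": timedelta(minutes=5),
      "sev-2": timedelta(minutes=15),
      "sev-3": timedelta(minutes=60),
  }

  def next_update_due(severity: str, last_update_at: datetime) -> datetime:
      """When the next public update is owed, counted from the last one posted."""
      return last_update_at + UPDATE_CADENCE[severity]

  def update_overdue(severity: str, last_update_at: datetime) -> bool:
      """True if you've been quiet longer than the cadence allows.

      Assumes timezone-aware datetimes throughout.
      """
      return datetime.now(timezone.utc) >= next_update_due(severity, last_update_at)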

Cloudflare's June 2022 BGP incident is another good example of cadence. Their dashboard and many dependent sites were down for about 90 minutes. Cloudflare posted updates every few minutes the entire time. Clear, technical, with rolling ETAs. People complained about the outage. Nobody complained about the communication. That's the goal.

Rule 3: separate severity from impact

This is the one most teams get wrong.

Severity is how bad the thing is for your infrastructure. Sev-1, sev-2, sev-3. It's your internal classification.

Impact is how many customers notice, and in what way. That's a completely different question.

Example: you lose a read replica. Sev-2 internally, on-call is paged, the team is scrambling to fail over. But customers are totally fine because writes are still going to the primary and reads have fallen back to it. Impact: zero. You shouldn't be posting a customer-facing update for this. Send it to an internal channel. Your status page is for customer impact, not for internal drama.

Another example: your background email worker is backed up by 30 minutes. Sev-3 internally, barely worth a page. But every customer who signs up in that window is getting a delayed verification email and some of them are bouncing off your funnel. Impact: high. You should absolutely be posting a status page update about it, maybe even emailing affected users directly.

The lesson is that your status page audience doesn't care about your severity system. They care about whether their specific use of your product is currently working. Write for them.
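One way to keep the two ideas from blurring together is to make them separate fields and route updates on impact alone. A minimal sketch, with made-up field names; this isn't any real product's API:

  from dataclasses import dataclass

  @dataclass
  class Incident:
      severity: str         # internal classification: "sev-1", "sev-2", "sev-3"
      customer_impact: str  # what customers actually see: "none", "degraded", "broken"

  def update_destination(incident: Incident) -> str:
      """Route the update by impact, not severity: the status page is for customers."""
      if incident.customer_impact == "none":
          # e.g. the lost read replica: sev-2 internally, zero customer impact
          return "internal incident channel"
      # e.g. the email backlog: sev-3 internally, high customer impact
      return "public status page"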

The AWS S3 outage in February 2017 (the one where a typo took down a chunk of the internet for about four hours) was a case where severity and impact diverged wildly inside AWS. Internally it was a handful of services in one region. For the outside world it broke Slack, Trello, Docker Hub, Quora, and a thousand smaller services that had hardcoded us-east-1. AWS's status dashboard famously couldn't even update because its own status icons were served from S3. The status page was green while half the internet was red. This is the cautionary tale. If your status page depends on the thing that's broken, you don't have a status page, you have a monument.

Rule 4: write like a human, not a lawyer

Think back to the worst incident updates you've ever seen. I'll bet most of them sounded like this:

"We are currently experiencing an issue that may be impacting some users. Our engineering team is engaged and working to resolve the issue. We appreciate your patience and will provide further updates as they become available."

This is legal-coded hedge copy. "May be impacting". "Some users". "Engaged". Every phrase is designed to reduce corporate liability. It does so by saying nothing. Customers read this and correctly conclude that you're more worried about lawyers than about them.

Compare with the same update written like a human:

"Checkout is broken for customers in Europe. We think it's a database failover issue. Engineering is on it, next update in 5 minutes."

Specific. Honest. Says who's affected, what's broken, what's being done, when to expect more. It's shorter and better in every way. The only thing it's not doing is covering your butt in a hypothetical future lawsuit, and if you actually end up in that lawsuit, vague corporate hedging is not going to save you anyway.

Some specific rules that help (there's a small template sketch after the list):

  • Name the affected component. "Checkout", "the API", "mobile push notifications". Don't say "some services".
  • Name the affected audience when you can. "EU customers", "users on Safari", "anyone who signed up after 3pm". Vague "some users" makes it sound like you have no idea.
  • Give a next-update time. "More in 5 minutes" is a promise. If you don't post again in 5 minutes, you broke the promise. That pressure is the point: it forces you to keep writing.
  • Don't apologize in every update. One apology at the top is fine. Repeating "we're sorry for the inconvenience" in every five-minute update dilutes the word. Save sincerity for the postmortem.
  • Don't speculate about root cause before you have it. "We believe the issue is related to a bad deploy" is fine if you just rolled back. "We believe the issue is related to AWS" is not fine if you have no evidence. You'll be wrong in front of thousands of people.
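You can bake most of those rules into a template, so the fields you're tempted to skip are exactly the ones the template demands. A small sketch, with hypothetical field names:

  from datetime import datetime, timedelta, timezone

  def format_update(component: str, audience: str, state: str,
                    action: str, next_update_minutes: int = 5) -> str:
      """Forces a named component, a named audience, and a next-update time."""
      next_at = datetime.now(timezone.utc) + timedelta(minutes=next_update_minutes)
      return (
          f"{component} is {state} for {audience}. "
          f"{action} "
          f"Next update by {next_at:%H:%M} UTC."
      )

  # format_update("Checkout", "customers in Europe", "broken",
  #               "We think it's a database failover issue; engineering is on it.")
  # -> "Checkout is broken for customers in Europe. We think it's a database
  #    failover issue; engineering is on it. Next update by 14:35 UTC."
  #    (the timestamp depends on when you run it)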

Rule 5: the postmortem is part of the incident

An outage isn't over when the fix goes in. It's over when you publish the postmortem. Until then, it's sitting in customers' memories as "that thing that broke", without any resolution.

A good postmortem has four parts:

  1. What happened, in plain language, on a timeline.
  2. Why it happened, including the real root cause and any contributing factors. Blameless, but specific.
  3. What we did about it, short-term (the fix) and long-term (the prevention work).
  4. What we're changing so it doesn't happen again.

That last item is where most teams bail out with vague "we're investing in reliability" handwaving. Don't. Be specific. "We're adding canary deploys for the payment service by end of month." "We're writing a runbook for the BGP failover." "We're deleting the cleanup.sh script that caused this because nobody should ever run it again." Concrete commitments rebuild trust.

GitLab's 2017 postmortem is still the gold standard here. They published a full timeline, walked through the exact commands that were run, explained the five layers of backup that had all silently failed, and committed to specific changes. It's uncomfortable to read even years later. That's the point. Trust is built in specificity.

Rule 6: default to public

The temptation during a bad incident is to go dark: shrink the update radius, notify only enterprise customers, soft-pedal the public status page. This instinct is always wrong.

Customers will find out. Twitter exists. Downdetector exists. If you try to keep a bad outage quiet, you get two bad outcomes instead of one: the outage itself, plus the story about how you tried to cover it up. Pick one.

The only exception is security incidents where public disclosure genuinely endangers users, which is a narrow category and you need a lawyer in the room for it. For everything else: post publicly, post fast, post often.

Where the tooling helps

This is all process, not product. But product can nudge you in the right direction or out of it. A few things we built into LiveStatus to enforce this stuff:

  • Next-update reminders. Set a 5-minute update cadence when you open an incident; the app reminds you if you miss it.
  • Templates for common incidents so you can post "investigating" in under 10 seconds instead of staring at a blank box.
  • Subscriber channel choice, so people who want email get email, people who want SMS get SMS, and people who installed the app get push. You don't have to pick one and abandon the others.
  • A public URL for every incident so you can link it from Twitter, Slack, support tickets, everywhere. Permanent link, won't move.

None of this replaces the discipline of just writing the damn update. But shaving seconds off "time from realizing there's a problem to telling customers about it" is worth a lot during a bad day.

The short version

If you only remember one thing from this: silence is the worst response. Post something within 60 seconds of knowing, even if it's just "we see it, we're on it". Then stay loud until it's fixed. The worst outages become forgettable when communication is good. The best infrastructure becomes legendary in the bad way when communication is bad.

Make sure your status page isn't the thing that broke.
