Bullet points from video
- There are no real financial or criminal consequences for outages like this
- The stock market may or may not reflect such outages, but that’s about it
- Mental note to check out the aftermath of other big outages in the past.
- How are we defining big?
Mitigation
Such risks can be handled at the software, deployment, and testing levels.
- Canary deployments
- Staged rollouts
- Rollback strategy (a combined sketch of these three points is after this list)
- Run high-risk operations in a separate process, so that a failure there doesn’t take down your whole program (sketched below)
- Related reading for the point above: the Anti-corruption layer integration pattern.
- Sanity-check files before you load them into your program; how this is done is very context-dependent (one example is sketched below).
- Write integration tests that force errors, and make sure they’re handled correctly (example below)
- On my reading list: Chaos Engineering
- Always assume that any part of your system can fail, and engineer the rest of the system to gracefully catch and handle that failure when it happens
- Forward logs to a central location, and monitor them
- Make sure your code never fails silently. Log anything unexpected that happens on the critical path, then pay attention to it. Anonymise customer data, of course. (Both logging points are sketched below.)
- As mentioned by Gergely, dog-fooding + manual QA
- Quantify the impact of your company’s product crashing irrecoverably for a couple of hours. This thought exercise can help identify how much you have to invest in your outage mitigation strategy (a back-of-envelope example is below).
- Treat config changes the same way you treat code changes (a CI-style config check is sketched below).
- Related reading: Infrastructure as Code
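
Below are a few rough Python sketches of the points above; every name, threshold, and file layout in them is made up for illustration.

First, a minimal sketch of the canary / staged-rollout / rollback idea: a small but growing share of traffic goes to the new version, and if its error rate crosses a (made-up) threshold, traffic rolls back to the old version. In practice this logic usually lives in your deployment platform or load balancer rather than in application code.

```python
import random

STAGES = [1, 5, 25, 50, 100]      # percentage of traffic on the new version
ERROR_RATE_THRESHOLD = 0.02       # made-up rollback trigger


class CanaryRouter:
    def __init__(self) -> None:
        self.stage = 0            # start with 1% canary traffic
        self.rolled_back = False
        self.new_requests = 0
        self.new_errors = 0

    def use_new_version(self) -> bool:
        """Decide per request whether it goes to the new version."""
        if self.rolled_back:
            return False
        return random.uniform(0, 100) < STAGES[self.stage]

    def record_result(self, ok: bool) -> None:
        """Track outcomes of requests served by the new version."""
        self.new_requests += 1
        if not ok:
            self.new_errors += 1

    def evaluate(self) -> str:
        """Call periodically: promote to the next stage or roll back."""
        if self.new_requests < 100:
            return "wait"                   # not enough data yet
        if self.new_errors / self.new_requests > ERROR_RATE_THRESHOLD:
            self.rolled_back = True         # rollback: all traffic to the old version
            return "rollback"
        if self.stage < len(STAGES) - 1:
            self.stage += 1                 # staged rollout: widen the canary
            self.new_requests = self.new_errors = 0
        return "promote"
```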
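
A sketch of running a high-risk operation in its own process with `multiprocessing`: `parse_untrusted_file` is a hypothetical risky function, and even a hard crash or hang inside it only kills (or times out) the child process, not the parent.

```python
import multiprocessing as mp


def parse_untrusted_file(path, result_queue):
    """Hypothetical high-risk operation, run in a child process so even a
    segfault in a native parser here cannot take down the parent."""
    with open(path, "rb") as f:
        data = f.read()
    # ... risky parsing would go here ...
    result_queue.put(len(data))


def parse_safely(path, timeout=10.0):
    """Run the risky operation in a separate process and survive its failure."""
    queue = mp.Queue()
    child = mp.Process(target=parse_untrusted_file, args=(path, queue))
    child.start()
    child.join(timeout)

    if child.is_alive():          # hung child: kill it and carry on
        child.terminate()
        child.join()
        return None
    if child.exitcode != 0:       # crashed child: degrade gracefully
        return None
    return queue.get()


if __name__ == "__main__":
    print(parse_safely("example.bin"))
```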
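
One context-dependent example of sanity-checking a file before the rest of the program touches it: existence, a plausible size, valid JSON, and the presence of required fields. The size limit and key names are assumptions.

```python
import json
from pathlib import Path

MAX_SIZE_BYTES = 5 * 1024 * 1024        # arbitrary upper bound for this sketch
REQUIRED_KEYS = {"version", "rules"}    # hypothetical required fields


def load_config(path: str) -> dict:
    """Reject anything suspicious before the rest of the program sees it."""
    p = Path(path)
    if not p.is_file():
        raise ValueError(f"{path} does not exist or is not a regular file")

    size = p.stat().st_size
    if size == 0 or size > MAX_SIZE_BYTES:
        raise ValueError(f"{path} has a suspicious size: {size} bytes")

    try:
        config = json.loads(p.read_text(encoding="utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError(f"{path} is not valid JSON: {exc}") from exc

    if not isinstance(config, dict):
        raise ValueError(f"{path} does not contain a JSON object")

    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"{path} is missing required keys: {missing}")
    return config
```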
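
A sketch of tests that force error paths, assuming pytest and the hypothetical `load_config` from the previous sketch: each test feeds a deliberately broken file and asserts the failure surfaces as a clear error rather than a crash or a silent success.

```python
import json

import pytest

from myapp.config import load_config   # hypothetical home of the load_config sketch above


def test_corrupt_config_is_rejected(tmp_path):
    """Force the error path: a truncated file must raise a clear error."""
    bad_file = tmp_path / "config.json"
    bad_file.write_text('{"version": 1, "rules": [')    # deliberately truncated JSON

    with pytest.raises(ValueError):
        load_config(str(bad_file))


def test_missing_keys_are_rejected(tmp_path):
    """Force another failure mode: valid JSON with required fields missing."""
    bad_file = tmp_path / "config.json"
    bad_file.write_text(json.dumps({"version": 1}))     # no "rules" key

    with pytest.raises(ValueError):
        load_config(str(bad_file))
```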
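
A sketch combining the two logging points: forward logs to a central place (a `SysLogHandler` stands in for whatever collector or vendor agent you actually use), never swallow unexpected errors on the critical path, and anonymise customer identifiers before they reach the logs. The handler address, logger name, and hashing scheme are all assumptions.

```python
import hashlib
import logging
import logging.handlers

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)

# Stand-in for "forward logs to a central location": a UDP syslog handler.
# In practice this would point at your log collector or be a vendor agent.
central = logging.handlers.SysLogHandler(address=("localhost", 514))
central.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(central)


def anonymise(customer_email: str) -> str:
    """Log a stable pseudonym instead of raw customer data."""
    return hashlib.sha256(customer_email.encode("utf-8")).hexdigest()[:12]


def charge_customer(email: str, amount_cents: int) -> bool:
    try:
        # ... call the payment provider here ...
        return True
    except Exception:
        # Never fail silently on the critical path: record it with context,
        # but without exposing who the customer is.
        logger.exception(
            "charge failed customer=%s amount_cents=%d", anonymise(email), amount_cents
        )
        return False
```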
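
A back-of-envelope version of the “quantify the impact” exercise, with completely made-up numbers; the result is a rough ceiling on what mitigation is worth per incident.

```python
# All numbers are made up for illustration.
revenue_per_hour = 50_000        # average revenue the product generates per hour
outage_hours = 3                 # "crashes irrecoverably for a couple of hours"
engineers_on_incident = 12
engineer_cost_per_hour = 120
churn_cost = 200_000             # rough guess at customers lost to the outage

direct_loss = revenue_per_hour * outage_hours
incident_cost = engineers_on_incident * engineer_cost_per_hour * outage_hours
total = direct_loss + incident_cost + churn_cost

print(f"Estimated cost of one {outage_hours}h outage: ${total:,}")
```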
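
And a sketch of treating config changes like code changes: the checked-in config file is validated by a test (pytest assumed, path and keys hypothetical) in the same CI pipeline that gates code, so a bad config can’t ship without review and automated checks.

```python
import json
from pathlib import Path

CONFIG_PATH = Path("configs/production.json")   # hypothetical checked-in config file


def test_production_config_is_valid():
    """Config changes go through the same review and CI gate as code changes."""
    config = json.loads(CONFIG_PATH.read_text(encoding="utf-8"))

    rollout = config.get("rollout_percent")
    assert isinstance(rollout, (int, float)) and 0 <= rollout <= 100

    flags = config.get("feature_flags")
    assert isinstance(flags, dict)
    for name, value in flags.items():
        assert isinstance(value, bool), f"flag {name} must be a boolean"
```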