Retro: CrowdStrike outage

Bullet points from the video

  • There are no real financial or criminal consequences for outages like this
    • The stock market may or may not reflect such outages, but that’s about it
    • Mental note to check out the aftermath of other big outages in the past.
      • How are we defining big?

Mitigation

Such risks can be mitigated at the software, deployment, and testing levels.

  • Canary deployments
  • Staged rollouts (see the cohort-hashing sketch after this list)
  • Rollback strategy
  • Isolate high-risk operations in their own process, so that a crash there doesn’t take down your whole program (see the process-isolation sketch after this list)
  • Sanity-check files before you load them into your program. How this is done is very context-dependent (a minimal example follows this list).
  • Write integration tests that force errors, and make sure they’re handled correctly (see the test sketch after this list)
    • On my reading list: Chaos Engineering
    • Always assume that any part of your system can fail, and engineer the rest of the system to gracefully catch and handle that failure when it happens
  • Forward logs to a central location and monitor them (see the logging sketch after this list)
    • Make sure your code never fails silently. Log anything unexpected that happens on the critical path, then pay attention to it. Anonymise customer data, of course.
  • As mentioned by Gergely, dog-fooding + manual QA
  • Quantify the impact of your company’s product crashing irrecoverably for a couple of hours. This thought exercise can help identify how much you have to invest in your outage mitigation strategy.
  • Treat config changes the same way you treat code changes.
    • Related reading: Infrastructure as Code
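
A minimal sketch of the staged-rollout idea: hash each host into a stable bucket and only ship the new version to hosts whose bucket falls under the current rollout percentage. The host IDs and the `in_rollout` helper are made up for illustration, not anything vendor-specific.

```python
import hashlib

def in_rollout(host_id: str, rollout_percent: int) -> bool:
    """Deterministically place a host in a 0-99 bucket and check it against
    the current rollout percentage. The same host always lands in the same
    bucket, so widening the percentage only ever adds hosts."""
    digest = hashlib.sha256(host_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Example: ship to 1% of the fleet first, then widen in stages.
for host in ["host-001", "host-002", "host-003"]:
    if in_rollout(host, rollout_percent=1):
        print(f"{host}: gets the new version")
    else:
        print(f"{host}: stays on the old version")
```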
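
One way to isolate high-risk work, as a sketch: run the risky parsing in a child process via `multiprocessing`, so a crash or hang there degrades the result to `None` instead of killing the main program. `risky_parse` and the file name are hypothetical.

```python
import multiprocessing as mp
from queue import Empty

def risky_parse(path, out):
    """Runs in a child process; a crash here can't take down the parent."""
    with open(path, "rb") as f:
        data = f.read()
    # ... real parsing of untrusted content would go here ...
    out.put(len(data))

def parse_in_child(path, timeout=5.0):
    """Return the parse result, or None if the child crashed or hung."""
    out = mp.Queue()
    child = mp.Process(target=risky_parse, args=(path, out))
    child.start()
    child.join(timeout)
    if child.is_alive():        # hung: kill it rather than hang the caller
        child.terminate()
        child.join()
        return None
    if child.exitcode != 0:     # crashed: degrade gracefully instead of dying
        return None
    try:
        return out.get(timeout=1)
    except Empty:
        return None

if __name__ == "__main__":
    print(parse_in_child("channel_update.bin"))  # hypothetical file name
```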
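
A sketch of what sanity-checking an update file might look like, assuming a hypothetical JSON content format with `version` and `rules` keys (the real checks depend entirely on your format): reject the file and keep the last known-good content rather than crash on it.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"version", "rules"}  # hypothetical format, for illustration

def sanity_check(path: Path) -> bool:
    """Reject obviously broken update files before the rest of the program
    ever touches them. What counts as 'broken' is context-dependent."""
    raw = path.read_bytes()
    if not raw or raw.count(b"\x00") == len(raw):
        return False                    # empty or all-zero file
    try:
        doc = json.loads(raw)
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False                    # not even parseable
    if not isinstance(doc, dict) or not REQUIRED_KEYS.issubset(doc):
        return False                    # structurally wrong
    if not isinstance(doc["rules"], list) or not doc["rules"]:
        return False                    # no usable content
    return True

def load_update(path: Path, current: dict) -> dict:
    """Swap in the new content only if it passes the checks; otherwise
    keep running on the last known-good version."""
    if sanity_check(path):
        return json.loads(path.read_bytes())
    return current
```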
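
Continuing with the hypothetical `load_update` above, an integration test that deliberately forces the failure path and asserts the program degrades instead of crashing (pytest style; the `updater` module name is an assumption):

```python
# test_update_loading.py: run with pytest
from updater import load_update  # hypothetical module holding the sketch above

def test_corrupt_update_file_is_rejected(tmp_path):
    """Feed the loader a garbage file and make sure it falls back to the
    last known-good content instead of raising or loading junk."""
    bad_file = tmp_path / "channel_update.json"
    bad_file.write_bytes(b"\x00" * 1024)       # simulate a corrupt push

    known_good = {"version": 41, "rules": ["keep-me"]}
    result = load_update(bad_file, current=known_good)

    assert result == known_good                # degraded, not dead
```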
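
For the “never fail silently” point: a sketch of logging unexpected critical-path failures with a pseudonymised customer identifier. The handler that actually forwards logs to a central system is left out; `apply_update` and its arguments are made up for illustration.

```python
import hashlib
import logging

logging.basicConfig(level=logging.INFO)  # in production, attach a handler
                                         # that ships to your central log system
logger = logging.getLogger("agent")

def anonymise(customer_id: str) -> str:
    """Log a stable pseudonym instead of the raw customer identifier."""
    return hashlib.sha256(customer_id.encode("utf-8")).hexdigest()[:12]

def apply_update(customer_id: str, payload: bytes) -> bool:
    try:
        if not payload:                  # stand-in for the real critical-path work
            raise ValueError("empty update payload")
        return True
    except Exception:
        # Never swallow this silently: record it with enough context to
        # investigate, but without raw customer data.
        logger.exception("update failed for customer=%s", anonymise(customer_id))
        return False
```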