Fool Me Once: Building a Culture of "Shared Outages"

July 12

Calvin French-Owen
CTO & Co-Founder
Segment

Over the past five years of being on-call, seeing hundreds of production outages, and working with a growing SRE team, I've started to see patterns in what makes teams work effectively.

In this talk, I'd like to walk through an actual production incident: the tooling we used, what we thought was going on, how we diagnosed it over time, how we eventually got to the root cause, and how we ran the postmortem afterward.

I think there are a ton of great 'general' guides out there, like the Google SRE book, or posts on a culture of blameless postmortems. But few actually tell the story and walk through a 'play-by-play' of how an outage was handled.

I'd like to share the exact details of what we did and why, along with some tips on the tools and processes that have made the biggest difference to reliability.
