When we encounter undesirable outcomes, there is a natural instinct to look back, find something that went wrong, and fix it. But looking back in this way doesn’t help us as much as we think, because in hindsight we already know what went wrong and what it took to fix it. To the responders in the incident, that knowledge only came after the hard stuff: detecting, diagnosing, and repairing the problem. In this talk we’ll use the question “how did it make sense at the time?” to flip perspectives both concretely and philosophically.
Our systems are built and operated by humans making decisions; what we see when we look at “failure” is people taking actions that made sense at the time even if they seem ridiculous afterwards. For example, when overloaded, a web service experiences catastrophic collapse which, in hindsight, might have been avoided by a different rate limiter setting. But this setting was configured — perhaps months ago — by a developer who intended the service to be robust!
How did the configuration — which later enabled sadness — make sense at the time? What might we change to help developers make better decisions in the future? In this talk, we’ll explore the basics of failure in complex systems, the theory and practice of how it made sense at the time, and actions you can take to weave this perspective into your software development and incident lifecycles to help teams be more resilient in the face of complexity.
Interview:
What's the focus of your work these days?
I’m the tech lead at Stripe on the API events team, focused on things like webhooks and notifications. I work collaboratively with other teams at Stripe, helping large projects row in the same direction. My focus really depends on the day, but it includes design, implementation, and operations (the whole software development lifecycle), all aimed at delivering impact and success to Stripe users, my coworkers, and the company as a whole. I care about reliability a lot!
What would you say is the motivation behind your talk?
I’m really curious about and interested in the question of safety and reliability in complex socio-technical systems, like high-growth tech companies. Specifically, it's interesting to take one concept – in this case, ‘How did it make sense at the time?’, a cognitive perspective on what happens when things go wrong – and share it with people. This gives them an arrow for their quiver or a tool for their toolkit. I'm really happy to be speaking with the other folks on the track; I think the effective SRE track has a high overlap, and is very interesting for the attendees and the community.
How would you describe the persona and level of the target audience for your session?
This is really interesting and something I tweeted: “Are you the persona and target of this talk? Who are you?” I think the target audience is anyone who cares a lot about reliability and figuring out how things go right, how they go wrong and learning from that. That generally tends to be people who are a little bit more senior, rather than new college grads. It's people who have had some experience, who've seen some things happen.
Things fail, even though we have really smart people and really good technology. What's going on? This session is for people who are at the beginning of their exploration of the domain, which I imagine is why they come to the ‘Just Culture’ track in the first place. Maybe not folks who are full safety science nerds; if you’ve read every paper by Dekker, Woods and Cook, this talk may be more of a refresher.
What would you like this persona to walk away with after your presentation?
There are two main things. One is some familiarity with the concepts. I’d love for the target persona to walk away with a better understanding of why it makes sense to ask “How did it make sense at the time?” Why should I care?
The second thing I’d like them to walk away with is the knowledge of how to pull some threads to actually make change at their organizations. Both as a concept and as a tool.
- Where would you ask?
- What are the options for weaving this into your post-incident activities?
- How might you leverage this outside of the incident lifecycle?
Speaker
Jacob Scott
Staff Software Engineer @stripe
Jacob is a technologist who is deeply curious about reliability in complex socio-technical (software) systems. He is currently a staff software engineer in the Platform & Ecosystem group at Stripe, focused on user-facing event systems. Outside of work, he might be found at a nearby park with his one-year-old daughter or pursuing his avocation of collecting employees-only tech swag. Do you have a Facebook “illuminati” hoodie you are willing to part with? DM him on Twitter!