How Did It Make Sense at the Time? Understanding Incidents As They Occurred, Not as They Are Remembered

When we encounter undesirable outcomes, there is a natural instinct to look back, find something that went wrong, and fix it. But looking back in this way doesn’t help us as much as we think, because in hindsight we already know what went wrong and what it took to fix it. To the responders in the incident, that knowledge came only after the hard part: detecting, diagnosing, and repairing the problem. In this talk we’ll use the question “how did it make sense at the time?” to flip perspectives both concretely and philosophically.

Our systems are built and operated by humans making decisions; what we see when we look at “failure” is people taking actions that made sense at the time even if they seem ridiculous afterwards. For example, when overloaded, a web service experiences catastrophic collapse which, in hindsight, might have been avoided by a different rate limiter setting. But this setting was configured — perhaps months ago — by a developer who intended the service to be robust!

How did the configuration, which later enabled sadness, make sense at the time? What might we change to help developers make better decisions in the future? In this talk, we’ll explore the basics of failure in complex systems, the theory and practice of how it made sense at the time, and actions you can take to weave this perspective into your software development and incident lifecycles to help teams be more resilient in the face of complexity.

Interview:

What's the focus of your work these days?

I’m the tech lead at Stripe on the API events team, focused on things like webhooks and notifications. I work collaboratively with other teams at Stripe, helping large projects row in the same direction. My focus really depends on the day, but it includes design, implementation, and operations: the whole software development lifecycle. I’m trying to deliver impact and success to Stripe users, my coworkers, and the company as a whole. I care about reliability a lot!

What would you say is the motivation behind your talk?

I’m really curious about safety and reliability in complex socio-technical systems, like high-growth tech companies. Specifically, it's interesting to take one concept (in this case, ‘How did it make sense at the time?’, a question that offers a cognitive perspective on what happens when things go wrong) and share it with people. This gives them an arrow for their quiver or a tool for their toolkit. I'm really happy to be speaking with the other folks on the track; I think the effective SRE track has a high overlap, and is very interesting for the attendees and the community.

How would you describe the persona and level of the target audience for your session?

This is really interesting and something I tweeted: “Are you the persona and target of this talk? Who are you?” I think the target audience is anyone who cares a lot about reliability and figuring out how things go right, how they go wrong and learning from that. That generally tends to be people who are a little bit more senior, rather than new college grads. It's people who have had some experience, who've seen some things happen. 

Things fail - even though we have really smart people and really good technology. What's going on? This session is for people who are at the beginning of their exploration of the domain, which I imagine is why they come to the ‘Just Culture’ track in the first place. Maybe not folks who are full safety science nerds; if you’ve read every paper by Dekker, Woods and Cooke, this talk may be more of a refresher.

 

What would you like this persona to walk away with after your presentation?

There are two main things. The first is some familiarity with the concepts. I’d love for the target persona to walk away with a better understanding of why it makes sense to ask “How did it make sense at the time?” and why they should care.

The second thing I’d like them to walk away with is the knowledge of how to pull some threads to actually make change at their organizations, using this question both as a concept and as a tool:

  • Where would you ask? 
  • What are the options for weaving this into your post-incident activities?
  • How might you leverage this outside of the incident lifecycle? 

Speaker

Jacob Scott

Staff Software Engineer @stripe

Jacob is a technologist who is deeply curious about reliability in complex socio-technical (software) systems. He is currently a staff software engineer in the Platform & Ecosystem group at Stripe, focused on user-facing event systems. Outside of work, he might be found at a nearby park with his one-year-old daughter or pursuing his avocation of collecting employees-only tech swag. Do you have a Facebook “illuminati” hoodie you are willing to part with? DM him on Twitter!

Date

Monday Dec 5 / 11:20AM PST (50 minutes)

Topics

Engineering Culture, Complex Systems

From the same track

Session Engineering Culture

Generous, High Fidelity Communication Is the Key to a Safe, Effective Team

Monday Dec 5 / 09:00AM PST

A team's ability to communicate effectively and disagree productively is directly related to its resilience towards incidents and interruptions.

Denise Yu

Engineering Manager and Rubyist

Session Engineering Culture

Reckoning with the Harm We Do: In Search of Restorative Just Culture in Software and Web Operations

Monday Dec 5 / 12:30PM PST

“Psychological Safety” and “Blameless” postmortems are not enough. We’ve heard that we need a “Just Culture” but does that matter if your people are “stressed, exhausted, depleted, spent, drained”?

Jessica DeVita

Sr. Software Engineering Manager - SRE @Microsoft

Session Engineering Culture

Recipes for Blameless Accountability

Monday Dec 5 / 10:10AM PST

 

Michelle Brush

Engineering Manager SRE @Google