You are viewing content from a past/completed QCon - November 2020


Architecting for Confidence: Building Resilient Systems

For any complex system, there is a wide array of activities that can increase system reliability and operator confidence. Each activity contributes in a different way.

If your system is safety-sensitive, you may invest heavily in pre-production testing strategies. If you want to holistically understand the effect of a change on individual users, you may use a sticky canary. If you don’t know your resource limits or bottlenecks, a load test could be useful. To validate design decisions around reliability mechanisms that don’t get exercised regularly, you may run chaos experiments. All these activities converge to build a stronger system that holds up to the pressures of production, but eventually your operators will have to engage to triage outages. When they do, it’s important they are comfortable doing so.

In this track, we will delve into each of these areas to provide attendees with the tools they need to build resilient systems and empower operators.

From this track


Less Mess, Less Stress: The Reliability Benefits of Custom Tools

Wednesday Nov 4 / 10:00AM PST

Tooling is an often overlooked, yet critical, component of infrastructure for most engineering teams. Adopting new infrastructure is easier than ever thanks to the cloud-native movement, open-source community, and the proliferation of Infrastructure-as-a-Service (IaaS) providers. However, each...

Daniel Hochman

Platform Engineer @Lyft


A Sticky Situation: How Netflix Gains Confidence in Changes

Wednesday Nov 4 / 10:50AM PST

How do you know whether a change will affect end users in a negative way? As interactions in distributed systems grow increasingly complex, it can be challenging to get an answer to this question. One approach is to use a canary in which we introduce a new service into the environment, users...

Haley Tucker

Senior Software Engineer, Resilience Team @Netflix


User Simulation for Rapid Outage Mitigation

Wednesday Nov 4 / 11:40AM PST

Uber operates in over 10,000 cities across the world, with different offerings and features in each market. People all over the world rely on Uber for critical aspects of their daily lives. Uber can be their source of income, their commuting strategy, or their ride to the hospital. In COVID...

Carissa Blossom

Member of Production Engineering team for Eats & Delivery @Uber

Interactive Session

Operational Excellence Panel

Wednesday Nov 4 / 12:30PM PST

Being on call for a production system can be stressful whether it is your first time or you have been carrying a pager for years. When that alert goes off, will you be prepared? Will your system reliability mechanisms behave as intended? If not, are you able to debug and understand what’s...

Carissa Blossom

Member of Production Engineering team for Eats & Delivery @Uber

Tammy Bryant Butow

Principal Site Reliability Engineer @Gremlin

Suudhan Rangarajan

Senior Software Engineer @Netflix


Wednesday Nov 4 / 09:00AM PST


Track Host

Haley Tucker

Senior Software Engineer, Resilience Team @Netflix

Haley Tucker is a member of the Resilience Engineering team at Netflix where she is responsible for improving the reliability of the Netflix ecosystem by supporting developers and building trustable and safe tooling. Prior to that, she worked on the Playback Features team where her services...

