Architecting for Confidence: Building Resilient Systems

For any complex system, there is a wide array of activities that can increase system reliability and operator confidence. Each activity contributes in a different way.

If your system is safety-sensitive, you may invest heavily in pre-production testing strategies. If you want to holistically understand the effect of a change on individual users, you may use a sticky canary. If you don’t know your resource limits or bottlenecks, a load test could be useful. To validate design decisions around reliability mechanisms that don’t get exercised regularly, you may run chaos experiments. All these activities converge to build a stronger system that holds up to the pressures of production, but eventually your operators will have to engage to triage outages. When they do, it’s important they are comfortable doing so.

In this track, we will delve into each of these areas to provide attendees with the tools they need to build resilient systems and empower operators.


Haley Tucker

Senior Software Engineer, Resilience Team @Netflix

Haley Tucker is a member of the Resilience Engineering team at Netflix, where she is responsible for improving the reliability of the Netflix ecosystem by supporting developers and building trustworthy and safe tooling. Prior to that, she worked on the Playback Features team, where her services filled a key role in enabling Netflix to stream amazing content to millions of members on thousands of device types worldwide. Before Netflix, Haley spent a few years building near-real-time command and control systems at Raytheon. She then moved into a consulting role where she built custom billing and payment solutions for cloud and telephony service providers. Haley enjoys applying new technologies to develop robust and maintainable systems, and the scale at Netflix has been a unique and exciting challenge. Haley received a BS in Computer Science from Texas A&M University.

Wednesday Nov 4 / 09:00AM PST


From this track


User Simulation for Rapid Outage Mitigation

Wednesday Nov 4 / 11:40AM PST

Uber operates in over 10,000 cities across the world, with different offerings and features in each market. People all over the world rely on Uber for critical aspects of their daily lives. Uber can be their source of income, their commuting strategy, or their ride to the hospital. In COVID times, our role is all the more critical, with Uber Eats acting as the central source of income for an increasing number of restaurants and couriers. With so much at stake, we don’t have the luxury of waiting for aggregate production tracking metrics to show that some discernible percentage of users is experiencing an issue. Changes to our architecture of 4,000+ microservices are rolled out by both engineering and operations teams, using multiple different tools at the city, zone, region, and global levels, at a frequency that is impossible to coordinate. In this environment, how does Uber maintain a high level of reliability and prevent outages before they are felt by users?

In this talk, I walk through the independent, external monitoring service that Uber developed to identify issues in production at the individual city level all across the globe, and how we leveraged composable integration tests simulating thousands of diverse test users to cut our time to mitigation in half. Attendees will also learn how Uber surfaces and predicts its most dire outages and how the combination of machine learning and tracing enables us to reliably narrow down the root cause of an outage.

Carissa Blossom

Member of Production Engineering team for Eats & Delivery @Uber
