You are viewing content from a past/completed QCon - November 2020

A Sticky Situation: How Netflix Gains Confidence in Changes

What You'll Learn

1. Hear about sticky canaries, what they are, and how they can help.

2. Learn how to build confidence in changes to your system by tying canaries to the end user experience.

How do you know whether a change will affect end users in a negative way? As interactions in distributed systems grow increasingly complex, it can be challenging to get an answer to this question. 

One approach is to use a canary in which we introduce a new service into the environment, users are randomly routed to that service, and we compare the performance of that service to the current production build. However, this doesn’t really tell us anything about what the end users are experiencing -- it focuses on service-level metrics. In reality, a service may be happily serving successful requests, yet the end user is not able to use your product.

As a result, it can be useful to have a methodology which enables teams to observe the full impact of a change on end users. In this talk, I will demonstrate how Netflix uses sticky canaries to fulfill this need and I will highlight use cases where we have employed this methodology successfully. I will also cover the key platform features and tools to include when implementing sticky canaries.
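The difference between routing requests at random and keeping a user "sticky" can be sketched in a few lines. This is only an illustration, not Netflix's implementation: it assumes that "sticky" means each user stays pinned to the same group for the duration of an experiment, and the function name `assign_group`, the group labels, and the percentage parameters are all hypothetical.

```python
import hashlib


def assign_group(user_id: str, experiment: str,
                 canary_pct: float = 1.0, control_pct: float = 1.0) -> str:
    """Deterministically map a user to 'canary', 'control', or 'production'.

    Hypothetical helper: hashing the experiment name together with the
    user id means the same user always lands in the same group for a
    given experiment, while different experiments shuffle users
    independently of one another.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") / 2**64 * 100
    if bucket < canary_pct:
        return "canary"
    if bucket < canary_pct + control_pct:
        return "control"
    return "production"


if __name__ == "__main__":
    # The same user gets the same answer on every call (assumed experiment name).
    print(assign_group("user-123", "new-build-rollout", 10, 10))
    print(assign_group("user-123", "new-build-rollout", 10, 10))
```

Because the canary and control groups are made up of whole users rather than individual requests, user-level indicators can be compared between the two groups, which is what ties the canary to the end user experience.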

Key Takeaways

  • What is a sticky canary?
  • What types of use cases may benefit from this methodology?
  • Platform and tooling investments required to make this a success.

What are you working on these days?

My team owns a collection of tools that allow users to run experiments on their production systems. The simplest is a standard deployment canary, where I have a new build and I want to see if it's going to work before I roll it out completely. We also provide load tests where we slowly ramp up traffic to a cluster, automatically monitor for impact, and shut the test down when needed, in order to find the point at which the service reaches its maximum throughput. We also have a collection of chaos tools that let you run experiments where you fail a particular subset of calls and see whether you're resilient to those types of failures. We own a platform for enabling these types of tests, and we also consult with teams to help them work through any reliability challenges they may have.

What is the goal for your presentation?

The goal for my presentation is to show people something we've found really useful for many of our experiment types, called sticky canaries. You can think of it as a mini product A/B test, but for testing your system and infrastructure changes. It has allowed us to get stronger signals about the effect of a change on end users, which helps teams build confidence in a change when they see it is not hurting our key performance indicators. I will share the strategy that we've taken to achieve this because I think it is a useful tool that other companies can apply in their own domains.

And what would you want the attendees to walk out of your presentation with?

The key takeaway is a new methodology they can use to build confidence in their own systems. It doesn’t come for free -- there is platform tooling that would need to be built -- but I think it is a useful pattern for certain types of use cases. I’ll talk about which use cases are a good fit and the benefits of using this methodology.


Haley Tucker

Senior Software Engineer, Resilience Team @Netflix
Haley Tucker is a member of the Resilience Engineering team at Netflix where she is responsible for improving the reliability of the Netflix ecosystem by supporting developers and building trustable and safe tooling. Prior to that, she worked on the Playback Features team where her services...
