You are viewing content from a past/completed QCon Plus - November 2020

Session

A Sticky Situation: How Netflix Gains Confidence in Changes

How do you know whether a change will affect end users in a negative way? As interactions in distributed systems grow increasingly complex, it can be challenging to get an answer to this question. 

One approach is to use a canary in which we introduce a new service into the environment, users are randomly routed to that service, and we compare the performance of that service to the current production build. However, this doesn’t really tell us anything about what the end users are experiencing -- it focuses on service-level metrics. In reality, a service may be happily serving successful requests, yet the end user is not able to use your product.

As a result, it can be useful to have a methodology which enables teams to observe the full impact of a change on end users. In this talk, I will demonstrate how Netflix uses sticky canaries to fulfill this need and I will highlight use cases where we have employed this methodology successfully. I will also cover the key platform features and tools to include when implementing sticky canaries.

Key Takeaways

  • What is a sticky canary?
  • What types of use cases may benefit from this methodology?
  • Platform and tooling investments required to make this a success.

Main Takeaways

1 Hear about sticky canaries, what they are, and how they can help.

2 Learn how to build confidence in changes to your system by tying canaries to the end user experience.


What are you working on these days?

My team owns a collection of tools that allow users to run experiments on their production systems. These range from a standard deployment canary, where I have a new build and I want to see whether it works before I roll it out completely, to load tests, where we slowly ramp up traffic to a cluster. In a load test, we automatically monitor for impact and decide when to shut the test down, in order to find the point at which the service reaches its maximum throughput. We also have a collection of chaos tools that let you run experiments where you fail a particular subset of calls and see whether you're resilient to those types of failures. We own a platform for enabling these types of tests, and we also consult with teams to help them work through any reliability challenges they may have.

What is the goal for your presentation?

The goal for my presentation is to show people something we've found really useful for many of our experiment types, called sticky canaries. You can think of it as a mini product A/B test, but for testing your system and infrastructure changes. It has allowed us to get stronger signals about the effect of a change on end users, which helps teams build confidence in a change when they see it is not hurting our key performance indicators. I will share the strategy that we've taken to achieve this because I think it is a useful tool that other companies can apply in their own domains.
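To make the A/B-test analogy concrete, here is a minimal sketch of one way "sticky" assignment could work. This page does not describe how Netflix actually allocates users, so everything here is an assumption for illustration: the function name `assign_group`, the group labels, the percentage parameters, and the choice of hashing a stable user identifier. The point is only that the same user always lands in the same group for the life of an experiment, so end-user metrics can be compared between groups.

```python
import hashlib


def assign_group(user_id: str, experiment: str,
                 canary_pct: float = 1.0, control_pct: float = 1.0) -> str:
    """Deterministically map a user to 'canary', 'control', or 'baseline'.

    Hypothetical helper: hashing (experiment, user_id) gives every user a
    stable bucket in [0, 100), so repeated requests from the same user are
    always routed to the same group.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 100).
    bucket = int.from_bytes(digest[:8], "big") / 2**64 * 100
    if bucket < canary_pct:
        return "canary"
    if bucket < canary_pct + control_pct:
        return "control"
    return "baseline"


if __name__ == "__main__":
    # The same user gets the same answer on every call.
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, assign_group(uid, "new-build-rollout", 10.0, 10.0))
```

Including the experiment name in the hash input means a user's group in one experiment is independent of their group in another, which is a common choice in A/B allocation schemes generally, not something this page attributes to Netflix.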

And what would you want the attendees to walk out of your presentation with?

The key takeaway is a new methodology they can use to build confidence in their own systems. It doesn’t come for free -- there is platform tooling that would need to be built -- but I think it is a useful pattern for certain types of use cases. I’ll talk about which use cases are a good fit and the benefits of using this methodology.


Speaker

Haley Tucker

Senior Software Engineer, Resilience Team @Netflix

Haley Tucker is a member of the Resilience Engineering team at Netflix where she is responsible for improving the reliability of the Netflix ecosystem by supporting developers and building trustable and safe tooling. Prior to that, she worked on the Playback Features team where her services...


From the same track

Session

User Simulation for Rapid Outage Mitigation

Wednesday Nov 4 / 02:40PM EST

Uber operates in over 10,000 cities across the world, with different offerings and features in each market. People all over the world rely on Uber for critical aspects of their daily lives. Uber can be their source of income, their commuting strategy, or their ride to the hospital. In COVID...

Carissa Blossom

Member of Production Engineering team for Eats & Delivery @Uber

Session

Less Mess, Less Stress: The Reliability Benefits of Custom Tools

Wednesday Nov 4 / 01:00PM EST

Tooling is an often overlooked, yet critical, component of infrastructure for most engineering teams. Adopting new infrastructure is easier than ever thanks to the cloud-native movement, open-source community, and the proliferation of Infrastructure-as-a-Service (IaaS) providers. However, each...

Daniel Hochman

Platform Engineer @Lyft

PANEL DISCUSSION

Operational Excellence Panel

Wednesday Nov 4 / 03:30PM EST

Being on call for a production system can be stressful whether it is your first time or you have been carrying a pager for years. When that alert goes off, will you be prepared? Will your system reliability mechanisms behave as intended? If not, are you able to debug and understand what’s...

Carissa Blossom

Member of Production Engineering team for Eats & Delivery @Uber

Tammy Bryant Butow

Principal Site Reliability Engineer @Gremlin

Suudhan Rangarajan

Senior Software Engineer @Netflix
