You are viewing content from a past QCon Plus (November 2020).

Track Overview

Architecting for Confidence: Building Resilient Systems

For any complex system, there is a wide array of activities that can increase system reliability and operator confidence. Each activity contributes in a different way.

If your system is safety-sensitive, you may invest heavily in pre-production testing strategies. If you want to holistically understand the effect of a change on individual users, you may use a sticky canary. If you don’t know your resource limits or bottlenecks, a load test could be useful. To validate design decisions around reliability mechanisms that don’t get exercised regularly, you may run chaos experiments. All these activities converge to build a stronger system that holds up to the pressures of production, but eventually your operators will have to engage to triage outages. When they do, it’s important they are comfortable doing so.

In this track, we will delve into each of these areas to provide attendees with the tools they need to build resilient systems and empower operators.


From this track

Session

Less Mess, Less Stress: The Reliability Benefits of Custom Tools

Wednesday Nov 4 / 01:00PM EST

Tooling is an often overlooked, yet critical, component of infrastructure for most engineering teams. Adopting new infrastructure is easier than ever thanks to the cloud-native movement, open-source community, and the proliferation of Infrastructure-as-a-Service (IaaS) providers. However, each...

Daniel Hochman

Platform Engineer @Lyft

Session

A Sticky Situation: How Netflix Gains Confidence in Changes

Wednesday Nov 4 / 01:50PM EST

How do you know whether a change will affect end users in a negative way? As interactions in distributed systems grow increasingly complex, it can be challenging to get an answer to this question. One approach is to use a canary in which we introduce a new service into the environment, users...

Haley Tucker

Senior Software Engineer, Resilience Team @Netflix

Session

User Simulation for Rapid Outage Mitigation

Wednesday Nov 4 / 02:40PM EST

Uber operates in over 10,000 cities across the world, with different offerings and features in each market. People all over the world rely on Uber for critical aspects of their daily lives. Uber can be their source of income, their commuting strategy, or their ride to the hospital. In COVID...

Carissa Blossom

Member of Production Engineering team for Eats & Delivery @Uber

Panel Discussion

Operational Excellence Panel

Wednesday Nov 4 / 03:30PM EST

Being on call for a production system can be stressful whether it is your first time or you have been carrying a pager for years. When that alert goes off, will you be prepared? Will your system reliability mechanisms behave as intended? If not, are you able to debug and understand what’s...

Carissa Blossom

Member of Production Engineering team for Eats & Delivery @Uber

Tammy Bryant Butow

Principal Site Reliability Engineer @Gremlin

Suudhan Rangarajan

Senior Software Engineer @Netflix


Speakers from this track

Daniel Hochman

Platform Engineer @Lyft

Daniel Hochman is the tech lead of the platform tools team at Lyft and the creator of Clutch, the open-source platform for infrastructure management. As an early engineer at Lyft, Daniel successfully guided the platform through the explosion of product and organizational growth. He wrote one of...

Haley Tucker

Senior Software Engineer, Resilience Team @Netflix

Haley Tucker is a member of the Resilience Engineering team at Netflix where she is responsible for improving the reliability of the Netflix ecosystem by supporting developers and building trustable and safe tooling. Prior to that, she worked on the Playback Features team where her services...

Carissa Blossom

Member of Production Engineering team for Eats & Delivery @Uber

Carissa Blossom is a member of the Production Engineering team for Eats & Delivery at Uber. She is also an Incident Commander for Ring0, Uber Engineering’s primary task force for critical outage mitigation. In her four years at Uber, she has served as Production Engineer for Eats,...

Tammy Bryant Butow

Principal Site Reliability Engineer @Gremlin

Tammy Butow is the principal SRE at Gremlin, where she works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using their control plane and API. Tammy previously led SRE teams at Dropbox...

Suudhan Rangarajan

Senior Software Engineer @Netflix

Suudhan Rangarajan works on the Playback API team at Netflix, responsible for ensuring that customers receive the best possible playback experience every time they click play. A few dozen playback microservices fill a key role in enabling Netflix to stream amazing content to 125M+ members on...

Track Date

Wednesday Nov 4 / 12:00PM EST

Track Host

Haley Tucker

Senior Software Engineer, Resilience Team @Netflix
