You are viewing content from a past/completed QCon Plus - November 2020


Less Mess, Less Stress: The Reliability Benefits of Custom Tools

Tooling is an often overlooked, yet critical, component of infrastructure for most engineering teams. Adopting new infrastructure is easier than ever thanks to the cloud-native movement, the open-source community, and the proliferation of Infrastructure-as-a-Service (IaaS) providers. However, each new infrastructure component comes with its own set of configuration, tooling, logs, and metrics, which increases cognitive load. Additionally, initial tooling strategies rarely scale to more complicated architectures or more stringent SLAs.

Clutch is an open-source platform from Lyft that empowers organizations to take control of their operator experience. While augmenting or replacing existing tools requires investment, it comes with significant benefits for developers and end-users. Clutch aims to streamline key actions in order to reduce the mean time to resolve incidents (MTTR), lower onboarding costs for new engineers, lower overall cognitive load when interacting with infrastructure, improve developer productivity, and even prevent incidents by reducing the chance of accidents during normal maintenance.

Main Takeaways

1. How an overreliance on vendor tooling leads to worse reliability outcomes.

2. How Lyft lowered MTTR for its most common alerts using custom tooling.

3. How Clutch can extend to your organization's operational use cases.


Daniel Hochman

Platform Engineer @Lyft

Daniel Hochman is the tech lead of the platform tools team at Lyft and the creator of Clutch, the open-source platform for infrastructure management. As an early engineer at Lyft, Daniel successfully guided the platform through the explosion of product and organizational growth. He wrote one of...



Wednesday Nov 4 / 01:00PM EST (40 minutes)


Architecting for Confidence: Building Resilient Systems



From the same track


User Simulation for Rapid Outage Mitigation

Wednesday Nov 4 / 02:40PM EST

Uber operates in over 10,000 cities across the world, with different offerings and features in each market. People all over the world rely on Uber for critical aspects of their daily lives. Uber can be their source of income, their commuting strategy, or their ride to the hospital. In COVID...

Carissa Blossom

Member of Production Engineering team for Eats & Delivery @Uber


A Sticky Situation: How Netflix Gains Confidence in Changes

Wednesday Nov 4 / 01:50PM EST

How do you know whether a change will affect end users in a negative way? As interactions in distributed systems grow increasingly complex, it can be challenging to get an answer to this question. One approach is to use a canary in which we introduce a new service into the environment, users...

Haley Tucker

Senior Software Engineer, Resilience Team @Netflix


Operational Excellence Panel

Wednesday Nov 4 / 03:30PM EST

Being on call for a production system can be stressful whether it is your first time or you have been carrying a pager for years. When that alert goes off, will you be prepared? Will your system reliability mechanisms behave as intended? If not, are you able to debug and understand what’s...

Carissa Blossom

Member of Production Engineering team for Eats & Delivery @Uber

Tammy Bryant Butow

Principal Site Reliability Engineer @Gremlin

Suudhan Rangarajan

Senior Software Engineer @Netflix

