You are viewing content from a past/completed QCon - May 2021

Session

Observing and Understanding Failures: SRE Apprentices

In this session, Tammy will share how Padawans and Jedis can inspire and teach us to help people of a wide variety of backgrounds, ages, and experience levels observe and understand failures in production. Tammy will share how she and a colleague created an SRE Apprentice program to hire and train new SREs who wanted a career change. She will cover practical lessons learned and things she'd change, and will explain how you can create and roll out a program for SRE Apprentices within your organization. She will also share feedback from the SRE Apprentices themselves. Is it difficult to observe and understand failures? Why is training from someone more experienced helpful? What are the hardest and easiest things to learn about observing and understanding failures as an SRE for 500 million+ users?


Speaker

Tammy Bryant Butow

Principal Site Reliability Engineer @Gremlin

Tammy Butow is the principal SRE at Gremlin, where she works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using their control plane and API. Tammy previously led SRE teams at Dropbox...

Date

Tuesday May 18 / 08:00AM PDT (40 minutes)

Track

Observability and Understandability in Production

Topics

Observability, Devops, Incident Management

From the same track

Session Observability

Resources & Transactions: A Fundamental Duality in Observability

Tuesday May 18 / 09:00AM PDT

Fundamentally, there are only two types of “things worth observing” when it comes to production systems: Resources and Transactions. The tricky (and interesting) part is that they’re entirely codependent. “Transactions” are the things that traverse your system and...

Ben Sigelman

CEO and co-founder @LightStepHQ, Co-creator @OpenTracing API standard

Session Incident Management

More More More! Why the Most Resilient Companies Want More Incidents

Tuesday May 18 / 07:00AM PDT

Major tech companies like Facebook, Google, and Netflix want more incidents, not fewer. NASA wants them so urgently that they import incidents from other companies. The reason? Postmortems. This talk will focus on how companies of any scale can improve their ingestion of understandability by...

John Egan

CEO and Co-Founder @Kintaba

Session Observability

Panel: Observability and Understandability

Tuesday May 18 / 10:00AM PDT

This panel will feature experienced practitioners who have worked in the engineering teams of Google, Facebook, Dropbox, and MongoDB. They are all now working at startups focused on helping engineers improve their ability to reduce downtime and customer-impacting failures. Hear from this panel...

Jason Yee

Director of Advocacy @Gremlin

John Egan

CEO and Co-Founder @Kintaba

Ben Sigelman

CEO and co-founder @LightStepHQ, Co-creator @OpenTracing API standard
