In theory, a solid SRE program is required for successful cloud IT services. In practice, not all SRE programs are created equal, and many attempts to establish one have failed or even backfired. What distinguishes a good SRE program from a bad one, and is there a framework for preventing catastrophe? This isn't an idle discipline. Many of the practices described in the original SRE materials, such as MTTR, root cause analysis, and runbook automation, are now hotly contested.
SRE deserves a critical and optimistic review. For example, are qualitative methods better suited than quantitative ones to evaluating the success of SRE? As a practice, SRE may be the software industry's best hope of holding off governmental regulation of availability, uptime, and the certification of software engineers. That hope can only be fulfilled if we have effective SRE practices.
From this track
Rethinking Reliability: What You Can (and Can't) Learn From Incidents
Tuesday Dec 6 / 09:00AM PST
This talk presents research collected from the VOID, an open database of public incident reports. Containing over 2,000 reports from almost 700 organizations, the database allows for more structured review of and research into software-related incident reporting.
Internet Incident Librarian & Senior Research Analyst @Verica
Did the Chaos Test Pass?
Tuesday Dec 6 / 10:10AM PST
People used to ask me all the time how to figure out if their chaos test has “passed,” and I’d always say “well, that’s a loaded question.” To confirm that a chaos test “passed,” we need to do verification of hypotheses - sometimes you’re trying to prove some system behavior occurred in response
Senior Site Reliability Engineering Specialist @Vanguard_Group
The Endgame of SRE
Tuesday Dec 6 / 11:20AM PST
The containers are deployed and the builds are green. YAML flows through the system, linted, reviewed, tested, and shipped with ease and regularity. Our intrepid SRE finds themself at a crossroads. The infrastructure is great, but teams still struggle to maintain error budgets.
Senior Principal Engineer and SRE Practice Leader @Equinix
The Eternal Sunshine of the Toil-Less Prod
Tuesday Dec 6 / 12:30PM PST
One of the most important decisions in building an SRE practice is what kind of work should be assigned to the SRE team, and in what percentages.
Director of the Cloud Services Black Belt Team @RedHat