Did the Chaos Test Pass?

People used to ask me all the time how to figure out whether their chaos test had “passed,” and I’d always say, “Well, that’s a loaded question.” To confirm that a chaos test “passed,” we need to verify hypotheses: sometimes you’re trying to prove that some system behavior occurred in response to a stimulus, while other times you’re trying to prove the absence of a change in system behavior. Take this already nebulous concept, and now think about making it generic enough that the core validation logic can be reused by any engineer running any kind of experiment on any one of our products. Then, try to do all of this in a complex distributed technical environment where it’s hard enough just to determine whether an application was healthy in the first place! That’s exactly the problem that the chaos engineering team at Vanguard has been tackling with the recent addition of automated assertions to its internal chaos tooling. In this talk, you’ll learn when it’s appropriate to define “pass” and “fail” for a chaos experiment and when it might not be, and you’ll get a peek under the hood at how Vanguard engineers automatically verify their hypotheses in the context of chaos experiments.


Speaker

Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group

Christina is a Senior Site Reliability Engineering Specialist in Vanguard's Chief Technology Office. She has worked at the company's Malvern, PA headquarters since graduating from Villanova University with an undergraduate degree in Computer Science. Throughout her career, she has developed an expansive skill set in front- and back-end web development, as well as cloud infrastructure and automation, with a specialization in Site Reliability Engineering. She has earned several Amazon Web Services certifications, including the Solutions Architect - Professional. Christina has also worked closely with the Women's Initiative for Leadership Success at Vanguard, both internally at the company and externally in the local community, to further the career advancement of women and girls - in particular within the tech industry. In her spare time (and when it is safe to do so!), Christina is passionate about traveling; she has visited over 20 different countries and 25 U.S. states so far!

Date

Tuesday Dec 6 / 10:10AM PST (50 minutes)

Topics

SRE Chaos Experiment

From the same track

Session SRE

The Endgame of SRE

Tuesday Dec 6 / 11:20AM PST

The containers are deployed and the builds are green. YAML flows through the system, linted, reviewed, tested, and shipped with ease and regularity. Our intrepid SRE finds themself at a crossroads. The infrastructure is great but teams still struggle to maintain error budgets.

Amy Tobey

Senior Principal Engineer and SRE Practice Leader @Equinix

Session SRE

Rethinking Reliability: What You Can (and Can't) Learn From Incidents

Tuesday Dec 6 / 09:00AM PST

This talk presents research collected from the VOID, an open database of public incident reports. Containing over 2,000 reports for almost 700 organizations, the database allows for more structured review and research about software-related incident reporting.

Courtney Nash

Internet Incident Librarian & Senior Research Analyst @Verica

Session SRE

The Eternal Sunshine of the Toil-Less Prod

Tuesday Dec 6 / 12:30PM PST

One of the most important decisions in building an SRE practice is what kind of work should be assigned to the SRE team, and in what percentages.

Sasha Rosenbaum

Director of the Cloud Services Black Belt Team @RedHat