Crypto Agility: Risk Management For When Your Service Can Take Down Everyone

In May 2022 the White House released a National Security memorandum for cryptographic agility: as part of the required migration to quantum-resistant algorithms, all agencies have been ordered to build features to “enable future updates to cryptographic algorithms and standards without the need to modify or replace the surrounding infrastructure”. At Google, we’ve been custodians of sensitive encrypted data for a while, and we’re extremely familiar with the hurdles involved in deploying agility for a mature software ecosystem.

Crypto agility is one aspect of cryptographic best practice: users should be able to easily upgrade security primitives and update (“rotate”) key material on a regular schedule. Rotating regularly reduces the blast radius of a potential compromise, and rotating immediately is required to remediate an actual compromise. But what issues does this run up against in practice? For one, there are standard change management limitations: any sort of change to key material carries reliability risk and can lead to customer outages. This reliability risk is often in tension with the security risk that enforcing crypto agility is meant to reduce.

We’ll walk through the system Google developed to instrument and monitor all cryptographic operations (decryptions, verifications, etc.) across 1000+ internal teams and millions of production jobs, and to mediate risky rotation on customers’ behalf, allowing for easy and safe crypto agility by default.

Key Takeaways

  1. Control reliability risk during feature rollout with a server-triggered flag flip. Use unique task ID hashing to deterministically trigger rollout and rollback.
  2. Monitor on multiple levels to ensure data comprehensiveness: direct rollout monitoring, error monitoring, and dependency monitoring.
  3. Use a long-term risk-mitigation framework for trading off security against reliability.
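The first takeaway, hashing a unique task ID to decide deterministically whether a task is in the rollout, can be illustrated with a minimal Python sketch. This is an assumption about how such a scheme might look, not a description of Google's implementation: the function names, the feature label, and the bucket count are all hypothetical. The idea is that the server only publishes a rollout fraction, and each task derives its own stable bucket from its ID, so no per-task state needs to be stored.

```python
import hashlib


def rollout_bucket(task_id: str, feature: str, buckets: int = 10_000) -> int:
    """Map a task ID to a stable bucket in [0, buckets).

    Hypothetical helper: the feature name is mixed into the hash so that
    different rollouts do not always hit the same tasks first.
    """
    digest = hashlib.sha256(f"{feature}:{task_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def is_enabled(task_id: str, feature: str, rollout_fraction: float,
               buckets: int = 10_000) -> bool:
    """Return True if this task falls inside the current rollout fraction.

    The fraction is assumed to come from a server-controlled flag. Raising it
    only adds tasks; lowering it removes the most recently added ones first.
    """
    return rollout_bucket(task_id, feature, buckets) < int(rollout_fraction * buckets)


if __name__ == "__main__":
    # Illustrative task IDs; real IDs would come from the job scheduler.
    tasks = [f"task-{i}" for i in range(10)]
    for fraction in (0.0, 0.5, 1.0):
        enabled = [t for t in tasks if is_enabled(t, "new-key-rotation", fraction)]
        print(fraction, enabled)
```

Because a task's bucket never changes, the set of tasks enabled at a lower fraction is always a subset of the set enabled at a higher one, so a rollback to an earlier fraction restores exactly the earlier population.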

What is the focus of your work these days?

I'm a senior software engineer on Google's internal Key Management System, which is part of the broader Security Foundations team. In the last few weeks, I've been spending most of my time on improving our build and release process for getting new changes out to production. We run on top of a very complex build and release system with interconnected components. We’re migrating and adapting this build system as part of our move to regionalize infrastructure so that we don’t push changes to multiple Cloud regions at the same time, because that reduces reliability. So right now, a lot of these components are failing to cooperate, and it's made releasing take an extra long time and become very toilsome for our oncall. In the longer term I'm leading a project to add instrumentation into cryptographic library methods so that we can pick up usage metadata from clients of our service about where and how our keys are used. These are collected from tens of millions of tasks in production from teams across Google that use encryption at rest, like Docs, Photos, Maps and Research. We ingest that back into a data pipeline that comes into our system and helps us guide our key management decisions.

And then the other thing I think about is how to do team building and team mentorship nowadays.

What's the motivation for your talk?

I'm curious to see how other people are facing this problem at different companies, of our size and other sizes. I think crypto agility is a challenge that affects anyone who uses cryptography. Google has unique challenges because of our scale, and because we have more than just server-side keys: we also return keys back to client services. And I'm also just excited to see what the audience thinks of the talk, and whether they've run into similar implementation challenges or organizational challenges.

Speaking of the audience, how would you describe the persona and level of the target audience for your session?

People who are either interested in real-world cryptography or interested in techniques for rolling out a risky feature at a large scale. One of the takeaways from this talk is how to roll out incrementally, without storing identifiers on the tasks or services that you're contacting.

Are there any other takeaways that you would like these folks that go to your session to take away as well?

I think these will be pretty chunky. They are tied back into the wider problem of trying to develop while maintaining an extremely reliable system and trying to roll out across a mature software ecosystem without affecting the latency or reliability of your clients. Broadly, difficult questions that I think everyone at the conference will be thinking about. 


Anvita Pandit

Senior Software Engineer @Google

Anvita Pandit is a senior software engineer on the Security Foundations team at Google, which manages encryption of user data at rest through the internal Key Management Services. Her ongoing project is to bring easy and safe cryptographic agility to all Google services by default in collaboration with the Tink team.

From the same track


Adopting Continuous Delivery at Lyft

All organizations, regardless of size, need to be able to make rapid changes and improvements in their constantly growing systems. How can we handle all this change while maintaining a reliable product? 

Tom Wanielista

Senior Staff Software Engineer @Lyft


Dark Side of DevOps

Topics like “you build it, you run it” and “shifting testing/security/data governance left” are popular: moving things to the earlier stages of software development, empowering engineers, shifting control definitely sounds good.

Mykyta Protsenko

Senior Software Engineer @Netflix


Log4Shell Response Patterns & Learnings From Them

In early December 2021, rumors about a remote code execution (RCE) vulnerability in Log4j began circulating on social media, dubbed Log4Shell. Over the next three days, those rumors were confirmed and the immense scope of the vulnerability became clear.

Tapabrata Pal

Vice President of Architecture @Fidelity