You are viewing content from a past/completed QCon - November 2020

Safe and Fast Deploys at Planet Scale

What You'll Learn

1. Hear about Uber’s journey from its on-prem system to a new hybrid, multi-cloud system that aims to automate many of the tasks involved.

2. Learn about software management and scalability, and the need to have these handled automatically by software.

Can you write code, review, test, verify, and ship it safely to millions of users, all in the same day? Absolutely!

Every week thousands of Uber engineers push out several thousand changes to millions of users. This means that during working hours, some part of the Uber system starts upgrading every single minute and that the system never runs one single version across the host fleet.

In this talk, we will explore the lessons we learned as we scaled from a small engineering team using a single datacenter to thousands of engineers that continuously deploy changes across multiple cloud platforms, with a focus on maintaining fast and reliable delivery of software changes.

What is the work that you're doing nowadays?

I'm the tech lead for Uber's stateless service platform. Essentially, I'm leading a project that's building our service platform, which is the way that you deploy any type of stateless software into Uber's data centers. We're currently replacing our old system with a new, automated multi-cloud system that can handle software deploys and scaling. Right now we are in the last phases of making it production-ready, and we are onboarding all of the workloads onto the new system. I also built much of the old platform, so I'm quite experienced in this area.

What were the things in the old system that you hadn't automated and are automating now in the new system?

It was a result of us growing. We automated certain things with the old system: we automated rollouts, and we already had safe rollouts across zones. What we are doing now also automates placement. Uber uses a hybrid cloud model where we own some zones as on-prem capacity and have some cloud capacity at Amazon and Google Cloud. One of the things we're automating is placement and the moving of services between cloud and on-prem as well as between clouds. We are also applying auto scaling at a large scale, which we haven't previously had.

Did you previously have no auto scaling, or was it just not as 'auto' as you wanted it to be?

We had some auto scaling, but it wasn't fully automated. We're moving to a hybrid cloud model, whereas previously we were mostly on-prem. Auto scaling becomes even more important when you are in the cloud, because you can scale your capacity more easily in the cloud than you can on-prem.

What are the goals for the talk and what do you want people to take away from the talk?

One of the goals is to take the audience on this journey from being a small company, which Uber was when I started, where everything could be managed manually. I'd like to take people through the stages of how and what you can automate as the company grows, while keeping the system lean and continuing to release at a high velocity every day. We've been able to keep that cadence even as the number of engineers has increased and the complexity of our systems has increased with the move to the multi-cloud model that we're deploying to. I'd also like to show people how you can deploy fast and independently in a safe way, and show all the safety mechanisms that go into making these fast deploys work at scale for thousands of engineers. There are some descriptions of that in the blog post that was the origin for this talk, but we've done a bit more work since then.

When you say "safe" deploys, what do you mean specifically?

Safe means you can have confidence that even if you've made a bad change to the code, you can still trust the system to roll back the change, fail over, or otherwise mitigate the outage for you before bad things happen. Safety is basically what enables you to move fast in a big system. You should be able to trust the system to save you when you do something that isn't correct.
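As an illustration of the rollback idea described in this answer, here is a minimal Python sketch of a zone-by-zone rollout that restores the previous version when a health check fails. It is not Uber's implementation: the function `staged_rollout`, the `is_healthy` callback, and the zone and version names are all hypothetical, chosen only to show the shape of the mechanism.

```python
from typing import Callable, Dict, List


def staged_rollout(
    zones: List[str],
    current: Dict[str, str],
    new_version: str,
    is_healthy: Callable[[str, str], bool],
) -> bool:
    """Upgrade zones one at a time; roll back every upgraded zone on failure.

    Returns True if all zones end up on new_version, False if rolled back.
    All names here are hypothetical and exist only for illustration.
    """
    previous: Dict[str, str] = {}  # zone -> version before this rollout
    for zone in zones:
        previous[zone] = current[zone]
        current[zone] = new_version
        if not is_healthy(zone, new_version):
            # Restore every zone touched so far, including the failing one.
            for touched, old_version in previous.items():
                current[touched] = old_version
            return False
    return True


fleet = {"zone-a": "v1", "zone-b": "v1", "zone-c": "v1"}
# Hypothetical health check: pretend v2 misbehaves in zone-b only.
ok = staged_rollout(list(fleet), fleet, "v2", lambda zone, version: zone != "zone-b")
print(ok, fleet)
```

In this run the second zone fails its check, so the function returns `False` and every zone is back on `v1`; the third zone is never touched. Upgrading one zone at a time limits how far a bad change can spread before it is caught.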


Mathias Schwarz

Software Engineer @Uber
Mathias Schwarz has been an infrastructure engineer at Uber for more than 5 years. He and his team are responsible for the deployment platform for stateless services used across all of Uber engineering. Mathias has a PhD in Computer Science from the Programming Languages group at Aarhus University.

Thursday Nov 5 / 12:50PM EST (40 minutes)

TRACK Paths to Production: Deployment Pipelines as a Competitive Advantage
