You are viewing content from a past/completed QCon Plus - November 2020

Session

Safe and Fast Deploys at Planet Scale

Can you write code, review, test, verify, and ship it safely to millions of users, all in the same day? Absolutely!

Every week thousands of Uber engineers push out several thousand changes to millions of users. This means that during working hours, some part of the Uber system starts upgrading every single minute and that the system never runs one single version across the host fleet.

In this talk, we will explore the lessons we learned as we scaled from a small engineering team using a single datacenter to thousands of engineers that continuously deploy changes across multiple cloud platforms, with a focus on maintaining fast and reliable delivery of software changes.

Main Takeaways

1 Hear about Uber’s journey from its on-prem system to a new hybrid, multi-cloud system that aims to automate many deployment tasks.

2 Learn about software management, scalability and the need to have these done automatically by software.


What is the work that you're doing nowadays?

I'm the tech lead for Uber's stateless service platform. Essentially, I'm leading the project that builds our service platform, which is how you deploy any type of stateless software into Uber's data centers. We're currently replacing our old system with a new, automated multi-cloud system that handles software deploys and scaling. Right now we are in the last phases of making it production-ready, and we are onboarding all of the workloads onto it. I also built much of the old platform, so I'm quite experienced in this area.

What were the things in the old system that you hadn't automated and are automating now in the new system?

It was a result of us growing. We had already automated certain things in the old system: we automated rollouts, and we already had safe rollouts across zones. What we are doing now is also automating placement. Uber uses a hybrid cloud model, where we own some zones as on-prem capacity and have some cloud capacity at Amazon and Google Cloud. One of the things we're automating is placement and the moving of services between cloud and on-prem, as well as between clouds. We are also applying auto scaling at a large scale, which we haven't had previously.

Previously you had no auto scaling or was it just not as 'auto' as you wanted it to be?

We had some auto scaling, but it wasn't fully automated. We were previously mostly on-prem, and we're now moving to a hybrid cloud model. Auto scaling becomes even more important when you are in the cloud, because you can scale your capacity more easily there than you can on-prem.
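To make the idea of automated scaling concrete, here is a minimal sketch of a target-utilization scaling rule. It is not a description of Uber's system: the function name, parameters, and the proportional rule itself are assumptions chosen for illustration (the same proportional rule is used by common autoscalers such as the Kubernetes Horizontal Pod Autoscaler).

```python
import math


def desired_instances(
    current: int,
    observed_util_pct: float,
    target_util_pct: float,
    min_instances: int,
    max_instances: int,
) -> int:
    """Return the instance count that would bring utilization back to target.

    Hypothetical helper: scales the current count in proportion to how far
    observed utilization is from the target, then clamps to the allowed range.
    """
    if current <= 0 or target_util_pct <= 0:
        raise ValueError("current and target_util_pct must be positive")
    # Proportional rule: running at twice the target needs twice the instances.
    raw = math.ceil(current * observed_util_pct / target_util_pct)
    # Never scale below the floor or above the capacity ceiling.
    return max(min_instances, min(max_instances, raw))
```

For example, 10 instances observed at 90% utilization with a 60% target would be scaled to 15. The ceiling parameter matters most on-prem, where capacity is fixed, while in the cloud it can be set much higher.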

What are the goals for the talk and what do you want people to take away from the talk?

One of the goals is to take the audience on the journey from being a small company, which Uber was when I started, where everything could be managed manually. I'd like to take people through the stages of how and what you can automate as the company grows and you want to keep the system lean and keep releasing at a high velocity every day. We've been able to keep that cadence even as the number of engineers has increased and as the complexity of our systems has increased with the move to this multi-cloud model that we're deploying to. I'd also like to show people how you can deploy fast and independently in a safe way, and show all the safety mechanisms that go into making these fast deploys work at scale for thousands of engineers. There are some descriptions of that in the blog post that was the origin for this talk, but we've done a bit more work since then.

When you say "safe" deploys, what do you mean specifically?

Safe means that you can have confidence that even if you've made a bad change to the code, you can still trust the system to roll back the change, fail over, or somehow mitigate the outage for you before bad things happen. Safety is basically what enables you to move fast in a big system. You should be able to trust the system to save you when you do something that isn't correct.
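As an illustration of that kind of safety net, the following is a minimal sketch of a zone-by-zone rollout that rolls back automatically when a health check fails. It is an assumption-laden toy, not Uber's implementation: `staged_rollout`, the `is_healthy` callback, and the zone names are all hypothetical.

```python
from typing import Callable, Dict, List


def staged_rollout(
    zones: List[str],
    current: Dict[str, str],
    new_version: str,
    is_healthy: Callable[[str, str], bool],
) -> bool:
    """Upgrade zones one at a time; roll back on the first failed check.

    `current` maps zone name to deployed version and is updated in place.
    Returns True only if every zone ends up on `new_version`.
    """
    previous = dict(current)  # remember what to restore on failure
    upgraded: List[str] = []
    for zone in zones:
        current[zone] = new_version
        upgraded.append(zone)
        if not is_healthy(zone, new_version):
            # Restore every zone touched so far, most recent first.
            for touched in reversed(upgraded):
                current[touched] = previous[touched]
            return False
    return True
```

With a fleet of three zones on `v1` and a health check that fails in the second zone, the function returns `False` and leaves every zone on `v1`, so a bad change only ever reaches part of the fleet before it is reverted.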


Speaker

Mathias Schwarz

Software Engineer @Uber

Mathias Schwarz has been an infrastructure engineer at Uber for more than 5 years. He and his team are responsible for the deployment platform for stateless services used across all of Uber engineering. Mathias has a PhD in Computer Science from the Programming Languages group at Aarhus University.


