You are viewing content from a past/completed QCon Plus - November 2021

Session

Optimizing Efficiency & Capacity Management at Web Scale on the Cloud

Managing capacity demands while maintaining efficiency for a web-scale workload running on a public cloud is a challenging task. In this talk, Molly will share insight into how Pinterest optimizes its use of the cloud while balancing demands across the key domains of security, availability, rate of innovation, and infrastructure efficiency.

Possible dimensions to be covered:

  • Ensuring sufficient capacity for both critical and non-critical workloads while working with a fungible and multi-tenant cloud capacity pool
  • Improving efficiency via financial instruments such as long-term reserved leases and on-demand reservations
  • Driving accountability with engineering teams through attributing cloud costs

After this session, you will be armed with key strategies to make your scaling exercise on the cloud more successful and to avoid key hiccups along the way.

Main Takeaways

1 Hear how Pinterest manages cloud capacity and efficiency.

2 Learn about strategies to manage your company’s cloud resources and infrastructure spend as you scale.


What is the work you're doing today?

I am the Technical Program Manager for what we call the Infrastructure Governance Program here at Pinterest, which is a cross-functional program responsible for managing cloud resources at Pinterest. As part of that, I manage our internal capacity request process and our infrastructure budgets; I manage the relationship with Pinterest’s cloud provider, which is Amazon Web Services; and I oversee the work stream to manage our infrastructure cost and usage data, as well as our efficiency operations.

What are your goals for this talk?

I'm hoping that people will leave with an understanding that operating efficiently and effectively on the cloud is a hard problem to solve, no matter what size or scale a company is operating at. I want to share some tools and strategies that others can leverage to help them be more effective as they scale.

When you say tools, do you mean software tools or management tools or ways to think about efficiency in your products?

Primarily management tools.

Can you give us a preview of what we can look forward to?

For example, the cloud principle of infinite scalability (that you can spin up as many compute instances as you need at any time) begins to break down once you get to a certain size. In addition to staying efficient and keeping our cloud spend within targets, we have had to ensure we have the capacity we need to operate our workflows and maintain availability. One of the strategies we had to learn is how to protect capacity for critical workflows while at the same time keeping costs manageable.

When you say protect, what do you mean by that?

We have multiple tools at our disposal. For example, we use an AWS capability called On-Demand Capacity Reservations, or ODCRs, to guarantee the minimum cloud capacity required to run our most critical services. We also have a close relationship with our cloud provider team. One of the most important tools for us is simply good bidirectional information sharing with AWS. For example, we meet with AWS regularly and let them know what major changes we expect within our EC2 fleet, which means that internally we need to keep a pulse on what migrations our service teams are planning. If AWS anticipates that it may not immediately have that capacity available, we can coordinate on a plan and timing to achieve that capacity and proceed with our migrations.

What sort of challenges do you have at Pinterest? Do you have seasonality, or big releases where you know there is going to be an increase?

There is some seasonality to our workloads. For example, the holiday season is one of the most critical times of the year for us, when a lot of people use Pinterest to seek inspiration for holiday gifts or Halloween costumes, so we do see a big bump in our traffic. We also have a very complex stack. At Pinterest, we have opted to use the cloud primarily for IaaS, or Infrastructure as a Service, such as file storage and compute processing. Our teams support workloads ranging from our internal CI/CD pipelines, to machine learning workloads powering personalization on the platform, to the services supporting online serving, just to name a few. Making sure that we can keep complexity manageable internally while at the same time having top-notch availability is always a challenge.


Speaker

Molly Junck

Technical Program Manager & Infra Governance and Cloud Vendor Management @Pinterest

Molly Junck is a Technical Program Manager, leading the Infrastructure Governance Program at Pinterest. Molly is responsible for supporting Pinterest’s capacity management, cloud infrastructure cost and usage data, as well as managing the relationship with Pinterest’s cloud provider....


Date

Wednesday Nov 3 / 12:10PM EDT (40 minutes)

Track

The Cloud Operating Model

Topics

Cloud Computing, DevOps, Infrastructure

From the same track

Session Cloud Native

Netflix Drive: Building a Cloud Native Filesystem for Media Assets

Wednesday Nov 3 / 11:10AM EDT

Netflix Studios produces hundreds to thousands of movies, shows, trailers, and other forms of media content each year which amount to hundreds of petabytes of storage and billions of media assets. These assets are created, edited, managed, encoded, and rendered by artists working on a multitude...

Tejas Chopra

Senior Software Engineer Data Storage Platform team @Netflix

Session Cloud Computing

K8s: Rampant Pragmatism in the Cloud at Starling Bank

Wednesday Nov 3 / 01:10PM EDT

Starling Bank’s back end is made up of 40 - 50 individual services. These were all being deployed to the cloud as a monolith, an approach that was slowing us down. We migrated our delivery pipelines to deliver these services in six separate groups, and are now starting to use Kube. But why...

Jason Maude

Lead Engineer @StarlingBank

PANEL DISCUSSION Cloud Computing

Panel: Kubernetes at Web Scale on the Cloud

Wednesday Nov 3 / 02:10PM EDT

Although many architectural and design similarities exist between large scale Kubernetes footprints whether on-prem or in the cloud, there are also key differences. Gaining insight on these will help ensure your k8s scaling exercise in the cloud can be accelerated through application of best...

Harry Zhang

Tech Lead of the Cloud Runtime Team @Pinterest

Ramya Krishnan

Staff Site Reliability Engineer @Airbnb

Ashley Kasim

Tech Lead of the Compute team @Lyft
