Robust Foundation for Data Pipelines at Scale - Lessons From Netflix

What You'll Learn

1. Hear about Netflix's orchestration platform and how they built and operate it.

2. Learn about the challenges they encountered and the lessons they learned managing hundreds of thousands of pipelines over the years.

3. Find out some of the best practices for workflow lifecycle management.

At Netflix, data/ML pipelines are widely used and have become central to the business. The use cases are diverse, going well beyond recommendations, predictions, and data transformations. As big data and ML grow in presence and impact, the scalability and stability of the ecosystem have become increasingly important to our data scientists and to the company.

Over the past years, we have developed a robust foundation composed of multiple cloud services and libraries that provides users a consistent way to define, execute, and monitor units of work. In the big data and ML space, this foundation is responsible for reliably executing a large number of data/ML workflows containing tens of thousands of parallel jobs, in addition to handling event-driven triggers and conditional branches.

In this talk, we will share our experiences building and operating the orchestration platform for Netflix's big data ecosystem. We will discuss the challenges we faced managing hundreds of thousands of pipelines and the lessons we learned automating them over the years, such as fair resource allocation, scaling problems, and security concerns. We will also share best practices for workflow lifecycle management and our design philosophy for workflow automation, including patterns we developed and approaches we took.

What is the work that you are doing today?

Jun: I work in the Big Data Orchestration team at Netflix. Our team owns multiple orchestration services. My work focuses on designing and building the Netflix workflow orchestrator, a robust and scalable platform that provides workflow as a service, widely used by thousands of Netflix internal users. At Netflix's scale, one of my main tasks is to develop the scheduler so that it not only supports a wide variety of use cases but also scales up and out to automate hundreds of thousands of data pipelines.

Harrington: I work in the Data Platform Orchestration team at Netflix. My work focuses on building the next generation of scheduler tools. This means building all the orchestration layers our users need to execute jobs and schedule DAGs. In addition, I work on building the event-driven platform for the data platform. We want our platform to be smarter, so we have been adopting and leveraging events to react to anything that happens in the platform. This has allowed us to orchestrate executions based on events instead of simply relying on time-based mechanisms. For example, instead of saying, "I want to run this workflow at midnight because a specific metric I need is likely going to be ready around that time", we allow our users to say, "I want this workflow to run when the metrics I need are available". This makes the platform more efficient because instead of running when you think you have to, you run when you actually have to.
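The contrast between time-based and event-driven triggering can be sketched as follows. This is a minimal illustrative model, not Netflix's actual implementation; the class names, event strings, and the in-memory event bus are all hypothetical stand-ins for a real event stream.

```python
# Hypothetical sketch of event-driven triggering: instead of firing on a
# cron schedule, a workflow run starts only once every piece of data it
# depends on has been signaled as ready. All names here are illustrative.

class EventBus:
    """Minimal in-memory pub/sub standing in for a real event stream."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        for handler in self.subscribers:
            handler(event)


class EventDrivenTrigger:
    """Launches a workflow once all required upstream signals arrive."""
    def __init__(self, workflow_id, required_signals, launcher):
        self.workflow_id = workflow_id
        self.pending = set(required_signals)  # signals not yet seen
        self.launcher = launcher

    def on_event(self, event):
        self.pending.discard(event)
        if not self.pending:               # all inputs are available
            self.launcher(self.workflow_id)


launched = []
bus = EventBus()
trigger = EventDrivenTrigger(
    workflow_id="daily_metrics_rollup",
    required_signals={"table.playback_events/2021-05-27",
                      "table.user_profiles/2021-05-27"},
    launcher=launched.append,
)
bus.subscribe(trigger.on_event)

# The workflow runs only after BOTH upstream partitions land,
# not at a fixed wall-clock time.
bus.publish("table.playback_events/2021-05-27")
assert launched == []  # still waiting on one input
bus.publish("table.user_profiles/2021-05-27")
assert launched == ["daily_metrics_rollup"]
```

The key design point is that readiness, not the clock, gates execution: a late upstream table simply delays the dependent workflow rather than causing it to run against incomplete data.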

What are the goals for the talk?

Jun: In this talk, we want to share our experiences building this robust platform, along with the best practices and lessons we have learned operating it over the past several years. For each of the components, we will talk about its design and the technical decisions we made to better serve our users. After the talk, the audience will understand the tradeoffs involved in building and managing a large data pipeline platform and can apply some of these principles and best practices to their own work where they fit.

Harrington: We've been working on this for quite a while, several years or so. Over this time, we have learned that separating everything into components or layers has let us build a solid platform. This has worked out quite well for us. Whenever we have had to make changes or evolve our tools, we have been able to do so with zero to minimal impact on our users and the applications built on top of us.

In this talk, I hope to clearly communicate how the separation of concerns we adopted in our implementation has helped us build the platform. I also want people to understand what we have learned, as well as the benefits of each of the main components. Finally, I would like people to walk away understanding how the platform has been able to evolve and keep up with scale.


Jun He

Sr. Software Engineer in the Big Data Orchestration Team @Netflix
Jun He is a Sr. Software Engineer in the Big Data Orchestration team at Netflix, where he is responsible for building the big data workflow scheduler to manage and automate ML and data pipelines at Netflix. Prior to Netflix, he spent a few years building distributed services and search...


Harrington Joseph

Sr. Software Engineer @Netflix Data Platform Orchestration Team
Harrington Joseph is a Sr. Software Engineer on the Netflix Data Platform Orchestration team. His work is focused on data orchestration and high-throughput event-driven architectures. Currently, Harrington is actively working on building the next generation of scheduling tools for Netflix Data...

Thursday May 27 / 10:10AM EDT (40 minutes)

TRACK: Modern Data Pipelines and Data Mesh
TOPICS: Architecture, Data Pipeline, Machine Learning, Database
