Fabricator: End-to-End Declarative Feature Engineering Platform

At DoorDash, the last year has seen a surge in applications of machine learning across the product verticals of our growing business. With this growth, however, our data scientists have faced increasing bottlenecks in their development cycle because of our existing feature engineering process. At a daily volume of over 500 unique features and 10B feature values, each component of the feature engineering process, from feature generation and online materialization to offline serving and lifecycle management, was becoming operationally intensive and slow to iterate on.

To overcome these challenges, we designed Fabricator, an end-to-end, declarative, centralized feature engineering platform. The framework uses simple, high-level YAML definitions to automate feature pipeline orchestration with Dagster, run scalable pipeline executions with Spark on Databricks, and simplify feature store materialization and management via Redis. Additionally, the entire framework is continuously deployed, bringing iteration times down to just a few minutes.
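
As a rough illustration of the declarative idea, the sketch below derives a pipeline execution order and a materialization list from definitions alone. It is not Fabricator's actual schema or code: the dictionary stands in for parsed YAML, and every field and feature name in it (`depends_on`, `materialize`, `store_order_count_1d`, and so on) is hypothetical. The standard-library `graphlib` module stands in for a real orchestrator such as Dagster.

```python
from graphlib import TopologicalSorter

# Hypothetical declarative definitions, written as a Python dict that stands
# in for parsed YAML. All field names are illustrative assumptions and are
# not taken from Fabricator.
SOURCES = {
    "store_daily_orders": {
        "compute": "spark",
        "depends_on": [],
        "features": ["store_order_count_1d"],
        "materialize": {"online_store": "redis"},
    },
    "store_weekly_orders": {
        "compute": "spark",
        "depends_on": ["store_daily_orders"],
        "features": ["store_order_count_7d"],
        "materialize": {"online_store": "redis"},
    },
}


def plan_pipeline(sources):
    """Return source names in an order that runs dependencies first."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(spec["depends_on"]) for name, spec in sources.items()}
    return list(TopologicalSorter(graph).static_order())


def features_to_materialize(sources, store):
    """Collect every feature whose definition targets the given online store."""
    selected = []
    for name in plan_pipeline(sources):
        spec = sources[name]
        if spec.get("materialize", {}).get("online_store") == store:
            selected.extend(spec["features"])
    return selected


if __name__ == "__main__":
    print(plan_pipeline(SOURCES))
    print(features_to_materialize(SOURCES, "redis"))
```

The point of the sketch is that the author of a feature only edits the definition. Ordering and materialization targets are computed by the framework, which is what makes it possible to automate orchestration from the definitions.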

In this session, we’d like to present how our Machine Learning Platform team designed Fabricator by integrating various open source and enterprise solutions to deliver a declarative end-to-end feature engineering framework, and look at the wins this enabled. Finally, we take a closer look at key optimizations and learnings, and discuss plans for extending the framework to hybrid real-time and batch architectures.


Speaker

Kunal Shah

ML Platform Engineering Manager @DoorDash

Kunal Shah is an ML Platform Engineering Manager at DoorDash, focusing on building a feature engineering platform. Over the last year he has launched declarative frameworks for both batch and real-time feature development, accelerating the development lifecycle by over 2x. Previously, he worked on ML platforms and data engineering frameworks at Airbnb and YouTube. He completed his undergraduate degree in Computer Science at IIT Bombay and holds a Master's in Data Science from UC Berkeley.


Date

Wednesday Dec 7 / 10:10AM PST ( 50 minutes )

Track

MLOps

Topics

Machine Learning YAML Pipelines Batch Architectures Architecture


From the same track

Session Machine Learning

Ray: The Next Generation Compute Runtime for ML Applications

Wednesday Dec 7 / 09:00AM PST

Ray is an open source project that makes it simple to scale any compute-intensive Python workload. Industry leaders like Uber, Shopify, and Spotify are building their next-generation ML platforms on top of Ray.

Zhe Zhang

Head of Open Source Engineering @anyscalecompute

Session Machine Learning

An Open Source Infrastructure for PyTorch

Wednesday Dec 7 / 11:20AM PST

In this talk we’ll go over tools and techniques to deploy PyTorch in production. The PyTorch organization maintains and supports open source tools for efficient inference like pytorch/serve, job management like pytorch/torchx, and streaming datasets like pytorch/data.

Mark Saroufim

Applied AI Engineer @Meta

Session Machine Learning

Real-Time Machine Learning: Architecture and Challenges

Wednesday Dec 7 / 12:30PM PST

Fresh data beats stale data for machine learning applications. This talk discusses the value of fresh data as well as different types of architecture and challenges of online prediction.  

Chip Huyen

Co-founder @Claypot AI

Session Machine Learning

Declarative Machine Learning: A Flexible, Modular and Scalable Approach for Building Production ML Models

Wednesday Dec 7 / 01:40PM PST

Building ML solutions from scratch is challenging for a variety of reasons: the long development cycles of writing low-level machine learning code and the fast pace of state-of-the-art ML methods, to name a few.

Shreya Rajpal

Founding Engineer @Predibase