Session

Co-Designing Raft + Thread-per-Core Execution Model for the Kafka-API

What You'll Learn

1Hear about building a storage system with Raft, with no virtual memory, no page cache for small predictable latencies.

2Find out the importance of writing applications for modern hardware to benefit from its performance.


Sometimes you get to reinvent the wheel when the road changes. Redpanda is a drop-in replacement for Apache Kafka®, designed from the ground up for modern hardware. Hardware looks nothing like it did 10 years ago. NVMe disks are 1000X faster than spinning disks. Cloud computers offer 30X more cores than they did a decade ago with NICs offering 100x higher throughput.  

In this talk we will talk over the lessons learned building a new storage engine from scratch with no virtual memory, no page cache, with purpose-built read-ahead and write-behind strategies for predictable latencies, co-designed with Raft - our data replication model.

What is the work that you're doing today?

For the last two and a half years, I really have been focused on trying to extract as much performance out of the hardware as possible. To summarize, the hardware is fundamentally different to what it was a decade ago. I think very few software projects have been written with modern hardware in mind. And by that, I mean true multicore systems. On Google, for example, you get to rent 225 cores and with NVMe SSDs the bottleneck for at least for the systems that I work on, which are storage systems, has moved out of the IO path, which 10 years ago was in the millisecond space, now it's in the microsecond space. When you look at it, you have to redesign software from scratch to take advantage of modern hardware. The way I like to frame it is that sometimes you get to reinvent the wheel when the road changes significantly. My day-to-day work is we're building a new storage engine to take advantage of modern hardware and it has a Kafka API. And so for any application that was written against the Kafka API, any streaming application whether it is Spark streaming or TensorFlow or your own application that uses producers and consumers from the Kafka API, it should just work except the storage engine was designed from the ground up to take advantage of modern hardware. The majority of my days are spent training and mentoring some younger engineers in my team and doing a ton of code reviews. But the gist of what I work on is really this new storage engine for modern hardware. 

What are your goals for the talk?

I think there are two major things that I want people to walk away from my talk, and one is to realize just how different it is to write software that is designed for the platform. To me, the platform is always the hardware. Software does not run on category theory. It runs on these super scalar CPUs with gigabit per second memory banks and the same thing with NVMe. And you have to think of latencies in the microseconds. It is actually a totally different platform than it was a decade ago, especially for people building systems software. That's a really big takeaway. The techniques used in software, it will work just simply because hardware is so fantastic, but you're leaving a lot on the table. One of the impacts that I will highlight in the talk is that you can get 10-100x tail latency improvements if you design for modern hardware. So it takes new techniques for software construction for modern hardware. That's the mechanical level for the software engineer and the software architect.

And then for folks that are used in storage systems, I think there was this popular belief that you have to either choose performance or data safety. And by that, I mean no data loss with any storage subsystem. That was a common belief that you can't get both low latency, high throughput, and safety. You had to pick one. You had to either pick speed or pick safety. And what we've shown is that if you design your software for the platform and the platform is always the hardware, then you no longer have to choose between safety and performance. Embracing the platform allows us to have better guarantees to the end user of our software. Those are the main two takeaways, and I think the latter is much more subtle and it's really hard to describe without being too marketing and showing a bunch of performance benchmarks. In the talk, I go pretty deep into techniques that have been used in software for a long time, especially in high-frequency trading systems or in mission-critical systems, where one technique, for example, is virtual memory, and we preallocate the whole memory of the machine at bootstrapping. Without giving my full talk away, there are a set of eight or nine critical techniques to extract the performance of hardware, and then that starts to change what you think is possible now. What you think are the defaults that you get to expose to your users, and so those I would say it would be the main two things that people should walk away from the talk. 

Picking up on the virtual memory aspect, do you completely turn off virtual memory or do you pre-commit all pages?

What we do is provide a wrapper on top of the allocation library in C++. The code that I've worked on for a really long time in C++ systems and low latency space as well, or embedded space, what we do is when the program starts, it allocates one hundred percent of whatever memory you allow it to allocate. Typically by default, it's somewhere around 90, 90 something percent. And we leave some for the operating system task effectively. But from a user perspective, it allocates the whole machine's worth of memory, and then it splits the memory across all of the course evenly. So if you have 10 cores and you allocate 20 gigabytes of memory, every core will have 2GB. Here's the trick. Once it is allocated, memory allocations are only core-local, and so a memory allocation is extremely low latency because the effect is simply bumping a pointer. When you make a memory allocation the core-local  allocator will simply give you memory. There's a big tradeoff there and I go somewhat into a deep dive in my talk. But what happens is when you don't have virtual memory, all sorts of problems that the kernel handles for the classical way of building applications, surface to your application, you know, which is how do you handle large buffers? How do you handle memory fragmentation? How do you start to build software that can cross memory boundaries, but the allocation has to be core local. So you have to build a little bit of primitives and it is more work today. We use a framework, but even that is too low level. I think we need higher software primitives, but in effect, you have to rebuild a lot of the primitives that the Linux kernel gives you on top of it. 

And for us, it is a worthwhile trade-off because we're building a very specific application, which is we're building a drop-in replacement for Kafka. That is the single binary. And so we understand the performance characteristics all the way to the application level. That is when a customer connects with their Kafka application, we understand the entire semantics and lifecycle of the user, the readhead mechanics, the prefetching, getting the right strategies for writing to disk. There's a complete understanding of the protocol. And so for us, using the Linux kernel facilities often introduces non determinism in the IO path. If you are aiming to extract a low latency and high throughput out of the existing hardware, you need to control every aspect of it. It's really a little bit of a controversial subject. Why not use virtual memory? Why not use the Linux kernel page cache, etcetera. Hopefully, that gives you a hint as to the problems that we're trying to solve. 


SPEAKER

Alex Gallego

Founder and CEO @VectorizedIO
Alex Gallego is the founder and CEO of Vectorized, where he & the team hack on Redpanda, a modern streaming platform for mission critical workloads. Prior to Vectorized, he was a principal engineer at Akamai, as well as co-founder and CTO of Concord.io, a high performance stream processing... Read more Find Alex Gallego at:
DATE

Wednesday May 19 / 10:10AM EDT (40 minutes)

TRACK Modern CS in the Real World TOPICS Applied Computer ScienceKafkaStreaming DataCloud ComputingDevopsDatabaseBig Data ADD TO CALENDAR Calendar IconAdd to calendar
SHARE
TOP