Systems that operate non-stop, 24/7, are standard in many consumer-facing industries. Often, but definitely not always, these systems have neither the aggressive SLAs nor the high-availability requirements that some financial systems demand. But that is changing. In this session, we will discuss lessons learned in designing systems, especially those based on replicated state machines, that must keep operating while things go wrong and components are upgraded piecemeal, all while still meeting their SLAs.
Speaker
Todd Montgomery
Ex-NASA Researcher and High Performance Distributed Systems Whisperer
Todd Montgomery is a networking hacker who has researched, designed, and built numerous protocols, messaging-oriented middleware systems, and real-time data systems, done research for NASA, contributed to the IETF and IEEE, and co-founded two startups. He currently works as an Engineering Fellow at Adaptive Financial Consulting and is active in several open source projects, including Agrona, Aeron, ReactiveSocket, and the FIX Simple Binary Encoding (SBE).