You are viewing content from a past/completed QCon - November 2020

Solving Mysteries Faster with Observability

What You'll Learn

1. Hear about Netflix Edgar, a distributed troubleshooting platform.

2. Learn about the importance of observability and what Netflix does about it.

Everyone loves a good mystery, but not when it involves operating our services. Investigating production issues in a microservice architecture can make you feel like a detective, combing through evidence and gathering clues to reconstruct the scene of a crime, all while the clock is ticking. You hop from log store to dashboard, digging for details as you strive to unravel what really happened. All this time spent investigating is expensive, for engineers as well as customers -- and even then, finding an issue is not the same as resolving it!  

Edgar is a tool used and built by Netflix engineers to quickly investigate and solve production issues. Edgar starts with distributed tracing, which shows a request’s path through a complex system. But the request’s path is only a small part of the data available about a request. Dozens of dashboards hold their own insights on what happened, and it takes time for engineers to jump between individual dashboards. Edgar strives to get all this data in one place, supplementing traces with additional context like log correlation, metadata about services, and intelligent analysis. Not only does this help Netflix engineers investigate more efficiently, it empowers our customer service operations to access the same information. Our engineers and customer service operations rely on Edgar so they can quickly get our members back to enjoying their favorite movies and shows. 

You’ll leave this talk with an understanding of how we enhance distributed tracing with additional context like logs and metadata to resolve issues. You’ll see examples of how Edgar has paid off and hear about challenges we faced. Finally, we’ll inspire you to leverage the data you already have to help you and your team solve mysteries faster.

What is the work you're doing today?

I work on Edgar, which is a distributed troubleshooting platform. Edgar's main vision is helping engineers and all our users at Netflix solve problems faster. We're trying to utilize all the data sources we have to get insights out of that data in our UI so that users can dig less on their own. A lot of time troubleshooting means a lot of manual work for users. We're trying to aggregate that manual investigation into one place to speed up resolution.

With Edgar you're taking distributed tracing data and you're pairing it up with logs and other tools.

Exactly, yes. We use the distributed trace ID as a breadcrumb that we can search for in other sources. You're tagging your logs with your distributed trace ID and then you can also tag your logs with metadata. Let's say you're trying to watch a movie on Netflix. There's an ID for that movie or the content. In Edgar, we're able to take that trace ID and then go fetch the logs that are associated with it. And then also say, OK, you were trying to watch this piece of content. And here's what we know about that piece of content. In the long term, we'd love to have metrics and trends in there as well. But in the meantime, we aggregate traces, logs, and metadata by using that trace ID as the token that we can search for.
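As a rough illustration of the correlation described above, the sketch below joins log entries and content metadata on a shared trace ID. It is not Edgar's implementation: the field names (`trace_id`, `content_id`), the sample records, and the helper functions are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical sample data; every field name and value here is an assumption.
LOGS = [
    {"trace_id": "t-1", "service": "playback-api", "message": "play requested", "content_id": "c-42"},
    {"trace_id": "t-1", "service": "license", "message": "license granted", "content_id": "c-42"},
    {"trace_id": "t-2", "service": "playback-api", "message": "play requested", "content_id": "c-7"},
]
CONTENT_METADATA = {
    "c-42": {"title": "Example Show"},
    "c-7": {"title": "Another Show"},
}


def index_logs_by_trace(logs):
    """Group log entries under the trace ID they were tagged with."""
    index = defaultdict(list)
    for entry in logs:
        index[entry["trace_id"]].append(entry)
    return index


def build_trace_view(trace_id, log_index, metadata):
    """Collect the logs for one trace plus metadata for any content they mention."""
    logs = log_index.get(trace_id, [])
    content_ids = sorted({e["content_id"] for e in logs if "content_id" in e})
    return {
        "trace_id": trace_id,
        "logs": logs,
        "content": {cid: metadata.get(cid) for cid in content_ids},
    }


view = build_trace_view("t-1", index_logs_by_trace(LOGS), CONTENT_METADATA)
print(view["content"])
```

The only requirement this places on services is that every log line carries the trace ID, which is what makes the lookup a simple key match.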

The trace ID is per request. Do you also sort of correlate data for, e.g., a session?

That's a great question. The trace ID is per request. But yeah, we do have an umbrella over that, which is a session ID. You can look at all the steps within a session. At Netflix, somebody is going to press play and the content is going to start streaming. And then at some point they'll press pause or finish a show and the content will stop. And so in that umbrella of a session, there are certain events that need to happen. And so we're able to look for these things within a session to know what went wrong or what didn't happen or did happen inside of that broader context.
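One way to picture the session umbrella is to group per-request events by a session ID and order them in time. The sketch below does that under assumed names: the `session_id`, `trace_id`, `ts`, and `event` fields and the sample events are hypothetical, not Netflix's schema.

```python
from collections import defaultdict

# Hypothetical events; each one belongs to a single request (trace)
# and carries the ID of the wider session it happened in.
EVENTS = [
    {"session_id": "s-1", "trace_id": "t-2", "ts": 20, "event": "pause"},
    {"session_id": "s-1", "trace_id": "t-1", "ts": 10, "event": "play"},
    {"session_id": "s-2", "trace_id": "t-3", "ts": 15, "event": "play"},
]


def group_by_session(events):
    """Return {session_id: [events sorted by timestamp]}."""
    sessions = defaultdict(list)
    for e in events:
        sessions[e["session_id"]].append(e)
    for timeline in sessions.values():
        timeline.sort(key=lambda e: e["ts"])
    return dict(sessions)


sessions = group_by_session(EVENTS)
for e in sessions["s-1"]:
    print(e["ts"], e["trace_id"], e["event"])
```

Several traces can then be read as one timeline, which is the broader context the answer refers to.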

So you know what needs to happen? You can search for that in Edgar's database and the traces.

Exactly. In order for content to play successfully, there is a set of steps that need to happen. We need to get a license for the content and check if we can play that content in your area, for example. We need to figure out what resolution to provide based on your preferences and what your device supports and all of that. There's this checklist of steps that have associated logs. We’re able to look for the license acquisition, for example, and show if something went wrong in that phase. With Edgar, we want to help our users solve problems faster by correlating those pieces programmatically.
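The checklist idea can be sketched as a comparison between the steps a session is expected to contain and the step logs actually found. The step names, the `step` and `status` fields, and the status values below are assumptions chosen for illustration; the real list of playback steps is not given in this interview.

```python
# Assumed step names, loosely based on the examples mentioned above.
REQUIRED_STEPS = ("license_acquisition", "region_check", "resolution_selection")


def check_session(step_logs, required=REQUIRED_STEPS):
    """Map each required step to its last logged status, or 'missing'."""
    latest = {}
    for entry in step_logs:
        latest[entry["step"]] = entry["status"]
    return {step: latest.get(step, "missing") for step in required}


def first_problem(report, required=REQUIRED_STEPS):
    """Return the first required step that is not 'ok', or None."""
    for step in required:
        if report[step] != "ok":
            return step
    return None


# Hypothetical step logs for one session.
session_logs = [
    {"step": "license_acquisition", "status": "failed"},
    {"step": "region_check", "status": "ok"},
]
report = check_session(session_logs)
print(report)
print("first problem:", first_problem(report))
```

Treating an absent log as its own status matters here, because a step that never ran is often as informative as one that failed.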

In solving problems programmatically, are you also analyzing the logs with Jupyter or analysis tools?

We don't use Jupyter notebooks in Edgar. We have an anomaly detection tool, Telltale, that analyzes trace data.

Is Edgar a tool that's been coming along at Netflix over the past few years?

Yes. The team started working on it in late 2016. It's grown in scope since we started. Initially, we were only focused on the playback experience and streaming video. We started with this targeted use case, then we found it was really useful for engineers outside of streaming video. Then we found that Edgar was useful to more personas than engineers, and we were able to expand more. From my perspective, looking at the observability space and the trends in it, I feel we're seeing more people talk about getting the three pillars of observability under the same roof, and we thought it was timely to talk about Edgar more broadly.

What are your goals for the talk? What are the takeaways?

I hope that people leave excited about observability and see the potential of correlating logs and traces. I have some concrete takeaways. I hope that users start tagging their logs with their trace IDs, for example, which would enable them to do some of the footwork that Edgar does, but on their own. I also hope that people leave with the thought that here are some things that Netflix experimented with that we could experiment with, too. Because that's really how Edgar started at Netflix. It was an opinionated experiment. I hope attendees are interested in what we experimented with and perhaps try something similar in their own organizations.
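For readers who want to try tagging logs with trace IDs, here is a minimal sketch using Python's standard `logging` and `contextvars` modules. The logger name, the log format, and the way the trace ID is set are assumptions; in a real service the ID would usually come from the tracing library or an incoming request header.

```python
import contextvars
import io
import logging

# Holds the trace ID for the current request; "-" means no trace is active.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def make_logger(stream):
    """Build a logger whose output lines start with the trace ID."""
    logger = logging.getLogger("playback-example")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("trace_id=%(trace_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
logger = make_logger(stream)
trace_id_var.set("abc123")  # assumed to be supplied by the tracing layer
logger.info("license acquired")
print(stream.getvalue(), end="")
```

Once every line carries the same ID as the trace, a plain text search across log stores is enough to do a basic version of the correlation.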


Elizabeth Carretto

Senior Software Engineer @Netflix
Elizabeth Carretto is a Senior Software Engineer at Netflix in Productivity Engineering, where she builds UIs for the observability space. Her work focuses on delivering value from observability data to service operators through products like Edgar, a troubleshooting tool built on top of...

Tuesday Nov 10 / 02:40PM EST (40 minutes)

TRACK Operating Microservices
