Building Latency-Sensitive, User-Facing Analytics with Apache Pinot

What You'll Learn

1. Find out how LinkedIn, Uber, and other companies achieve low-latency analytical database queries despite high throughput.

2. Learn about the internals of Pinot that make low latency possible, and the principles Pinot is built on.

Real-time analytics has become essential for modern Internet companies. The ability to derive internal insights around business metrics, user growth and adoption, and security incidents from raw logs is crucial for day-to-day operation. Even more critical, and non-trivial to achieve, is enabling access to usage analytics for millions of customers.

A good example of this is LinkedIn’s ‘Who Viewed My Profile’, which allows all 700 million+ users to slice and dice their page-view data. Another example is Uber’s Restaurant Manager, which enables restaurant owners across the globe to gain insights into menu preferences, sales metrics, busy hours, and so on. All such user-facing applications need an analytical store that can support thousands of queries per second with millisecond response times while ingesting millions of events per second.

In this talk, we will elaborate on how this is made possible using Apache Pinot, a popular, open-source, distributed OLAP store. Specifically, we will discuss how to maintain the p99 latency SLA in the presence of organic data growth and concurrent queries.

What is the work you're doing today?

My most recent work experience was at Uber, where I was leading the streaming team. The mission was to build a platform around Kafka, Flink, and Pinot and power real-time analytics for different business groups. This platform supports different classes of users at Uber. For instance, we had engineers building their own services on top of this analytical platform. They could author their own ingestion pipelines to transform the data before it's loaded into something like Pinot, and then run their own queries using RPC calls. At the same time, we also had non-technical audiences (e.g., operations personnel) and data scientists who may or may not know programming. For those people, we developed a one-click experience: a UI tool where you can drag and drop from your data sources, add a preprocessor visually, load the data into Pinot, and then query it using standard SQL. While simple for the user, there is a lot of complexity underneath. We need to handle query parsing, optimization, logical schemas, job deployment across different regions, automatic failure handling, and scaling of those pipelines.

What is the goal for the talk?

I want to focus on one component: Pinot, an open-source, distributed analytical database. The emphasis is on how to achieve very low query latency despite high throughput. This might seem trivial for a standard database (e.g., MySQL), but it is non-trivial for analytical databases because of the complex queries they execute. The talk will discuss use cases that need guaranteed low latency for analytical queries and how you can tune Pinot to achieve this goal.

What would you want people to leave your talk with?

Knowledge about Pinot, and how they can use it in their own environments to build such use cases, would be one. The second would be how the principles behind this solution can be applied to systems other than Pinot, because those principles transfer readily across different technologies. For example, we talk about how data segments are distributed across multiple nodes and how query optimization minimizes the query footprint. This applies to many other systems, so people can learn from how Pinot approaches it.
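To make those two principles concrete, here is a minimal, self-contained sketch in Python. It is an illustration only, not Pinot's actual implementation: the `Segment`, `assign_segments`, and `prune_segments` names are hypothetical, and real Pinot handles assignment via its controller and pruning via segment metadata on the broker. The idea it demonstrates is the one from the talk: spread segments across servers so query work is parallelized, and skip segments whose metadata (here, a time range) cannot match the query filter, shrinking the query footprint.

```python
from dataclasses import dataclass

# Hypothetical model for illustration; Pinot's real segment metadata is richer.
@dataclass(frozen=True)
class Segment:
    name: str
    min_ts: int  # earliest event timestamp in the segment (seconds)
    max_ts: int  # latest event timestamp in the segment (seconds)

def assign_segments(segments, num_servers):
    """Spread segments across servers round-robin so a query fans out in parallel."""
    assignment = {server: [] for server in range(num_servers)}
    for i, seg in enumerate(segments):
        assignment[i % num_servers].append(seg)
    return assignment

def prune_segments(segments, query_min_ts, query_max_ts):
    """Keep only segments whose time range can overlap the query's filter."""
    return [s for s in segments
            if s.max_ts >= query_min_ts and s.min_ts <= query_max_ts]

# One segment per day for 30 days of data.
DAY = 86400
segments = [Segment(f"seg_{d}", d * DAY, (d + 1) * DAY - 1) for d in range(30)]

assignment = assign_segments(segments, num_servers=3)
relevant = prune_segments(segments, query_min_ts=7 * DAY, query_max_ts=9 * DAY)
print(len(relevant))  # -> 3: only 3 of the 30 daily segments need scanning
```

A 3-day filter touches 3 segments instead of 30, so each of the three servers scans at most one segment; pruning and parallelism together are what keep latency low even as data grows.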


Chinmay Soman

PMC Member/Committer @SamzaStream

Chinmay Soman has been working in the distributed systems domain for the past 10+ years. He started out at IBM, where he worked on distributed file systems and replication technologies. He then joined the Data Infrastructure team at LinkedIn and worked on open-source technologies such as Voldemort and Apache Samza. Until recently, he was a Senior Staff Software Engineer at Uber, where he led the Streaming and Real-Time Analytics Platform team. Currently, he’s a founding engineer at a stealth-mode company.


Tuesday Nov 17 / 11:40AM PST (40 minutes)

Track: Modern Data Engineering
