You are viewing content from a past/completed QCon Plus - November 2020


Building Latency Sensitive User Facing Analytics via Apache Pinot

Real-time analytics has become essential for modern Internet companies. The ability to derive internal insights around business metrics, user growth and adoption, as well as security incidents from all the raw logs is crucial for day-to-day operations. Even more critical is enabling access to usage analytics for millions of customers, which is non-trivial to achieve.

A good example of this is LinkedIn’s ‘Who Viewed My Profile’, which allows all 700 million+ users to slice and dice their page view data. Another example is Uber’s Restaurant Manager, which enables restaurant owners across the globe to gain insights around menu preferences, sales metrics, busy hours, and so on. All such user-facing applications need an analytical store that can support thousands of queries per second with millisecond-level response times while ingesting millions of events per second.

In this talk, we will elaborate on how this is made possible using Apache Pinot, a popular open source, distributed OLAP store. Specifically, we will talk about how to maintain the p99 latency SLA in the presence of organic data growth and concurrent queries.

Main Takeaways

1 Find out how LinkedIn, Uber, and other companies achieve low-latency analytical database queries despite high throughput.

2 Learn about the internals of Pinot that make low latency possible, and the principles that Pinot uses.

What is the work you're doing today?

My most recent work experience was at Uber, where I led the streaming team. The mission was to build a platform around Kafka, Flink, and Pinot to power real-time analytics for different business groups. This platform supports different classes of users at Uber. For instance, we had engineers building their own services on top of this analytical platform. They could author their own ingestion pipelines to transform the data before it's loaded into a store like Pinot, and then issue their own queries using RPC calls. At the same time, we also had a non-technical audience (e.g., operations personnel) and data scientists who may or may not know programming. For those people, we developed a one-click experience: a UI tool where you can drag and drop from your data sources, add a preprocessor in a visual way, load the data into Pinot, and then query it using standard SQL. While simple for the user, there is a lot of complexity to handle underneath: query parsing, optimization, logical schemas, job deployment across different regions, automatic failure handling, and scaling of those pipelines.

What is the goal for the talk?

I want to focus on one component: Pinot, an open source, distributed analytical database. The emphasis is on how we can achieve very low query latency despite high throughput. This might seem trivial for a standard database (e.g., MySQL), but it is non-trivial for analytical databases because of the complex queries they execute. The talk will discuss use cases that need guaranteed low latency for analytical queries and how you can tune Pinot to achieve this goal.

What would you want people to leave your talk with?

First, knowledge about Pinot and how to use it in their own environments to build such use cases. Second, how the principles behind this solution can be applied to other systems similar to Pinot, because those principles transfer easily across technologies. For example, we talk about how data segments are distributed across multiple nodes and how query optimization minimizes the amount of data each query touches. This applies to many other systems, so people can learn from how Pinot does it.
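One of those transferable principles is metadata-based segment pruning: because each segment carries metadata such as the min/max values of its time column, the query layer can skip any segment that cannot possibly match a query's time filter, shrinking the query footprint before any data is scanned. A minimal sketch of the idea (names and structure are illustrative, not Pinot's actual API):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A time-bounded shard of a table, in the spirit of Pinot's segment layout."""
    name: str
    start_ts: int  # minimum event timestamp stored in the segment
    end_ts: int    # maximum event timestamp stored in the segment

def prune_segments(segments, query_start, query_end):
    """Keep only segments whose time range overlaps the query's time filter.

    A segment overlaps the filter [query_start, query_end] iff it ends at or
    after the filter's start and begins at or before the filter's end.
    """
    return [s for s in segments
            if s.end_ts >= query_start and s.start_ts <= query_end]

segments = [
    Segment("s0", 0, 99),
    Segment("s1", 100, 199),
    Segment("s2", 200, 299),
]
# A query filtering on timestamps 150..250 only needs to scan s1 and s2.
hit = prune_segments(segments, 150, 250)
print([s.name for s in hit])  # ['s1', 's2']
```

Real systems prune on more than time (partition keys, column min/max, bloom filters), but the shape of the optimization is the same: use cheap per-segment metadata to avoid touching data at all.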


Chinmay Soman

PMC Member/Committer @SamzaStream

Chinmay Soman has been working in the distributed systems domain for the past 10+ years. He started out in IBM where he worked on distributed file systems and replication technologies. He then joined the Data Infrastructure team in LinkedIn and worked on open source technologies such as Voldemort...



Tuesday Nov 17 / 02:40PM EST (40 minutes)


Modern Data Engineering



From the same track


Modern Data Engineering Panel

Tuesday Nov 17 / 03:30PM EST

Data Engineering is a vast field that concerns itself with efficient access to data based on the needs of a business. Though data is the prized entity from which a company extracts insights, data doesn't exist in a void. It first needs to be stored somewhere and then an API needs to be...

Shrijeet Paliwal

Sr. Staff Software Engineer @Tesla

Chinmay Soman

PMC Member/Committer @SamzaStream

Chris Riccomini

Distinguished Engineer @WePay


Serverless Search for My Blog With Java, Quarkus, & AWS Lambda

Tuesday Nov 17 / 01:00PM EST

A Serverless app? With Java?! Absolutely! We’ll discuss when Serverless is a great fit (and when it isn’t!) and why you don’t need to leave the Java platform when going Serverless. Based on the real-world example of a Serverless blog search, you’ll learn how Quarkus and...

Gunnar Morling

Open Source Software Engineer @RedHat


Designing IoT Data Pipelines for Deep Observability

Tuesday Nov 17 / 01:50PM EST

Millions of IoT devices emitting trillions of events per day enable us to track the health of the Tesla fleet. From a data engineering perspective, it's a challenging scale, but what makes it unique is how naturally fragile the data pipeline is. The physical world is full of chaos: what if...

Shrijeet Paliwal

Sr. Staff Software Engineer @Tesla
