How can I speed up Kibana aggregation?


(Anh) #1

Hi All,

We have the following ES cluster setup and I wonder if it is possible to speed up aggregation in Kibana.

  • We index our IIS logs into daily indexes in Kibana, and each index is about 10GB with 2 primary shards and 1 replica
  • We have a dashboard of 8-10 visualizations that present most of info from IISlogs
  • For normal usage, we need see data within the last 1 hour to last 24 hours, in which response time is acceptable now
  • We also need to look at a larger timeframe, from 1 to 4 weeks, to get the big picture of how our sites perform over time.

Hardwares:

  • 2 data servers: 2x Intel Xeon X5650 2.67GHz; 192GB RAM, 1x3 SSD storage (RAID5)
  • 1 master server: normal hardware, runs master only node

ES instances:
2 data servers run:

  • 1 master only ES instance with 4GB heap size
  • 1 data only ES instance with 30GB heap size
  • 1 client only ES instance with 16GB heap size (remains off, only turn on for testing)

In total, the ES cluster has 3 master nodes, 2 data nodes and 2 optional client nodes on each server.
When testing dashboard response time, no indexing activity occurs.

As I said, response time when looking at last 1 hour or 24 hours data is ok, but when we look at last 7 days, the dashboad takes longer to refresh than we expect. It's also important to note that we have not fully indexed our IIS logs, 7 millions records for the last 7 days is about 5% of the actual data.

One visualization

Dashboard with 8 visualizations

Based on query and request duration, it looks like that the aggregation part is the most time consuming. We gather Perfmon data of the two servers into ES and here are some metrics:

  • Very low heap size usage on all ES instances
  • 130GB free memory on both servers
  • No page file usage because I have disabled paging.
  • Average CPU processor time: 45%
  • Processor Queue Length: 0 most of the time
  • Disk second per read/write: < 5ms

We do see about 2-5% CPU spike when the dashboard refreshes. but other metrics look fine.

My questions:

  1. Is there any kind of bottleneck that make the aggreation take that long or is it normal because we are aggregation too many records at the same time on a dashboard?
  2. Is there anyway to improve response time over large amount of records? Looking at monthly data is the minimum timeframe target for now.

Thanks,


Slow percentile aggregation: Is it unusual or normal?
(Matt Bargar) #2

It looks like the query in Elasticsearch is executing relatively quickly, but the request as a whole is very slow. I would start by loading the dashboard page with your browser developer tools open to the network tab and looking at where the most time is being spent (see example below). My guess is that you're pulling a large amount of data over the wire, but there could be other issues that are making the requests slow. The network tab should also tell you how large the response payloads are.


(Anh) #3

Below are results from Dev tools

1st time load

Refresh

Looks like it took a long time for the browser to receive the first byte from the server.


(Matt Bargar) #4

Wow, you're telling me! So either elasticsearch is taking a very long time to respond, or there's a bottleneck on the Kibana backend. Could you try running the same query against Elasticsearch directly while timing it? You should be able to do this with Curl, Postman or even Sense. That should tell us if the problem lies with Kibana or Elasticsearch at least.

Also, what versions of Kibana and Elasticsearch are you using? And is there anything else special about your setup, like a proxy in front of Kibana perhaps?


(Anh) #5

I'm running ES 2.3 and Kibana 4.5. No proxy in front of Kibana, and Kibana run on the server physicall server with the ES node that it queries agaists. I've tried to point Kibana to Master only node, Data node, and Client node on the server server but response time remains the same in each case.

I'll get the query body and run in Sense to see how long it would take.

Thanks,


(Anh) #6

I think I found the cause of slowness. I have 10 visualizations on the dashboard, and the longest one, the percentile visualization on time-taken over time, accounts for 90% of the request duration. Wonder why percentiles graphs are so expensive.

After removing this visualization:

Query Duration 1337ms
Request Duration 1949ms
Hits 12937260


(Matt Bargar) #7

Awesome, I'm glad you were able to narrow the problem down! Since the problem seems to lie with this one query, you might want to post a follow up question about it in the Elasticsearch section of the forums. Unfortunately I'm not an expert in analyzing query performance.


(Anh) #8

Thanks, I'll create a new topic in Elasticsearch section.


#9

Hi @anhlqn
Can you please share how to determine the running times for visualizations in a dashboard?
How did you find out the lengthy one?

Thanks


(Anh) #10

In my case because I added the percentile visualization to the dashboard last, I thought it could be the cause of slowness.


(system) #11