We have the following ES cluster setup and I wonder if it is possible to speed up aggregation in Kibana.
- We index our IIS logs into daily indexes in ES; each index is about 10GB, with 2 primary shards and 1 replica
- We have a dashboard of 8-10 visualizations that present most of the information from the IIS logs
- For normal usage, we need to see data from the last 1 hour to the last 24 hours, and response time for those ranges is acceptable now
- We also need to look at a larger timeframe, from 1 to 4 weeks, to get the big picture of how our sites perform over time.
- 2 data servers: 2x Intel Xeon X5650 2.67GHz; 192GB RAM, 1x3 SSD storage (RAID5)
- 1 master server: normal hardware, runs a master-only node
Each of the 2 data servers runs:
- 1 master-only ES instance with 4GB heap size
- 1 data-only ES instance with 30GB heap size
- 1 client-only ES instance with 16GB heap size (normally off, only turned on for testing)
In total, the ES cluster has 3 master nodes, 2 data nodes, and 2 optional client nodes (one on each data server).
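To make the fan-out of a wide time range concrete, here is a rough sketch of the shard arithmetic for the setup above. It uses only the figures already given (daily indexes, 2 primary shards, 1 replica, 2 data nodes). It relies on a search visiting one copy of each shard, and the even split across data nodes is an assumption rather than something I have measured.

```python
# Figures taken from the cluster description above.
PRIMARY_SHARDS_PER_INDEX = 2
REPLICAS = 1
DATA_NODES = 2


def shards_searched(days: int) -> int:
    """One daily index per day; a search visits one copy of each shard."""
    return days * PRIMARY_SHARDS_PER_INDEX


def shard_copies_stored(days: int) -> int:
    """Primaries plus replicas held by the cluster for the same period."""
    return days * PRIMARY_SHARDS_PER_INDEX * (1 + REPLICAS)


for days in (1, 7, 28):
    searched = shards_searched(days)
    # Assumption: shard requests are spread evenly over the data nodes.
    per_node = searched / DATA_NODES
    print(f"{days:>2} days: {searched} shards searched, "
          f"about {per_node:.0f} per data node")
```

So a 4-week dashboard refresh touches 28 indexes, and each of the 8-10 visualizations sends its own aggregation to every one of those shards.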
When testing dashboard response time, no indexing activity occurs.
As I said, response time when looking at the last 1 hour or 24 hours of data is OK, but when we look at the last 7 days, the dashboard takes longer to refresh than we expect. It's also important to note that we have not fully indexed our IIS logs: the 7 million records for the last 7 days are only about 5% of the actual data.
Dashboard with 8 visualizations
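For a sense of scale once everything is indexed, the figures above can be extrapolated as in the sketch below. It assumes the 5% sample is representative and that log volume is roughly the same every day; both are simplifications, and the function name is only illustrative.

```python
# Figures from the post: 7 million records cover 7 days at about 5% of the data.
INDEXED_RECORDS_7_DAYS = 7_000_000
INDEXED_PERCENT = 5


def projected_records(days: int) -> int:
    """Estimate the record count for `days` days once all logs are indexed.

    Assumes uniform daily volume and a representative 5% sample.
    """
    full_7_days = INDEXED_RECORDS_7_DAYS * 100 // INDEXED_PERCENT
    return full_7_days * days // 7


for days in (1, 7, 28):
    print(f"{days:>2} days: about {projected_records(days):,} records")
```

Under those assumptions the monthly view we are aiming for would aggregate several hundred million records, not the 7 million tested so far.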
Based on the query and request durations, it looks like the aggregation part is the most time-consuming. We collect Perfmon data from the two servers into ES, and here are some metrics:
- Very low heap size usage on all ES instances
- 130GB free memory on both servers
- No page file usage because I have disabled paging.
- Average CPU processor time: 45%
- Processor Queue Length: 0 most of the time
- Disk seconds per read/write: < 5ms
We do see a CPU spike of about 2-5% when the dashboard refreshes, but the other metrics look fine.
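One way to time a single aggregation outside Kibana is to send a search with `size` set to 0 and read the `took` value (in milliseconds) from the response. Below is a minimal sketch of such a request body. The field names `@timestamp` and `cs-uri-stem` are assumptions about the mapping, and the helper names are hypothetical.

```python
import json


def build_timing_request(days: int, field: str = "cs-uri-stem") -> dict:
    """Build a search body that runs one terms aggregation and returns no hits.

    `@timestamp` and the default `field` are assumed names; adjust them to
    the actual index mapping.
    """
    return {
        "size": 0,  # skip fetching documents, only aggregate
        "query": {"range": {"@timestamp": {"gte": f"now-{days}d"}}},
        "aggs": {"top_values": {"terms": {"field": field, "size": 10}}},
    }


def took_ms(response: dict) -> int:
    """Return the server-side search time reported by ES, in milliseconds."""
    return response["took"]


# Print the body for a 7-day window so it can be sent to the _search endpoint.
print(json.dumps(build_timing_request(7), indent=2))
```

Comparing `took` for 1, 7, and 28 days on one visualization at a time should show whether the time grows with the number of records or with the number of shards queried.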
- Is there any kind of bottleneck that makes the aggregation take that long, or is it normal because we are aggregating too many records at the same time on one dashboard?
- Is there any way to improve response time over a large number of records? Being able to look at a month of data is our minimum target for now.