We started using Elastic APM in our company, and we have hit a bottleneck that makes Elasticsearch queries take too long.
We have an Elasticsearch cluster with 8 nodes. Node roles are left at their defaults. Each node has 16 vCPUs, 60GB of memory (30GB heap) and a 2TB SSD data disk.
There are 2 APM servers with 16 vCPUs and 15GB memory.
Additionally, there is a Kibana server with 1 vCPU and 3.75GB memory.
In the Elasticsearch cluster, we currently have 80 shards (and no replicas) for transactions and errors.
With this shard count, our aim is to keep each shard between 20GB and 40GB.
When we run a search in Kibana, CPU usage on all Elasticsearch nodes hits its peak.
Our other APM configuration settings are:
- max_event_size: about 3MB
- queue.mem.events: 10240000
- output.elasticsearch.workers: 512
- output.elasticsearch.bulk_max_size: 20000
- setup.template.settings.index.number_of_routing_shards: 480
- setup.template.settings.index.refresh_interval: 180s
Our daily APM data is ~2TB.
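As a rough check of the shard sizing against the 20GB-40GB target, here is a minimal sketch of the arithmetic. It assumes the 80 primary shards together hold about one day of data (~2TB, in decimal units) and that the data is spread evenly across shards and nodes; the variable and function names are only illustrative.

```python
# Rough shard-size estimate.
# Assumption: all 80 primary shards together hold one day of APM data.
DAILY_DATA_GB = 2 * 1000          # ~2TB per day (decimal units assumed)
PRIMARY_SHARDS = 80               # no replicas
DATA_NODES = 8
TARGET_MIN_GB, TARGET_MAX_GB = 20, 40


def shard_size_gb(daily_gb: float, shards: int) -> float:
    """Average shard size if the data is spread evenly across shards."""
    return daily_gb / shards


size = shard_size_gb(DAILY_DATA_GB, PRIMARY_SHARDS)
shards_per_node = PRIMARY_SHARDS / DATA_NODES
in_target = TARGET_MIN_GB <= size <= TARGET_MAX_GB

print(f"{size:.1f} GB per shard, {shards_per_node:.0f} shards per node, "
      f"within target: {in_target}")
```

Under these assumptions each shard averages about 25GB, which is inside the target range, with 10 shards per node.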
Our problem is that the APM dashboard does not return a response when we try to view it. We investigated the queries that the Kibana APM dashboard sends and picked one as an example.
For instance, when we want to see the data for the last 2 hours, the query takes over 40 seconds. You can find the profiling results for this example search query at the link below.
We can't change the query, since we have no control over what the Kibana APM dashboard sends.
Do you have any suggestions regarding our infrastructure and configuration?
Thanks in advance.