Hello,
I've been trying for a couple of months to find the cause of a high load that happens randomly on one of my hot nodes.
Currently I have a 25-node cluster with 3 dedicated masters, 18 warm nodes and 4 hot nodes; all the indexing is done on the hot nodes.
The version is 7.12.1, self-managed on VMs, Platinum licensed. An upgrade is already planned for next year and cannot be done now.
All 4 hot nodes have the same specs: 12 vCPUs, 64 GB of RAM with 30 GB of heap (below the compressed oops threshold) and SSD-backed disks.
Ingestion is done using Logstash, but some Filebeat instances send data directly to Elasticsearch. The Logstash outputs are configured with the four hosts, and the same applies to the Filebeat outputs, so requests should be balanced between the 4 nodes.
My indices are mostly daily indices (indexname-yyyy.MM.dd) and some are monthly (indexname-yyyy.MM). They are all sized so that the shard size stays under 50 GB. I only use ILM to move indices from hot nodes to warm nodes and to delete them after some time.
Some indices, the larger ones with high e/s rates, have 4 shards and 1 or 0 replicas; other indices have 2 shards and 1 replica, or 1 shard and 1 replica. So in some cases I can end up with an index that has shards on only 2 of the hot nodes.
The issue is that from time to time one of the nodes has a high load, for example double that of the other nodes and higher than the number of vCPUs, which would suggest some I/O wait.
The problem is that I still haven't been able to track down the cause. I suspect it could be one or more indices whose shards concentrate on a single node, but I couldn't find an easy way to track this.
For example, running GET _nodes/current-high-load-node/hot_threads?threads=5 sometimes returns [write] threads as one or more of the hot threads, but there is no further information about what is causing them.
Running:
GET /_cat/thread_pool/write?v=true&h=node_name,active,queue,rejected,completed&s=node_name
Shows me that one of the nodes is using all the threads in its write thread pool and has some requests in the queue. How can I find out what is causing this queue? Does anyone have any idea?
node_name     active  queue  rejected  completed
node-name-01       1      0         0   57832635
node-name-02      12     63         0   61280527
node-name-03       5      0         0   57961637
node-name-04       2      0         0   51770382
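
To make the shard-concentration suspicion concrete, this is roughly the check I have in mind, as a minimal Python sketch. It is only an illustration under assumptions: it expects rows shaped like the JSON output of GET _cat/shards?format=json&h=index,shard,prirep,state,node, and the helper name shards_by_node, the index names and the date suffix are all made up. It counts started shard copies per node (primaries and replicas, since both take part in indexing) for indices whose name ends with a given suffix.

```python
from collections import Counter, defaultdict


def shards_by_node(rows, index_suffix=""):
    """Group started shard copies by node, then by index.

    rows: list of dicts with the keys index, shard, prirep, state, node,
    as assumed from `_cat/shards?format=json` (hypothetical input here).
    """
    by_node = defaultdict(Counter)
    for row in rows:
        # Skip shards that are not started (unassigned shards have no node).
        if row.get("state") != "STARTED":
            continue
        # Keep only the indices being written to, e.g. today's daily indices.
        if not row["index"].endswith(index_suffix):
            continue
        by_node[row["node"]][row["index"]] += 1
    return by_node


# Hypothetical sample data, not taken from the real cluster.
sample = [
    {"index": "logs-a-2021.06.01", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-name-02"},
    {"index": "logs-a-2021.06.01", "shard": "0", "prirep": "r", "state": "STARTED", "node": "node-name-03"},
    {"index": "logs-a-2021.06.01", "shard": "1", "prirep": "p", "state": "STARTED", "node": "node-name-02"},
    {"index": "logs-a-2021.06.01", "shard": "1", "prirep": "r", "state": "STARTED", "node": "node-name-01"},
    {"index": "logs-b-2021.06.01", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-name-02"},
    {"index": "logs-b-2021.06.01", "shard": "0", "prirep": "r", "state": "STARTED", "node": "node-name-04"},
    {"index": "logs-b-2021.05.31", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-name-01"},
    {"index": "logs-c-2021.06.01", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "node": None},
]

result = shards_by_node(sample, index_suffix="2021.06.01")
totals = {node: sum(per_index.values()) for node, per_index in result.items()}

# Print nodes from most to least loaded, with the per-index breakdown.
for node in sorted(totals, key=totals.get, reverse=True):
    print(node, totals[node], dict(result[node]))
```

A node that holds noticeably more shard copies of the actively written indices than its peers would be a candidate for the high load, although shard count alone says nothing about the indexing rate of each index.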