ES 5.5 cluster unsearchable, node stats timeouts

Lately our logging Elasticsearch cluster started acting up, crashing a node once in a while because of heap out-of-memory issues. I started putting more resources into our 6 nodes, continued by upgrading from 2.4 to 5.5, and now I'm stuck with a cluster that does index data but will not answer my queries.
Right now it's 10 nodes: 8 data, 2 master. Around 1080 indices of 10-50 GB each, spread over roughly 11,500 shards. It's gotten big, maybe too big.
Now it keeps timing out requests, and there are a lot of node stats errors in the logs on the master:
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticdb02pl][10.77.168.41:9300][cluster:monitor/nodes/stats[n]] request_id [233363] timed out after [15000ms]

I tried putting more resources into the cluster:
*gave the data nodes more memory, now at 32 GB with 20 GB for heap
*more CPU, now at 6 vCPU, was 4
*more nodes, was at 2 masters + 6 data, now at 2 masters + 8 data.
It still won't answer my prayers... errr, requests...

I don't know where the bottleneck is. The nodes are all in the same VLAN with no firewalls, HW is VMware 6.0, storage is NFS with one 9 TB mount per node, approx. 65% full.
It ran fine, until it didn't...

Any suggestions please?

/Martin

That's bad: two master-eligible nodes cannot form a majority if one of them is lost. See Important Configuration Changes | Elasticsearch: The Definitive Guide [2.x] | Elastic

That's too many shards. Use the _shrink API to reduce that so each index is a single shard.

Based on the data you have provided, you have around 6 TB of data per node with an average shard size of 4 GB. In order to be able to hold as much data as possible per node, you should aim for a larger shard size, ideally somewhere around 20-30 GB. Use the shrink API to reduce the number of primary shards to 1, as Mark suggested, starting with the smallest indices. You can also reduce overhead by running a force merge down to a few segments (this is I/O intensive) once the shrink operation has completed. Only do this for indices that are no longer being indexed into.
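For what it's worth, here is a minimal sketch of that shrink-then-force-merge flow using Python and the requests library against the 5.5 REST API. The host, the index names and the node name used for the temporary allocation requirement are placeholders, not values from your cluster:

```python
# Sketch: shrink an old index down to 1 primary shard, then force merge it.
# Assumes the source index is no longer being written to.
import time
import requests

ES = "http://localhost:9200"        # placeholder: any node in the cluster
SOURCE = "logstash-2017.08.01"      # hypothetical daily index
TARGET = SOURCE + "-shrunk"
SHRINK_NODE = "elasticdb01pl"       # placeholder: a data node with room for a full copy

# 1. Block writes and require one copy of every shard on a single node.
requests.put(f"{ES}/{SOURCE}/_settings", json={
    "index.routing.allocation.require._name": SHRINK_NODE,
    "index.blocks.write": True,
})

# 2. Wait for the relocations to finish.
while True:
    health = requests.get(f"{ES}/_cluster/health/{SOURCE}").json()
    if health["relocating_shards"] == 0:
        break
    time.sleep(10)

# 3. Shrink down to a single primary shard.
requests.post(f"{ES}/{SOURCE}/_shrink/{TARGET}", json={
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1,
    }
})

# 4. Once the target is green, drop the temporary settings and force merge.
requests.get(f"{ES}/_cluster/health/{TARGET}",
             params={"wait_for_status": "green", "timeout": "30m"})
requests.put(f"{ES}/{TARGET}/_settings", json={
    "index.routing.allocation.require._name": None,
    "index.blocks.write": None,
})
requests.post(f"{ES}/{TARGET}/_forcemerge", params={"max_num_segments": 1})
```

Once the shrunken index is green and merged, you can delete the original and, if you need the old name for searches, point an alias at the new index.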

I would also recommend installing X-Pack Monitoring if you have not already, as this will give you a better idea about what is going on.

Also add another master node as per Mark's suggestion. You always want 3 dedicated master nodes so that you can lose one and the remaining ones are able to form a majority and elect a master.
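As a small sketch of the quorum side of that (Python + requests, host is a placeholder): once the third master-eligible node has joined, set discovery.zen.minimum_master_nodes to 2, and mirror the same value in elasticsearch.yml on the master-eligible nodes so it survives a full cluster restart.

```python
# Sketch: with 3 master-eligible nodes, the majority is 2.
import requests

ES = "http://localhost:9200"  # placeholder: any node in the cluster

requests.put(f"{ES}/_cluster/settings", json={
    "persistent": {
        "discovery.zen.minimum_master_nodes": 2
    }
})
```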

What about new indices? I have the shard count set to 8 now, to split new indices over all 8 data nodes for max indexing performance.

If you need to have 8 primary shards, I would recommend having each index cover a longer time period, e.g. a week or a month. It is also possible to index into 8 primary shards and then use the shrink API to reduce the number of primary shards once the index is no longer being indexed into.

You can also use the rollover API to cut indices based on size rather than time, to ensure you don't end up with too many small shards, which is inefficient. If you have a long retention period, all indices do not necessarily need to cover the same amount of time.
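A minimal sketch of that, again Python + requests with a made-up write alias called logs-write. Note that on 5.5 the rollover conditions are max_age and max_docs (max_size came in a later release), so a document count has to stand in as a rough proxy for a 20-30 GB index:

```python
# Sketch: size-driven index rollover via a write alias.
import requests

ES = "http://localhost:9200"  # placeholder: any node in the cluster

# One-time bootstrap: create the first index and point the write alias at it.
requests.put(f"{ES}/logs-000001", json={
    "settings": {"index.number_of_shards": 1},
    "aliases": {"logs-write": {}},
})

# Run periodically (e.g. from cron): rolls over when the current index is old
# or large enough, otherwise it is a no-op.
resp = requests.post(f"{ES}/logs-write/_rollover", json={
    "conditions": {
        "max_age": "7d",
        "max_docs": 50000000,   # assumption: tune to hit your target shard size
    }
})
print(resp.json())
```

Indexing then always goes through the logs-write alias; when a condition matches, a new index (logs-000002, and so on) is created and the alias is moved to it.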

Given your volume size, you don't really need to worry about indexing performance at this point.
