ES 5.5 cluster unsearchable, node stats timeouts

Lately our logging Elasticsearch cluster started acting up, crashing a node once in a while because of heap out-of-memory issues. I started putting more resources into our 6 nodes, continued by upgrading from 2.4 to 5.5, and now I'm stuck with a cluster that does index data but will not answer my queries.
Right now it's 10 nodes: 8 data, 2 master. Around 1080 indices of 10-50 GB each, spread over roughly 11,500 shards. It's gotten big, maybe too big.
Now it keeps timing out requests, and there are a lot of node stats errors in the logs on the master:
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticdb02pl][10.77.168.41:9300][cluster:monitor/nodes/stats[n]] request_id [233363] timed out after [15000ms]

I tried putting more resources into the cluster:
*gave the data nodes more memory, now at 32 GB with 20 GB for heap
*more CPU, now at 6 vCPU, was 4
*more nodes, was at 2 masters + 6 data, now at 2 masters + 8 data.
It still won't answer my prayers... errr, requests...

I don't know where the bottleneck is. The nodes are all in the same VLAN with no firewalls, HW is VMware 6.0, storage is NFS with one 9 TB mount per node, approx. 65% full.
It ran fine, until it didn't...

Any suggestions please?

/Martin

That's bad: two master-eligible nodes cannot form a majority if one of them is lost. See Important Configuration Changes | Elasticsearch: The Definitive Guide [2.x] | Elastic

That's too many shards. Use the _shrink API to reduce that so each index is a single shard.

Based on the data you have provided, you have around 6 TB of data per node with an average shard size of 4 GB. In order to be able to hold as much data as possible per node, you should aim for a larger shard size, ideally somewhere around 20-30 GB. Use the shrink API to reduce the number of primary shards to 1, as Mark suggested, starting with the smallest indices. You can also reduce overhead by running a force merge down to a few segments (this is I/O intensive) once the shrink operation has completed. Only do this for indices that are no longer being indexed into.
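For what it's worth, here is a minimal sketch of that shrink-then-force-merge flow using Python and the requests library against the 5.5 REST API. The host, the index names and the node name used for the temporary allocation requirement are placeholders, not values from your cluster:

```python
# Sketch: shrink an old index down to 1 primary shard, then force merge it.
# Assumes the source index is no longer being written to.
import time
import requests

ES = "http://localhost:9200"        # placeholder: any node in the cluster
SOURCE = "logstash-2017.08.01"      # hypothetical daily index
TARGET = SOURCE + "-shrunk"
SHRINK_NODE = "elasticdb01pl"       # placeholder: a data node with room for a full copy

# 1. Block writes and require one copy of every shard on a single node.
requests.put(f"{ES}/{SOURCE}/_settings", json={
    "index.routing.allocation.require._name": SHRINK_NODE,
    "index.blocks.write": True,
})

# 2. Wait for the relocations to finish.
while True:
    health = requests.get(f"{ES}/_cluster/health/{SOURCE}").json()
    if health["relocating_shards"] == 0:
        break
    time.sleep(10)

# 3. Shrink down to a single primary shard.
requests.post(f"{ES}/{SOURCE}/_shrink/{TARGET}", json={
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1,
    }
})

# 4. Once the target is green, drop the temporary settings and force merge.
requests.get(f"{ES}/_cluster/health/{TARGET}",
             params={"wait_for_status": "green", "timeout": "30m"})
requests.put(f"{ES}/{TARGET}/_settings", json={
    "index.routing.allocation.require._name": None,
    "index.blocks.write": None,
})
requests.post(f"{ES}/{TARGET}/_forcemerge", params={"max_num_segments": 1})
```

Once the shrunken index is green and merged, you can delete the original and, if you need the old name for searches, point an alias at the new index.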

I would also recommend installing X-Pack Monitoring if you have not already, as this will give you a better idea about what is going on.

Also add another master node as per Mark's suggestion. You always want 3 dedicated master nodes so that you can lose one and the remaining ones are able to form a majority and elect a master.
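As a small sketch of the quorum side of that (Python + requests, host is a placeholder): once the third master-eligible node has joined, set discovery.zen.minimum_master_nodes to 2, and mirror the same value in elasticsearch.yml on the master-eligible nodes so it survives a full cluster restart.

```python
# Sketch: with 3 master-eligible nodes, the majority is 2.
import requests

ES = "http://localhost:9200"  # placeholder: any node in the cluster

requests.put(f"{ES}/_cluster/settings", json={
    "persistent": {
        "discovery.zen.minimum_master_nodes": 2
    }
})
```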

What about new indices? I have the shard count set to 8 now, to split new indices over all 8 data nodes for max indexing performance.

If you need to have 8 primary shards, I would recommend having each index cover a longer time period, e.g. a week or a month. It is also possible to index into 8 primary shards and then use the shrink API to reduce the number of primary shards once the index is no longer being indexed into.

You can also use the rollover API to cut indices based on size rather than time, to ensure you don't end up with too many small shards, which is inefficient. If you have a long retention period, all indices do not necessarily need to cover the same amount of time.
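A minimal sketch of that, again Python + requests with a made-up write alias called logs-write. Note that on 5.5 the rollover conditions are max_age and max_docs (max_size came in a later release), so a document count has to stand in as a rough proxy for a 20-30 GB index:

```python
# Sketch: size-driven index rollover via a write alias.
import requests

ES = "http://localhost:9200"  # placeholder: any node in the cluster

# One-time bootstrap: create the first index and point the write alias at it.
requests.put(f"{ES}/logs-000001", json={
    "settings": {"index.number_of_shards": 1},
    "aliases": {"logs-write": {}},
})

# Run periodically (e.g. from cron): rolls over when the current index is old
# or large enough, otherwise it is a no-op.
resp = requests.post(f"{ES}/logs-write/_rollover", json={
    "conditions": {
        "max_age": "7d",
        "max_docs": 50000000,   # assumption: tune to hit your target shard size
    }
})
print(resp.json())
```

Indexing then always goes through the logs-write alias; when a condition matches, a new index (logs-000002, and so on) is created and the alias is moved to it.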

Given your volume size, you don't really need to worry about indexing performance at this point.
