Lately our logging Elasticsearch cluster started acting up, crashing a node once in a while because of heap out-of-memory issues. I started putting more resources into our 6 nodes, continued by upgrading from 2.4 to 5.5, and now I'm stuck with a cluster that does index data but will not answer my queries.
Right now it's 10 nodes: 8 data, 2 master. Around 1080 indices of 10-50 GB in 11,500 shards. It's gotten big, maybe too big.
Now it keeps timing out requests, and there are a lot of node-stats errors in the logs on the master:
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticdb02pl][10.77.168.41:9300][cluster:monitor/nodes/stats[n]] request_id [233363] timed out after [15000ms]
I tried putting more resources into the cluster:
* gave the data nodes more memory, now at 32 GB with a 20 GB heap
* more CPU, now at 6 vCPUs, was 4
* more nodes, was at 2 masters + 6 data, now at 2 masters + 8 data
It still won't answer my prayers... errr, requests...
I don't know where the bottleneck is. The nodes are all in the same VLAN with no firewalls; the hardware is VMware 6.0, storage is NFS with one 9 TB mount per node, approximately 65% full.
It ran fine, until it didn't...
Based on the data you have provided, you have around 6 TB of data per node with an average shard size of 4 GB. To be able to hold as much data as possible per node, you should aim for a larger shard size, ideally somewhere around 20-30 GB. Use the shrink API to reduce the number of primary shards to 1, as Mark suggested, starting with the smallest indices. Once the shrink operation has completed, you can also reduce overhead by running a force merge down to a few segments (this is I/O intensive). Only do this for indices that are no longer being indexed into.
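For reference, a minimal sketch of that flow against the 5.x REST API using Python and requests. The index name, target name, and node name below are placeholders for illustration; swap in your own.

```python
# Sketch: shrink a read-only daily index down to 1 primary shard, then force merge it.
# Assumes Elasticsearch 5.x reachable on localhost:9200; names are hypothetical.
import requests

ES = "http://localhost:9200"
SOURCE = "logstash-2017.08.01"   # an index no longer being indexed into
TARGET = SOURCE + "-shrunk"
NODE = "elasticdb03pl"           # a data node with room for a full copy of the index

# 1. Block writes and require a copy of every shard on the same node.
requests.put(f"{ES}/{SOURCE}/_settings", json={
    "index.blocks.write": True,
    "index.routing.allocation.require._name": NODE
})

# 2. Wait until shard relocation has finished before shrinking.
requests.get(f"{ES}/_cluster/health", params={
    "wait_for_no_relocating_shards": "true",
    "timeout": "10m"
})

# 3. Shrink to a single primary shard.
requests.post(f"{ES}/{SOURCE}/_shrink/{TARGET}", json={
    "settings": {"index.number_of_shards": 1}
})

# 4. Force merge the shrunken index down to one segment (I/O intensive).
requests.post(f"{ES}/{TARGET}/_forcemerge", params={"max_num_segments": 1})
```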
I would also recommend installing X-Pack Monitoring if you have not already, as this will give you a better idea about what is going on.
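On 5.x, X-Pack is installed as a plugin on both Elasticsearch and Kibana, roughly like this (each Elasticsearch node needs a restart afterwards):

```
bin/elasticsearch-plugin install x-pack
bin/kibana-plugin install x-pack
```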
Also add another master node, as per Mark's suggestion. You always want 3 dedicated master nodes so that you can lose one and the remaining ones are still able to form a majority and elect a master.
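As a sketch, a dedicated master node's elasticsearch.yml on 5.x would look roughly like this, with discovery.zen.minimum_master_nodes set to 2 on all master-eligible nodes so a majority of the three is required:

```yaml
# elasticsearch.yml for a dedicated master-eligible node (ES 5.x)
node.master: true
node.data: false
node.ingest: false
# majority of 3 master-eligible nodes
discovery.zen.minimum_master_nodes: 2
```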
If you need to have 8 primary shards, I would recommend having each index cover a longer time period, e.g. a week or a month. It is also possible to index into 8 primary shards and then use the shrink API to reduce the number of primary shards once the index is no longer being indexed into.
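If you go that route, the shard count for new indices is controlled by an index template, and the daily-to-monthly switch is made wherever the index name is generated (in Logstash, for example, something like index => "logstash-%{+YYYY.MM}" in the elasticsearch output). A rough 5.x sketch, with made-up template and pattern names:

```python
# Sketch: an index template (ES 5.x) that keeps 8 primary shards for new
# logstash-* indices; the template name and pattern are assumptions based on
# a logstash-style naming scheme.
import requests

requests.put("http://localhost:9200/_template/logstash-monthly", json={
    "template": "logstash-*",   # 5.x uses "template"; later versions use "index_patterns"
    "settings": {
        "index.number_of_shards": 8,
        "index.number_of_replicas": 1
    }
})
```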
You can also use the rollover API to cut indices based on size rather than time, to ensure you don't end up with too many small shards, which is inefficient. If you have a long retention period, all indices do not necessarily need to cover the same amount of time.
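A sketch of the rollover flow on 5.x, with placeholder names. On 5.x the supported conditions are max_age and max_docs; a byte-size condition was only added in later releases, so document count is used here as a stand-in for size:

```python
# Sketch: bootstrap a rollover alias and roll it over on a schedule (ES 5.x).
import requests

ES = "http://localhost:9200"

# 1. Create the first index and point a write alias at it.
requests.put(f"{ES}/logs-000001", json={
    "aliases": {"logs-write": {}}
})

# 2. Call periodically (e.g. from cron): if either condition is met,
#    logs-000002 is created and the alias moves to it.
requests.post(f"{ES}/logs-write/_rollover", json={
    "conditions": {
        "max_age": "7d",
        "max_docs": 100000000
    }
})
```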