Lately our logging Elasticsearch cluster started acting up, crashing a node once in a while with heap out-of-memory errors. I started putting more resources into our 6 nodes, then upgraded from 2.4 to 5.5, and now I'm stuck with a cluster that indexes data fine but will not answer my queries.
Right now it's 10 nodes: 8 data, 2 master, holding around 1080 indices of 10-50 GB each in 11500 shards. It's gotten big, maybe too big.
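For context, this is the quick sketch I've been using to see how those shards are spread across the data nodes (it assumes the cluster answers on localhost:9200 with no auth, so adjust as needed):

    # Sketch: count shards per node via the _cat/shards API.
    # Assumes the cluster is reachable on localhost:9200 without auth.
    import collections
    import requests

    rows = requests.get("http://localhost:9200/_cat/shards",
                        params={"h": "node", "format": "json"},
                        timeout=30).json()
    per_node = collections.Counter(row["node"] for row in rows)
    for node, count in per_node.most_common():
        print(node, count)

Roughly 11500 shards over 8 data nodes works out to well over 1000 shards per node.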
Now it keeps timing out requests, and there are a lot of node stats errors in the logs on the master:
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticdb02pl][10.77.168.41:9300][cluster:monitor/nodes/stats[n]] request_id  timed out after [15000ms]
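As far as I understand, that means the master asked elasticdb02pl for its node stats and got no answer within 15 seconds. To see which nodes are slow to respond, I've been timing the stats call per node with something like this sketch (same localhost:9200 / no-auth assumption):

    # Sketch: time the nodes-stats call for each node separately to find
    # the slow responders. Assumes localhost:9200, no auth.
    import time
    import requests

    nodes = requests.get("http://localhost:9200/_cat/nodes",
                         params={"h": "name", "format": "json"},
                         timeout=30).json()
    for n in nodes:
        name = n["name"]
        start = time.time()
        try:
            requests.get("http://localhost:9200/_nodes/%s/stats/jvm,fs" % name,
                         timeout=20)
            print("%s answered in %.1fs" % (name, time.time() - start))
        except requests.exceptions.Timeout:
            print("%s timed out after 20s" % name)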
I tried putting more resources into the cluster:
* gave the data nodes more memory, now 32 GB RAM with a 20 GB heap (see the heap check below)
* more CPU, now at 6 vCPUs, up from 4
* more nodes, was at 2 masters + 6 data, now at 2 masters + 8 data
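The heap OOMs are what started all this, so this is the sketch I use to watch per-node heap pressure and old-gen GC (again assuming localhost:9200, no auth):

    # Sketch: per-node heap usage and old-gen GC stats from _nodes/stats/jvm.
    # Assumes localhost:9200, no auth.
    import requests

    stats = requests.get("http://localhost:9200/_nodes/stats/jvm",
                         timeout=30).json()
    for node in stats["nodes"].values():
        jvm = node["jvm"]
        old_gc = jvm["gc"]["collectors"]["old"]
        print("%s heap=%d%% old_gc_count=%d old_gc_time_ms=%d" % (
            node["name"], jvm["mem"]["heap_used_percent"],
            old_gc["collection_count"], old_gc["collection_time_in_millis"]))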
It still won't answer my prayers... errr, requests...
I don't know where the bottleneck is. The nodes are all in the same VLAN with no firewalls; HW is VMware 6.0, storage is NFS with one 9 TB mount per node, approx. 65% full.
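To try to narrow the bottleneck down (disk vs. GC vs. something else) I've also been pulling hot threads from the nodes, roughly like this (sketch, same assumptions as above):

    # Sketch: dump the hottest threads on every node to see what they
    # are actually busy doing. Assumes localhost:9200, no auth.
    import requests

    hot = requests.get("http://localhost:9200/_nodes/hot_threads",
                       params={"threads": 3}, timeout=60)
    print(hot.text)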
It ran fine, until it didn't...
Any suggestions please?