Disable node during GC

ebuildy · April 5, 2018, 3:32pm

In a 10-node cluster, with ES 1.5, Java 1.8, and 8 GB of heap:

Sometimes, when we receive heavy traffic, a node goes into a stop-the-world GC for about 10 seconds, but the other nodes seem to keep sending it queries. The paused node queues them up, then boom: "rejected execution (queue capacity 1000)".

In a cluster, how can I deal with long GC pauses like this?

tomriley · April 5, 2018, 8:38pm

Hey Thomas,

Do you use any sort of load balancer in front of your Elasticsearch cluster? If so, you could look at using the load balancer to detect when a node is not performing sufficiently and remove it from traffic, without having to remove it from the cluster.
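To make the idea concrete, here is a rough sketch of the kind of logic a health check could apply. It is not Varnish or Elasticsearch code: `NodeHealth`, its thresholds, and the probe latencies are all hypothetical, and a real load balancer would express this in its own probe configuration. The point is only that a node is ejected after a few slow or failed probes and re-added after a few good ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeHealth:
    """Tracks probe results for one node (hypothetical helper, not a real API)."""

    max_latency_s: float = 1.0  # assumed threshold: slower probes count as failures
    eject_after: int = 2        # consecutive bad probes before leaving rotation
    restore_after: int = 3      # consecutive good probes before rejoining
    in_rotation: bool = True
    _bad: int = 0
    _good: int = 0

    def record(self, latency_s: Optional[float]) -> bool:
        """Record one probe; latency_s is None when the probe timed out or errored.

        Returns whether the node should currently receive traffic.
        """
        ok = latency_s is not None and latency_s <= self.max_latency_s
        if ok:
            self._good += 1
            self._bad = 0
            if not self.in_rotation and self._good >= self.restore_after:
                self.in_rotation = True
        else:
            self._bad += 1
            self._good = 0
            if self.in_rotation and self._bad >= self.eject_after:
                self.in_rotation = False
        return self.in_rotation
```

Requiring several good probes before re-adding the node is a deliberate choice here: it avoids flapping when a node comes out of one long pause and immediately enters another.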

It's also worth pointing out that Elasticsearch performs significantly better these days, as there have been many releases since ES 1.5. It might be worth upgrading, or building a new cluster on Elasticsearch 6.x: I would expect the gains to be significant enough to improve your cluster's performance, and I would guess Elastic has made many improvements to how Elasticsearch handles its JVM heap over the years.

Cheers,
Tom

ebuildy · April 9, 2018, 8:31am

Hello Tom,

Thank you. Unfortunately, we cannot upgrade by much.

We do have Varnish in front of the cluster, but the problem I see is that healthy nodes keep sending queries to replica shards located on the unhealthy node (I mean the node that is stuck in a long garbage collection).

It looks like the Elasticsearch cluster doesn't take long GC pauses into account to say "hey, this node is unhealthy".
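One way to at least see this from the outside would be to compare two snapshots of the nodes stats output (`GET /_nodes/stats/jvm`) and flag nodes whose old-generation GC time jumped between them. The sketch below is only that, a sketch: the function names and the 5000 ms threshold are made up, and it assumes the stats JSON exposes `jvm.gc.collectors.old.collection_time_in_millis` per node, which should be checked against the actual output of the cluster.

```python
from typing import Dict, List


def old_gc_millis(stats: dict) -> Dict[str, int]:
    """Extract cumulative old-gen GC time per node ID from a nodes-stats response.

    Assumes the layout nodes.<id>.jvm.gc.collectors.old.collection_time_in_millis;
    missing fields are treated as 0.
    """
    result = {}
    for node_id, node in stats.get("nodes", {}).items():
        old = node.get("jvm", {}).get("gc", {}).get("collectors", {}).get("old", {})
        result[node_id] = old.get("collection_time_in_millis", 0)
    return result


def nodes_in_long_gc(before: dict, after: dict, threshold_ms: int = 5000) -> List[str]:
    """Return node IDs whose old-gen GC time grew by at least threshold_ms.

    Nodes that appear only in the second snapshot are ignored (delta of 0).
    """
    prev = old_gc_millis(before)
    cur = old_gc_millis(after)
    return sorted(
        node_id
        for node_id, millis in cur.items()
        if millis - prev.get(node_id, millis) >= threshold_ms
    )
```

This does not stop the other nodes from routing shard requests to the paused node; it would only tell an external monitor which node is the one pausing.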
