ES master CPU suddenly spikes to 100% and cluster state updates fail to apply

We are currently running ES 5.6. All nodes in the cluster are master- and data-eligible.
minimum_master_nodes is set to n/2 + 1.
Recently we have been seeing the master's CPU usage spike very high (~90%), and all cluster update tasks start failing with a timeout.
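For context, the backlog of cluster-state updates on the master can be inspected with the pending tasks API. A minimal sketch (the host is a placeholder for any node in our cluster):

import requests

# Placeholder host; any node in the cluster will do.
ES_HOST = "http://localhost:9200"

def pending_cluster_tasks():
    """Return the master's queued cluster-state update tasks.

    When the master is overloaded this list grows, and tasks sit in the
    queue longer than their 30s timeout, which matches the
    ProcessClusterEventTimeoutException we see in the logs below.
    """
    resp = requests.get(f"{ES_HOST}/_cluster/pending_tasks", timeout=10)
    resp.raise_for_status()
    return resp.json().get("tasks", [])

if __name__ == "__main__":
    for task in pending_cluster_tasks():
        print(task["priority"], task["time_in_queue"], task["source"])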

TraceLevel="DEBUG" ComponentName="default" Message="[2020-10-08T06:45:53,284][DEBUG][org.elasticsearch.action.admin.cluster.reroute.TransportClusterRerouteAction] failed to perform [cluster_reroute (api)]
ProcessClusterEventTimeoutException[failed to process cluster event (cluster_reroute (api)) within 30s]
at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$onTimeout$1(ClusterService.java:254)

Soon after, we see multiple log entries on the master indicating that a write and flush on the network layer failed:

[org.elasticsearch.transport.netty4.Netty4Transport] write and flush on the network layer failed (channel: [id: 0xb8a365a6, L:0.0.0.0/0.0.0.0:9300 ! R:/xxxxxxxxx])
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

We recently added a new step to our rolling restart logic that keeps issuing a cluster reroute with retry_failed every minute until the cluster converges.
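Roughly, that step behaves like the sketch below (illustrative only, not our exact code; the host is a placeholder and the 1-minute interval is as described above):

import time
import requests

ES_HOST = "http://localhost:9200"   # placeholder; points at the cluster

def cluster_converged():
    """True once there are no unassigned shards and relocation has finished."""
    health = requests.get(f"{ES_HOST}/_cluster/health", timeout=10).json()
    return health["unassigned_shards"] == 0 and health["relocating_shards"] == 0

def retry_failed_allocations():
    """Ask the master to retry shard allocations that hit the max retry limit.

    Each call is a cluster_reroute (api) task that the master must process,
    i.e. exactly the task type that is timing out in the logs above.
    """
    requests.post(f"{ES_HOST}/_cluster/reroute?retry_failed=true", timeout=60)

# Part of the rolling-restart flow: keep retrying every minute until convergence.
while not cluster_converged():
    retry_failed_allocations()
    time.sleep(60)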

During a rolling restart, while the node is down, the reroute will fail many times before it eventually succeeds. Could this be putting extra load on the master?
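One thing we could capture to confirm what the master is busy with is a hot threads dump during a spike. A sketch (assumes the elected master still responds to the nodes API; the host is a placeholder):

import requests

ES_HOST = "http://localhost:9200"  # placeholder

# Hot threads of the currently elected master; the plain-text output shows
# which threads are burning CPU while the spike is happening.
resp = requests.get(f"{ES_HOST}/_nodes/_master/hot_threads",
                    params={"threads": 5}, timeout=30)
print(resp.text)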

The cluster remains healthy otherwise, but during rolling restarts we see these issues frequently.
