ES master CPU spike suddenly 100% and fails to apply cluster state

UDixit · October 8, 2020, 4:55pm

We are currently running ES 5.6. All nodes in the cluster are both master-eligible and data nodes.
discovery.zen.minimum_master_nodes is set to n/2+1.
Recently we have been seeing the master's CPU usage spike very high (~90%), after which all cluster update tasks start failing with a timeout.

TraceLevel="DEBUG" ComponentName="default" Message="[2020-10-08T06:45:53,284][DEBUG][org.elasticsearch.action.admin.cluster.reroute.TransportClusterRerouteAction] failed to perform [cluster_reroute (api)]
ProcessClusterEventTimeoutException[failed to process cluster event (cluster_reroute (api)) within 30s]
at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$onTimeout$1(ClusterService.java:254)

Soon after, we see multiple log entries on the master indicating that write and flush on the network layer failed:

[org.elasticsearch.transport.netty4.Netty4Transport] write and flush on the network layer failed (channel: [id: 0xb8a365a6, L:0.0.0.0/0.0.0.0:9300 ! R:/xxxxxxxxx])
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

We recently added a new check to our rolling restart logic that keeps issuing allocation/reroute/retry_failed every minute until the cluster converges.

During a rolling restart the restarting node is down, so the reroute fails many times before it succeeds. Could this be putting extra load on the master?

The cluster is otherwise healthy, but we see these issues frequently during rolling restarts.
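
To make the retry logic concrete, here is a minimal Python sketch of a gentler variant of the check described above. Instead of retrying on a fixed one-minute interval, it backs off exponentially and skips the reroute while the master still reports queued cluster-state tasks. The function name, the parameters and their defaults are hypothetical, and whether this actually reduces master load in our case is an assumption we have not verified. The two callables are expected to wrap `GET /_cluster/health` (which returns `status` and `number_of_pending_tasks`) and `POST /_cluster/reroute?retry_failed=true`.

```python
import time
from typing import Callable, Dict


def retry_failed_until_green(
    get_health: Callable[[], Dict],
    reroute_retry_failed: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    base_delay: float = 60.0,     # hypothetical default: the 1-minute interval from the post
    max_delay: float = 600.0,     # hypothetical cap on the backoff
    max_attempts: int = 10,       # hypothetical limit so the loop cannot run forever
    max_pending_tasks: int = 0,   # only reroute when the master's task queue is this short
) -> bool:
    """Return True once cluster health is green, False if attempts run out."""
    delay = base_delay
    for _ in range(max_attempts):
        health = get_health()
        if health.get("status") == "green":
            return True
        # Do not add another cluster_reroute task while the master is still
        # working through queued cluster-state updates.
        if health.get("number_of_pending_tasks", 0) <= max_pending_tasks:
            reroute_retry_failed()
        sleep(delay)
        # Exponential backoff, capped at max_delay.
        delay = min(delay * 2, max_delay)
    # One last check after the final wait.
    return get_health().get("status") == "green"
```

The HTTP calls are injected rather than hard-coded so the loop can be exercised without a live cluster; in the restart script they would be thin wrappers around whatever HTTP client is already in use.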

system · November 5, 2020, 4:55pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
