Failed to process cluster event (put-lifecycle-fb_test) within 30s

I am running a 3-node Elasticsearch cluster. I get the error below when I try to create an ILM policy:

org.elasticsearch.transport.RemoteTransportException: [es3][192.168.10.52:9300][cluster:admin/ilm/put]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (put-lifecycle-fb_test) within 30s
	at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:158) ~[elasticsearch-7.17.9.jar:7.17.9]
	at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
	at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:157) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) ~[elasticsearch-7.17.9.jar:7.17.9]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1589) [?:?]

The following are my cluster health details:

{
  "cluster_name" : "ess",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 2448,
  "active_shards" : 2503,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 2120,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 134,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 1143201,
  "active_shards_percent_as_number" : 54.09552625891506
}
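To make the numbers above easier to read, here is a minimal Python sketch that derives a few totals from that health output. The helper name `summarize_health` is hypothetical, and the total is assumed to be active + initializing + unassigned shards, which reproduces the reported `active_shards_percent_as_number`.

```python
import json

# Health output copied from the post above.
HEALTH_JSON = """
{
  "cluster_name" : "ess",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 2448,
  "active_shards" : 2503,
  "relocating_shards" : 0,
  "initializing_shards" : 4,
  "unassigned_shards" : 2120,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 134,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 1143201,
  "active_shards_percent_as_number" : 54.09552625891506
}
"""


def summarize_health(health):
    """Derive a few totals from a _cluster/health response (hypothetical helper)."""
    # Assumption: every shard is either active, initializing, or unassigned.
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    return {
        "total_shards": total,
        "active_percent": 100.0 * health["active_shards"] / total,
        # Shards each data node would hold if everything were assigned evenly.
        "shards_per_data_node": total / health["number_of_data_nodes"],
        # How long the oldest pending cluster task has been queued, in minutes.
        "max_wait_minutes": health["task_max_waiting_in_queue_millis"] / 60000,
    }


summary = summarize_health(json.loads(HEALTH_JSON))
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

Read this way, the cluster has 4627 shards in total (roughly 1542 per data node if all were assigned), and the oldest of the 134 pending tasks has been queued for about 19 minutes, far longer than the 30s limit in the error.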

I have tried to restart the cluster, but every time I try to perform a synced flush, I get error 502 after 30 seconds.

The cluster is red, so you might need to wait for the missing primary shards to be recovered.
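One way to wait for recovery is the `wait_for_status` parameter of the cluster health API. The following is only a sketch: the base URL `http://localhost:9200` is an assumption (no security, default port), and the function names are hypothetical.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: adjust host, port, and auth


def health_wait_url(base_url, status="yellow", timeout="60s"):
    """Build a _cluster/health URL that blocks until `status` or `timeout`."""
    query = urllib.parse.urlencode({"wait_for_status": status, "timeout": timeout})
    return f"{base_url}/_cluster/health?{query}"


def wait_for_status(base_url, status="yellow", timeout="60s"):
    """Call the health API and return the parsed JSON body."""
    url = health_wait_url(base_url, status, timeout)
    try:
        # Client-side timeout is set above the server-side wait of 60s.
        with urllib.request.urlopen(url, timeout=90) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # Assumption: on an HTTP error status the body is still the health JSON.
        return json.load(err)


print(health_wait_url(BASE_URL))
```

Calling `wait_for_status(BASE_URL)` against a live node should return the health document, where `timed_out` shows whether the requested status was reached in time.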

Hello,

Even when the cluster health is yellow, I still get the same error.

Recently I have also been getting warnings like this on the cluster:

[2024-03-27T12:49:42,962][WARN ][o.e.g.PersistedClusterStateService] [es3] writing cluster state took [15006ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [4] indices and skipped [1598] unchanged indices
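To put that warning in numbers, here is a small Python sketch that pulls the fields out of the log line. The regular expression is written against this one line only and is an assumption about the message format, not an official log schema; `parse_slow_write` is a hypothetical helper.

```python
import re

# Log line copied from the post above.
LOG_LINE = (
    "[2024-03-27T12:49:42,962][WARN ][o.e.g.PersistedClusterStateService] [es3] "
    "writing cluster state took [15006ms] which is above the warn threshold of [10s]; "
    "wrote global metadata [false] and metadata for [4] indices "
    "and skipped [1598] unchanged indices"
)

PATTERN = re.compile(
    r"writing cluster state took \[(?P<took_ms>\d+)ms\].*?"
    r"warn threshold of \[(?P<threshold>\d+)(?P<unit>ms|s)\].*?"
    r"metadata for \[(?P<written>\d+)\] indices and skipped \[(?P<skipped>\d+)\]"
)


def parse_slow_write(line):
    """Extract timings and index counts from a slow cluster-state write warning."""
    match = PATTERN.search(line)
    if match is None:
        return None
    took_ms = int(match["took_ms"])
    # Normalise the threshold to milliseconds.
    threshold_ms = int(match["threshold"]) * (1000 if match["unit"] == "s" else 1)
    written = int(match["written"])
    return {
        "took_ms": took_ms,
        "threshold_ms": threshold_ms,
        "indices_written": written,
        "indices_skipped": int(match["skipped"]),
        # Average time per index whose metadata was actually written.
        "ms_per_index": took_ms / written if written else None,
    }


info = parse_slow_write(LOG_LINE)
print(info)
```

For this line the write took 15006 ms for 4 indices, which averages about 3.75 seconds per index. That may point at slow writes to the data path on the master node, though the log line alone does not prove it.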

What is the hardware specification of the cluster? What type of storage are you using?

I have 3 nodes as follows:
node1: 64 GB RAM, 32 GB heap size, 15 TB NVMe SSD, 8 cores
node2: 128 GB RAM, 64 GB heap size, 15 TB NVMe SSD, 12 cores
node3: 128 GB RAM, 64 GB heap size, 15 TB NVMe SSD, 12 cores

All the nodes are master eligible.

Additional info

shards disk.indices disk.used disk.avail disk.total disk.percent host            ip              node
  1390        3.4tb     5.5tb      8.8tb     14.4tb           38 192.168.10.52 192.168.10.52 es1
   192      103.1gb     3.3tb       11tb     14.4tb           23 192.168.10.81 192.168.10.81 es2
  1344          5tb     6.1tb      8.3tb     14.4tb           42 192.168.10.25 192.168.10.25 es3
  1703                                                                                           UNASSIGNED
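The allocation table above can be tallied with a short Python sketch. It assumes the first column is the shard count and the last column is the node name, which holds for this output, including the UNASSIGNED row; `shards_by_node` is a hypothetical helper.

```python
# Allocation output copied from the post above.
ALLOCATION = """\
shards disk.indices disk.used disk.avail disk.total disk.percent host            ip              node
  1390        3.4tb     5.5tb      8.8tb     14.4tb           38 192.168.10.52 192.168.10.52 es1
   192      103.1gb     3.3tb       11tb     14.4tb           23 192.168.10.81 192.168.10.81 es2
  1344          5tb     6.1tb      8.3tb     14.4tb           42 192.168.10.25 192.168.10.25 es3
  1703                                                                                           UNASSIGNED
"""


def shards_by_node(text):
    """Map node name -> shard count, using the first and last column of each row."""
    counts = {}
    for row in text.strip().splitlines()[1:]:  # skip the header row
        fields = row.split()
        if not fields:
            continue
        counts[fields[-1]] = int(fields[0])
    return counts


counts = shards_by_node(ALLOCATION)
assigned = sum(n for node, n in counts.items() if node != "UNASSIGNED")
print(counts)
print("assigned:", assigned, "unassigned:", counts.get("UNASSIGNED", 0))
```

At the time of this output, 2926 shards were assigned and 1703 were unassigned, with es2 holding only 192 shards compared with 1390 and 1344 on the other two nodes.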
