Publication of cluster state fails - followers check retry count exceeded

Hi everyone,

We are running Elasticsearch 8.6.1 (on Kubernetes) with 12 data nodes (pods) and 3 dedicated master pods.

We are heavily indexing data (bulk requests) and made some configuration changes to improve indexing throughput:

indices.memory.index_buffer_size: 30%
cluster.routing.allocation.node_concurrent_recoveries: 16
indices.recovery.max_bytes_per_sec: 60mb
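For context, in an ECK setup these would typically go under each nodeSet's config section of the Elasticsearch manifest - roughly like this sketch (the nodeSet name and count here are illustrative):

spec:
  nodeSets:
    - name: data-wrk
      count: 12
      config:
        indices.memory.index_buffer_size: "30%"
        cluster.routing.allocation.node_concurrent_recoveries: 16
        indices.recovery.max_bytes_per_sec: 60mb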

The problem is that the data nodes/pods are being replaced too frequently (on a daily basis).
While CPU/memory look fine (as do all the other metrics we can see), the only thing we found is these logs on the active master node:

after [10s] publication of cluster state version [691889] is still waiting for {elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{tN5CydPoStCQOHwZc9s79g}{elasticsearch-aud-es-data-wrk-1}{10.2.40.198}{10.2.40.198:9300}{d}{k8s_node_name=ip-10-2-43-79.ec2.internal, xpack.installed=true, zone=WORKER} [SENT_APPLY_COMMIT]
after [30s] publication of cluster state version [691888] is still waiting for {elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{tN5CydPoStCQOHwZc9s79g}{elasticsearch-aud-es-data-wrk-1}{10.2.40.198}{10.2.40.198:9300}{d}{k8s_node_name=ip-10-2-43-79.ec2.internal, xpack.installed=true, zone=WORKER} [SENT_APPLY_COMMIT]

After the second log, the node is removed from the cluster and a new pod is added in its place.

I read about publishing the cluster state, and it seems I can increase cluster.publish.timeout from 30s to 60s or so - but I wonder if there's a more suitable solution, maybe increasing the number of listening threads (not sure which one), etc.

UPDATE: our cluster runs on AWS (EKS).

I'd appreciate your advice,
Thanks!

The cluster state needs to be written to disk on all nodes. What do disk I/O and iowait look like on the most heavily loaded nodes in the cluster? What type of storage are you using for the different node types?
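For example, something like this (assuming iostat/vmstat are available in the pod or on the host) would show whether the disks are the bottleneck:

iostat -x 5 3   # per-device utilisation, await and queue sizes, every 5s
vmstat 5 3      # the "wa" column is CPU time spent waiting on I/O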


Before taking any action I suggest you follow the troubleshooting guide for an unstable cluster. If you need help interpreting the results, please share them here. It's possible that you are hitting a bug, and it'd be better to report it so we can fix it rather than working around it by adjusting settings. The cluster.publish.timeout setting is an expert setting and the docs warn you not to change it.


Thanks guys!

Below you can see the EBS statistics for one of the data nodes, from AWS (attached volume spec: gp2, 500 GiB, 1,500 IOPS):

@DavidTurner I read the guide and found that I have ~5 node-left events per day, of type followers check retry count exceeded - all due to timeouts. Here are a few examples:

{"@timestamp":"2023-04-12T08:41:01.436Z", "log.level": "INFO",  "current.health":"RED","message":"Cluster health status changed from [GREEN] to [RED] (reason: [{elasticsearch-aud-es-data-wrk-5}{Cn2_ybugQhqD3S4jR82pqA}{obIy8DnlQeG_fPqPzYEmsw}{elasticsearch-aud-es-data-wrk-5}{10.2.4.252}{10.2.4.252:9300}{d} reason: followers check retry count exceeded [timeouts=3, failures=0]]).","previous.health":"GREEN","reason":"{elasticsearch-aud-es-data-wrk-5}{Cn2_ybugQhqD3S4jR82pqA}{obIy8DnlQeG_fPqPzYEmsw}{elasticsearch-aud-es-data-wrk-5}{10.2.4.252}{10.2.4.252:9300}{d} reason: followers check retry count exceeded [timeouts=3, failures=0]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-12T08:41:01.455Z", "log.level": "INFO", "message":"node-left[{elasticsearch-aud-es-data-wrk-5}{Cn2_ybugQhqD3S4jR82pqA}{obIy8DnlQeG_fPqPzYEmsw}{elasticsearch-aud-es-data-wrk-5}{10.2.4.252}{10.2.4.252:9300}{d} reason: followers check retry count exceeded [timeouts=3, failures=0]], term: 11, version: 552767, delta: removed {{elasticsearch-aud-es-data-wrk-5}{Cn2_ybugQhqD3S4jR82pqA}{obIy8DnlQeG_fPqPzYEmsw}{elasticsearch-aud-es-data-wrk-5}{10.2.4.252}{10.2.4.252:9300}{d}}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.service.MasterService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-11T20:37:21.767Z", "log.level": "INFO", "message":"node-join[{elasticsearch-aud-es-data-srv-1}{hrRiHMw2RSq5MP18WVFA1A}{3Qg76CfgRXiyfNqafiwC-w}{elasticsearch-aud-es-data-srv-1}{10.2.48.5}{10.2.48.5:9300}{d} joining after restart, removed [12.6m/758262ms] ago with reason [followers check retry count exceeded [timeouts=3, failures=0]]], term: 11, version: 536695, delta: added {{elasticsearch-aud-es-data-srv-1}{hrRiHMw2RSq5MP18WVFA1A}{3Qg76CfgRXiyfNqafiwC-w}{elasticsearch-aud-es-data-srv-1}{10.2.48.5}{10.2.48.5:9300}{d}}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.service.MasterService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}

There are also several disconnected events (more than 5 per day):

{"@timestamp":"2023-04-12T09:29:00.139Z", "log.level": "INFO",  "current.health":"RED","message":"Cluster health status changed from [GREEN] to [RED] (reason: [{elasticsearch-aud-es-data-wrk-6}{xjC-9RqNSxW4s6IRBdrm6A}{hE3TsSrZRmmgZjOBn2Z0FQ}{elasticsearch-aud-es-data-wrk-6}{10.2.14.175}{10.2.14.175:9300}{d} reason: disconnected]).","previous.health":"GREEN","reason":"{elasticsearch-aud-es-data-wrk-6}{xjC-9RqNSxW4s6IRBdrm6A}{hE3TsSrZRmmgZjOBn2Z0FQ}{elasticsearch-aud-es-data-wrk-6}{10.2.14.175}{10.2.14.175:9300}{d} reason: disconnected" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-12T09:29:00.159Z", "log.level": "INFO", "message":"node-left[{elasticsearch-aud-es-data-wrk-6}{xjC-9RqNSxW4s6IRBdrm6A}{hE3TsSrZRmmgZjOBn2Z0FQ}{elasticsearch-aud-es-data-wrk-6}{10.2.14.175}{10.2.14.175:9300}{d} reason: disconnected], term: 11, version: 554447, delta: removed {{elasticsearch-aud-es-data-wrk-6}{xjC-9RqNSxW4s6IRBdrm6A}{hE3TsSrZRmmgZjOBn2Z0FQ}{elasticsearch-aud-es-data-wrk-6}{10.2.14.175}{10.2.14.175:9300}{d}}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.service.MasterService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}

The word lagging doesn't appear anywhere in my (master-eligible) logs.

Here is how a random data node looks over the last 6 days:

Checking the TCP retransmission setting inside the pods shows:

elasticsearch@elasticsearch-aud-es-data-wrk-5:~$ sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15

Is there anything else I can provide that might help solve this issue?

Thanks again

We recommend setting net.ipv4.tcp_retries2 to 5.
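On a plain Linux host that's just (as root):

sysctl -w net.ipv4.tcp_retries2=5

and it can be persisted via /etc/sysctl.conf or a drop-in file under /etc/sysctl.d/.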

Yes, the docs I linked describe the next things to consider. The disconnected events suggest a flaky network, and no amount of adjusting ES settings will help with that. The followers check retry count exceeded ones could also be a flaky network, or VM/GC pauses, or a thread starvation bug. You can rule out VM & GC pauses from the server logs. To rule out a thread starvation bug, we'd want to see a couple of jstack dumps from the 30s before the node-left event. Since you don't know when that'll be, it's simplest to just leave jstack && sleep 10 running in a loop.
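Something along these lines would do (a rough sketch - <es-pid> and the output path are placeholders; jstack ships with the JDK bundled with Elasticsearch):

while true; do
  jstack <es-pid> >> /path/that/survives/restarts/jstack.log
  sleep 10
done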

I would love to change net.ipv4.tcp_retries2 to 5. Do you have an idea how to do it with the Elastic operator (for Kubernetes)? Alternatively, can you tell me whether this should be configured at the node level or in the pod/container?

Same about the jstack - I would love to try it, but since I have to run it on a pod which will eventually die and be replaced by a new one, it will be harder to save this info. Do you have an idea for that? :slight_smile:

Sorry, I don't know either of these things. It's possible that pods running under ECK inherit the value of settings like this from the host, so maybe try that? Otherwise, if ECK doesn't set this and doesn't let you set it, would you raise that as a bug with the ECK folks?

With the jstack thing I expect it's possible to use nsenter to access the contents of a pod once it's running, but again I don't know the details.

Well, I did the first part with an initContainer (although I think it should come out of the box with the Elastic operator if it's recommended) - I'll check it with them.
I wonder if that should be applied to both master + data nodes or to data nodes only (in my case, the ones hitting timeout and followers check retry count exceeded).

I am still working on the jstack - I'll share the output once I get something :slight_smile:

The recommendation applies to all nodes. The default of 15 (a ~900-second timeout) basically comes from RFC 1122 (October 1989) and is just ludicrously long for any reasonably modern system.


So I opened an issue for ECK: Proposal - support TCP retransmission timeout · Issue #6698 · elastic/cloud-on-k8s · GitHub
(meanwhile I did it manually with an initContainer, as described in the ticket).
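For reference, the relevant part of the ECK manifest looks roughly like this (a sketch - the nodeSet name is illustrative, and the same block goes on every nodeSet):

spec:
  nodeSets:
    - name: data-wrk
      podTemplate:
        spec:
          initContainers:
            - name: sysctl
              securityContext:
                privileged: true
                runAsUser: 0
              command: ["sh", "-c", "sysctl -w net.ipv4.tcp_retries2=5"]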

Back to the jstack - it turns out it comes with Java (not a Java guy :slight_smile: ) but wasn't in the PATH.
Since Elasticsearch runs as a StatefulSet with a persistent volume, I can store the output there:

elasticsearch@elasticsearch-aud-es-data-wrk-1:~$ echo "while true; do /usr/share/elasticsearch/jdk/bin/jstack 68 >> /usr/share/elasticsearch/data/dump.log; sleep 10; done" > run.sh
elasticsearch@elasticsearch-aud-es-data-wrk-1:~$ chmod +x run.sh
elasticsearch@elasticsearch-aud-es-data-wrk-1:~$ nohup ./run.sh &

After a data node restart, here is the tail of the log file:

"elasticsearch[elasticsearch-aud-es-data-wrk-6][system_critical_read][T#3]" #119 [1737] daemon prio=5 os_prio=0 cpu=126.03ms elapsed=8916.80s tid=0x00007f6e78199ce0 nid=1737 waiting on condition  [0x00007f6e2faf9000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401dadcf8> (a java.util.concurrent.LinkedTransferQueue)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:371)
	at java.util.concurrent.LinkedTransferQueue$Node.block(java.base@19.0.1/LinkedTransferQueue.java:470)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:669)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.take(java.base@19.0.1/LinkedTransferQueue.java:1286)
	at org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(org.elasticsearch.server@8.6.1/SizeBlockingQueue.java:152)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][system_read][T#4]" #120 [1767] daemon prio=5 os_prio=0 cpu=76.15ms elapsed=8909.71s tid=0x00007f6e781a0820 nid=1767 waiting on condition  [0x00007f6eac3fb000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401ec8a48> (a java.util.concurrent.LinkedTransferQueue)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:371)
	at java.util.concurrent.LinkedTransferQueue$Node.block(java.base@19.0.1/LinkedTransferQueue.java:470)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:669)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.take(java.base@19.0.1/LinkedTransferQueue.java:1286)
	at org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(org.elasticsearch.server@8.6.1/SizeBlockingQueue.java:152)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][system_critical_read][T#4]" #121 [1768] daemon prio=5 os_prio=0 cpu=121.95ms elapsed=8907.60s tid=0x00007f6e84114b30 nid=1768 waiting on condition  [0x00007f6e2f7f6000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401dadcf8> (a java.util.concurrent.LinkedTransferQueue)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:371)
	at java.util.concurrent.LinkedTransferQueue$Node.block(java.base@19.0.1/LinkedTransferQueue.java:470)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:669)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.take(java.base@19.0.1/LinkedTransferQueue.java:1286)
	at org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(org.elasticsearch.server@8.6.1/SizeBlockingQueue.java:152)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][management][T#3]" #122 [1799] daemon prio=5 os_prio=0 cpu=2490.94ms elapsed=8897.74s tid=0x00007f6e881173f0 nid=1799 waiting on condition  [0x00007f6e2f6f5000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401dbae88> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@19.0.1/LockSupport.java:269)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:676)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.poll(java.base@19.0.1/LinkedTransferQueue.java:1294)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1069)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][management][T#4]" #126 [3726] daemon prio=5 os_prio=0 cpu=2408.29ms elapsed=8236.19s tid=0x00007f6e5c153930 nid=3726 waiting on condition  [0x00007f6eae4e3000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401dbae88> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@19.0.1/LockSupport.java:269)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:676)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.poll(java.base@19.0.1/LinkedTransferQueue.java:1294)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1069)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][system_write][T#1]" #129 [6336] daemon prio=5 os_prio=0 cpu=1.54ms elapsed=7307.10s tid=0x00007f6ee11dbe30 nid=6336 waiting on condition  [0x00007f6c748be000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401ec7650> (a java.util.concurrent.LinkedTransferQueue)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:371)
	at java.util.concurrent.LinkedTransferQueue$Node.block(java.base@19.0.1/LinkedTransferQueue.java:470)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:669)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.take(java.base@19.0.1/LinkedTransferQueue.java:1286)
	at org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(org.elasticsearch.server@8.6.1/SizeBlockingQueue.java:152)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][system_write][T#2]" #130 [6507] daemon prio=5 os_prio=0 cpu=1.23ms elapsed=7247.09s tid=0x00007f6e90072880 nid=6507 waiting on condition  [0x00007f6e0a832000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401ec7650> (a java.util.concurrent.LinkedTransferQueue)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:371)
	at java.util.concurrent.LinkedTransferQueue$Node.block(java.base@19.0.1/LinkedTransferQueue.java:470)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:669)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.take(java.base@19.0.1/LinkedTransferQueue.java:1286)
	at org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(org.elasticsearch.server@8.6.1/SizeBlockingQueue.java:152)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][system_write][T#3]" #131 [6760] daemon prio=5 os_prio=0 cpu=1.18ms elapsed=7157.10s tid=0x00007f6e90080de0 nid=6760 waiting on condition  [0x00007f6e0a933000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401ec7650> (a java.util.concurrent.LinkedTransferQueue)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:371)
	at java.util.concurrent.LinkedTransferQueue$Node.block(java.base@19.0.1/LinkedTransferQueue.java:470)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:669)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.take(java.base@19.0.1/LinkedTransferQueue.java:1286)
	at org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(org.elasticsearch.server@8.6.1/SizeBlockingQueue.java:152)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][system_write][T#4]" #132 [6761] daemon prio=5 os_prio=0 cpu=0.74ms elapsed=7157.09s tid=0x00007f6e800265b0 nid=6761 waiting on condition  [0x00007f6dfa20e000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401ec7650> (a java.util.concurrent.LinkedTransferQueue)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:371)
	at java.util.concurrent.LinkedTransferQueue$Node.block(java.base@19.0.1/LinkedTransferQueue.java:470)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:669)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.take(java.base@19.0.1/LinkedTransferQueue.java:1286)
	at org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(org.elasticsearch.server@8.6.1/SizeBlockingQueue.java:152)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][fetch_shard_started][T#1]" #133 [8311] daemon prio=5 os_prio=0 cpu=24.67ms elapsed=6610.97s tid=0x00007f6e5810c030 nid=8311 waiting on condition  [0x00007f6ead9d8000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401ecaca8> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:371)
	at java.util.concurrent.LinkedTransferQueue$Node.block(java.base@19.0.1/LinkedTransferQueue.java:470)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:669)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.take(java.base@19.0.1/LinkedTransferQueue.java:1286)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][fetch_shard_store][T#4]" #148 [8345] daemon prio=5 os_prio=0 cpu=39.48ms elapsed=6605.05s tid=0x00007f6e5c13c4f0 nid=8345 waiting on condition  [0x00007f6ead0cf000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401da2c90> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:371)
	at java.util.concurrent.LinkedTransferQueue$Node.block(java.base@19.0.1/LinkedTransferQueue.java:470)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:669)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.take(java.base@19.0.1/LinkedTransferQueue.java:1286)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][flush][T#4]" #158 [9199] daemon prio=5 os_prio=0 cpu=9564.92ms elapsed=6304.20s tid=0x00007f6eb400e110 nid=9199 waiting on condition  [0x00007f6c74ac0000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401dad208> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:371)
	at java.util.concurrent.LinkedTransferQueue$Node.block(java.base@19.0.1/LinkedTransferQueue.java:470)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:669)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.take(java.base@19.0.1/LinkedTransferQueue.java:1286)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][force_merge][T#1]" #162 [13172] daemon prio=5 os_prio=0 cpu=1289.80ms elapsed=4890.60s tid=0x00007f6e781a80b0 nid=13172 waiting on condition  [0x00007f6d0823e000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401ec8df0> (a java.util.concurrent.LinkedTransferQueue)
	at java.util.concurrent.locks.LockSupport.park(java.base@19.0.1/LockSupport.java:371)
	at java.util.concurrent.LinkedTransferQueue$Node.block(java.base@19.0.1/LinkedTransferQueue.java:470)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@19.0.1/ForkJoinPool.java:3744)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@19.0.1/ForkJoinPool.java:3689)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:669)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.take(java.base@19.0.1/LinkedTransferQueue.java:1286)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1070)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"Attach Listener" #187 [17681] daemon prio=9 os_prio=0 cpu=49.38ms elapsed=3319.59s tid=0x00007f6f00000e90 nid=17681 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"elasticsearch[elasticsearch-aud-es-data-wrk-6][warmer][T#21]" #193 [23143] daemon prio=5 os_prio=0 cpu=23.47ms elapsed=2199.88s tid=0x00007f6ec8017f80 nid=23143 waiting on condition  [0x00007f6cac39b000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401daebb8> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@19.0.1/LockSupport.java:269)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:676)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.poll(java.base@19.0.1/LinkedTransferQueue.java:1294)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1069)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][warmer][T#22]" #195 [838] daemon prio=5 os_prio=0 cpu=22.71ms elapsed=115.52s tid=0x00007f6ee004c160 nid=838 waiting on condition  [0x00007f6c747bd000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401daebb8> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@19.0.1/LockSupport.java:269)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:676)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.poll(java.base@19.0.1/LinkedTransferQueue.java:1294)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1069)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][warmer][T#23]" #196 [839] daemon prio=5 os_prio=0 cpu=23.98ms elapsed=115.52s tid=0x00007f6ee0b4f730 nid=839 waiting on condition  [0x00007f6c8cb88000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401daebb8> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@19.0.1/LockSupport.java:269)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:676)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.poll(java.base@19.0.1/LinkedTransferQueue.java:1294)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1069)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"elasticsearch[elasticsearch-aud-es-data-wrk-6][warmer][T#24]" #197 [840] daemon prio=5 os_prio=0 cpu=22.44ms elapsed=115.52s tid=0x00007f6ee0b50050 nid=840 waiting on condition  [0x00007f6cac29a000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@19.0.1/Native Method)
	- parking to wait for  <0x0000000401daebb8> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@19.0.1/LockSupport.java:269)
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@19.0.1/LinkedTransferQueue.java:676)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@19.0.1/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.poll(java.base@19.0.1/LinkedTransferQueue.java:1294)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@19.0.1/ThreadPoolExecutor.java:1069)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@19.0.1/ThreadPoolExecutor.java:1130)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@19.0.1/ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(java.base@19.0.1/Thread.java:1589)

"VM Thread" os_prio=0 cpu=1146.05ms elapsed=9464.28s tid=0x00007f6f807872e0 nid=83 runnable

"GC Thread#0" os_prio=0 cpu=2507.15ms elapsed=9465.61s tid=0x00007f6f8005e7a0 nid=71 runnable

"GC Thread#1" os_prio=0 cpu=2512.33ms elapsed=9465.61s tid=0x00007f6f80094080 nid=74 runnable

"GC Thread#2" os_prio=0 cpu=2510.48ms elapsed=9465.61s tid=0x00007f6f80094ef0 nid=75 runnable

"GC Thread#3" os_prio=0 cpu=2511.91ms elapsed=9465.61s tid=0x00007f6f80095d60 nid=76 runnable

"GC Thread#4" os_prio=0 cpu=2509.38ms elapsed=9465.61s tid=0x00007f6f80096bd0 nid=77 runnable

"GC Thread#5" os_prio=0 cpu=2509.71ms elapsed=9465.61s tid=0x00007f6f80097a40 nid=78 runnable

"GC Thread#6" os_prio=0 cpu=2507.25ms elapsed=9465.61s tid=0x00007f6f800988b0 nid=79 runnable

"GC Thread#7" os_prio=0 cpu=2511.68ms elapsed=9465.61s tid=0x00007f6f80099720 nid=80 runnable

"G1 Main Marker" os_prio=0 cpu=1.52ms elapsed=9465.61s tid=0x00007f6f8006fdd0 nid=72 runnable

"G1 Conc#0" os_prio=0 cpu=127.87ms elapsed=9465.61s tid=0x00007f6f80070cf0 nid=73 runnable

"G1 Conc#1" os_prio=0 cpu=133.25ms elapsed=9462.47s tid=0x00007f6f38000c10 nid=104 runnable

"G1 Refine#0" os_prio=0 cpu=0.10ms elapsed=9464.29s tid=0x00007f6f8075a4d0 nid=81 runnable

"G1 Service" os_prio=0 cpu=5283.89ms elapsed=9464.29s tid=0x00007f6f8075b4b0 nid=82 runnable

"VM Periodic Task Thread" os_prio=0 cpu=3017.88ms elapsed=9464.12s tid=0x00007f6f8002b2a0 nid=96 waiting on condition

JNI global refs: 33, weak refs: 45

(I can add more / look for relevant info if needed)

Meanwhile, since those nodes are heavily indexing and barely searched, I tried to double thread_pool.write.size, but I was blocked by an exception stating that I cannot set it higher than 9 (I tried to set it to 16, and the instance has 8 cores). Could that be a good direction?

There's nothing of interest in the fragment of the stack dump you shared - all these threads are idle. Really we want to see the full dumps from the 30s leading up to a node leaving with reason followers check retry count exceeded, both from the node and the master, and the logs from the node and the master for that time period too.

That won't help, and might be harmful, so it is forbidden indeed.

Hello again!

Master logs from before data node elasticsearch-aud-es-data-wrk-1 left:

{"@timestamp":"2023-04-24T14:28:37.591Z", "log.level": "INFO", "message":"close connection exception caught on transport layer [Netty4TcpChannel{localAddress=/10.2.38.145:9300, remoteAddress=/10.2.19.122:60600, profile=default}], disconnecting from relevant node: Connection timed out", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][transport_worker][T#2]","log.logger":"org.elasticsearch.transport.TcpTransport","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T14:28:43.735Z", "log.level": "INFO", "message":"close connection exception caught on transport layer [Netty4TcpChannel{localAddress=/10.2.38.145:9300, remoteAddress=/10.2.19.122:60620, profile=default}], disconnecting from relevant node: Connection timed out", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][transport_worker][T#2]","log.logger":"org.elasticsearch.transport.TcpTransport","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T14:29:10.359Z", "log.level": "INFO", "message":"close connection exception caught on transport layer [Netty4TcpChannel{localAddress=/10.2.38.145:9300, remoteAddress=/10.2.19.122:60636, profile=default}], disconnecting from relevant node: Connection timed out", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][transport_worker][T#1]","log.logger":"org.elasticsearch.transport.TcpTransport","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:06:28.035Z", "log.level": "INFO", "message":"after [10s] publication of cluster state version [895541] is still waiting for {elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{nvUgm5GURpKLchZkRkEZqA}{elasticsearch-aud-es-data-wrk-1}{10.2.60.196}{10.2.60.196:9300}{d}{k8s_node_name=ip-10-2-62-194.ec2.internal, zone=WORKER, xpack.installed=true} [SENT_PUBLISH_REQUEST]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.Coordinator.CoordinatorPublication","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:06:35.952Z", "log.level": "WARN", "message":"failed to retrieve stats for node [_p-uv9aASjKMyvpCmlLu-w]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][generic][T#14]","log.logger":"org.elasticsearch.cluster.InternalClusterInfoService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud","error.type":"org.elasticsearch.transport.ReceiveTimeoutTransportException","error.message":"[elasticsearch-aud-es-data-wrk-1][10.2.60.196:9300][cluster:monitor/nodes/stats[n]] request_id [9997711] timed out after [15005ms]","error.stack_trace":"org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-aud-es-data-wrk-1][10.2.60.196:9300][cluster:monitor/nodes/stats[n]] request_id [9997711] timed out after [15005ms]\n"}
{"@timestamp":"2023-04-24T15:06:35.959Z", "log.level": "WARN", "message":"failed to retrieve shard stats from node [_p-uv9aASjKMyvpCmlLu-w]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][management][T#2]","log.logger":"org.elasticsearch.cluster.InternalClusterInfoService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud","error.type":"org.elasticsearch.transport.ReceiveTimeoutTransportException","error.message":"[elasticsearch-aud-es-data-wrk-1][10.2.60.196:9300][indices:monitor/stats[n]] request_id [9997723] timed out after [15005ms]","error.stack_trace":"org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-aud-es-data-wrk-1][10.2.60.196:9300][indices:monitor/stats[n]] request_id [9997723] timed out after [15005ms]\n"}
{"@timestamp":"2023-04-24T15:06:48.037Z", "log.level": "WARN", "message":"after [30s] publication of cluster state version [895541] is still waiting for {elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{nvUgm5GURpKLchZkRkEZqA}{elasticsearch-aud-es-data-wrk-1}{10.2.60.196}{10.2.60.196:9300}{d}{k8s_node_name=ip-10-2-62-194.ec2.internal, zone=WORKER, xpack.installed=true} [SENT_PUBLISH_REQUEST]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][clusterApplierService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.coordination.Coordinator.CoordinatorPublication","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:06:58.042Z", "log.level": "INFO", "message":"after [10s] publication of cluster state version [895542] is still waiting for {elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{nvUgm5GURpKLchZkRkEZqA}{elasticsearch-aud-es-data-wrk-1}{10.2.60.196}{10.2.60.196:9300}{d}{k8s_node_name=ip-10-2-62-194.ec2.internal, zone=WORKER, xpack.installed=true} [SENT_PUBLISH_REQUEST]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.Coordinator.CoordinatorPublication","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:07:18.043Z", "log.level": "WARN", "message":"after [30s] publication of cluster state version [895542] is still waiting for {elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{nvUgm5GURpKLchZkRkEZqA}{elasticsearch-aud-es-data-wrk-1}{10.2.60.196}{10.2.60.196:9300}{d}{k8s_node_name=ip-10-2-62-194.ec2.internal, zone=WORKER, xpack.installed=true} [SENT_PUBLISH_REQUEST]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][clusterApplierService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.coordination.Coordinator.CoordinatorPublication","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:07:18.049Z", "log.level": "INFO",  "current.health":"YELLOW","message":"Cluster health status changed from [GREEN] to [YELLOW] (reason: [{elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{nvUgm5GURpKLchZkRkEZqA}{elasticsearch-aud-es-data-wrk-1}{10.2.60.196}{10.2.60.196:9300}{d} reason: followers check retry count exceeded [timeouts=3, failures=0]]).","previous.health":"GREEN","reason":"{elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{nvUgm5GURpKLchZkRkEZqA}{elasticsearch-aud-es-data-wrk-1}{10.2.60.196}{10.2.60.196:9300}{d} reason: followers check retry count exceeded [timeouts=3, failures=0]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:07:18.055Z", "log.level": "INFO", "message":"node-left[{elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{nvUgm5GURpKLchZkRkEZqA}{elasticsearch-aud-es-data-wrk-1}{10.2.60.196}{10.2.60.196:9300}{d} reason: followers check retry count exceeded [timeouts=3, failures=0]], term: 17, version: 895543, delta: removed {{elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{nvUgm5GURpKLchZkRkEZqA}{elasticsearch-aud-es-data-wrk-1}{10.2.60.196}{10.2.60.196:9300}{d}}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.service.MasterService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:07:18.112Z", "log.level": "INFO", "message":"removed {{elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{nvUgm5GURpKLchZkRkEZqA}{elasticsearch-aud-es-data-wrk-1}{10.2.60.196}{10.2.60.196:9300}{d}}, term: 17, version: 895543, reason: Publication{term=17, version=895543}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][clusterApplierService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.service.ClusterApplierService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:07:18.114Z", "log.level": "INFO", "message":"scheduling reroute for delayed shards in [59.9s] (7 delayed shards)", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][clusterApplierService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.DelayedAllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:07:18.147Z", "log.level": "WARN", "message":"failed to retrieve stats for node [_p-uv9aASjKMyvpCmlLu-w]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][generic][T#14]","log.logger":"org.elasticsearch.cluster.InternalClusterInfoService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud","error.type":"org.elasticsearch.transport.NodeDisconnectedException","error.message":"[elasticsearch-aud-es-data-wrk-1][10.2.60.196:9300][cluster:monitor/nodes/stats[n]] disconnected","error.stack_trace":"org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-aud-es-data-wrk-1][10.2.60.196:9300][cluster:monitor/nodes/stats[n]] disconnected\n"}
{"@timestamp":"2023-04-24T15:07:18.150Z", "log.level": "WARN", "message":"failed to retrieve shard stats from node [_p-uv9aASjKMyvpCmlLu-w]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][management][T#1]","log.logger":"org.elasticsearch.cluster.InternalClusterInfoService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud","error.type":"org.elasticsearch.transport.NodeDisconnectedException","error.message":"[elasticsearch-aud-es-data-wrk-1][10.2.60.196:9300][indices:monitor/stats[n]] disconnected","error.stack_trace":"org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-aud-es-data-wrk-1][10.2.60.196:9300][indices:monitor/stats[n]] disconnected\n"}
{"@timestamp":"2023-04-24T15:08:18.047Z", "log.level": "INFO", "message":"scheduling reroute for delayed shards in [0s] (7 delayed shards)", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.DelayedAllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:08:18.132Z", "log.level": "WARN", "message":"[.apm-custom-link][0] marking unavailable shards as stale: [uHHrXJHaThWfE2ML7KtzBw]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:08:18.278Z", "log.level": "WARN", "message":"[.monitoring-kibana-7-2023.04.21][0] marking unavailable shards as stale: [5n9GMhB8TXShQRUW8TCt9Q]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:08:18.330Z", "log.level": "WARN", "message":"[.monitoring-kibana-7-2023.04.22][0] marking unavailable shards as stale: [eJKg9PalRbmnUEcTGTB3ZA]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:08:18.424Z", "log.level": "WARN", "message":"[.tasks][0] marking unavailable shards as stale: [k3okS39eS1eK34bFEIC5ag]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:08:18.654Z", "log.level": "WARN", "message":"[.kibana-event-log-8.6.1-000001][0] marking unavailable shards as stale: [8MFgfVSiTaWJ7Cr5mKSVOw]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:08:18.803Z", "log.level": "WARN", "message":"[.monitoring-kibana-7-2023.04.19][0] marking unavailable shards as stale: [98tPK7ZYQUOs29pDX4dSTg]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:09:55.968Z", "log.level": "WARN", "message":"[.monitoring-es-7-2023.04.20][0] marking unavailable shards as stale: [gg3Gn-uOQl61bK7dK3QReA]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:09:56.190Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.monitoring-es-7-2023.04.20][0]]]).","previous.health":"YELLOW","reason":"shards started [[.monitoring-es-7-2023.04.20][0]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:14:38.951Z", "log.level": "INFO", "message":"[8776374_23-04-24_15-14-35] creating index, cause [api], templates [], shards [13]/[0]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:14:39.124Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[8776374_23-04-24_15-14-35][6], [8776374_23-04-24_15-14-35][4], [8776374_23-04-24_15-14-35][7], [8776374_23-04-24_15-14-35][3], [8776374_23-04-24_15-14-35][2], [8776374_23-04-24_15-14-35][11], [8776374_23-04-24_15-14-35][1], [8776374_23-04-24_15-14-35][8], [8776374_23-04-24_15-14-35][10], [8776374_23-04-24_15-14-35][12], ... [12 items in total]]]).","previous.health":"YELLOW","reason":"shards started [[8776374_23-04-24_15-14-35][6], [8776374_23-04-24_15-14-35][4], [8776374_23-04-24_15-14-35][7], [8776374_23-04-24_15-14-35][3], [8776374_23-04-24_15-14-35][2], [8776374_23-04-24_15-14-35][11], [8776374_23-04-24_15-14-35][1], [8776374_23-04-24_15-14-35][8], [8776374_23-04-24_15-14-35][10], [8776374_23-04-24_15-14-35][12], ... [12 items in total]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:14:40.127Z", "log.level": "INFO", "message":"[8776374_23-04-24_15-14-35/KZkYD3UlTyek5tGcWss4YQ] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:14:40.193Z", "log.level": "INFO", "message":"[8776374_23-04-24_15-14-35/KZkYD3UlTyek5tGcWss4YQ] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:14:40.240Z", "log.level": "INFO", "message":"[8776374_23-04-24_15-14-35/KZkYD3UlTyek5tGcWss4YQ] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:14:40.293Z", "log.level": "INFO", "message":"[8776374_23-04-24_15-14-35/KZkYD3UlTyek5tGcWss4YQ] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:14:40.301Z", "log.level": "INFO", "message":"[8776374_23-04-24_15-14-35/KZkYD3UlTyek5tGcWss4YQ] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:14:40.354Z", "log.level": "INFO", "message":"[8776374_23-04-24_15-14-35/KZkYD3UlTyek5tGcWss4YQ] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:14:46.911Z", "log.level": "INFO", "message":"node-join[{elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{ILLoNX5WT_myTUT6tHyelg}{elasticsearch-aud-es-data-wrk-1}{10.2.61.208}{10.2.61.208:9300}{d} joining after restart, removed [7.4m/448837ms] ago with reason [followers check retry count exceeded [timeouts=3, failures=0]]], term: 17, version: 895655, delta: added {{elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{ILLoNX5WT_myTUT6tHyelg}{elasticsearch-aud-es-data-wrk-1}{10.2.61.208}{10.2.61.208:9300}{d}}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.service.MasterService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:14:48.356Z", "log.level": "INFO", "message":"added {{elasticsearch-aud-es-data-wrk-1}{_p-uv9aASjKMyvpCmlLu-w}{ILLoNX5WT_myTUT6tHyelg}{elasticsearch-aud-es-data-wrk-1}{10.2.61.208}{10.2.61.208:9300}{d}}, term: 17, version: 895655, reason: Publication{term=17, version=895655}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][clusterApplierService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.service.ClusterApplierService","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:18:35.863Z", "log.level": "INFO", "message":"close connection exception caught on transport layer [Netty4TcpChannel{localAddress=/10.2.38.145:9300, remoteAddress=/10.2.60.196:47424, profile=default}], disconnecting from relevant node: Connection timed out", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][transport_worker][T#1]","log.logger":"org.elasticsearch.transport.TcpTransport","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:18:35.863Z", "log.level": "INFO", "message":"close connection exception caught on transport layer [Netty4TcpChannel{localAddress=/10.2.38.145:9300, remoteAddress=/10.2.60.196:47440, profile=default}], disconnecting from relevant node: Connection timed out", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][transport_worker][T#2]","log.logger":"org.elasticsearch.transport.TcpTransport","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:18:35.863Z", "log.level": "INFO", "message":"close connection exception caught on transport layer [Netty4TcpChannel{localAddress=/10.2.38.145:9300, remoteAddress=/10.2.60.196:47488, profile=default}], disconnecting from relevant node: Connection timed out", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][transport_worker][T#1]","log.logger":"org.elasticsearch.transport.TcpTransport","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}
{"@timestamp":"2023-04-24T15:18:35.863Z", "log.level": "INFO", "message":"close connection exception caught on transport layer [Netty4TcpChannel{localAddress=/10.2.38.145:9300, remoteAddress=/10.2.60.196:47416, profile=default}], disconnecting from relevant node: Connection timed out", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-aud-es-master-1][transport_worker][T#2]","log.logger":"org.elasticsearch.transport.TcpTransport","elasticsearch.cluster.uuid":"2xZgtNgpTAmv_kfkn-jxmw","elasticsearch.node.id":"G4IlCnDPSey6vfW8VvLC1g","elasticsearch.node.name":"elasticsearch-aud-es-master-1","elasticsearch.cluster.name":"elasticsearch-aud"}

Here are the dump files (taken every 10s, with the date printed at the beginning of each):
https://drive.google.com/drive/folders/1GUtBNjMlJ7Fj5KS6qVI5t8VVdBe-Jrq9?usp=share_link

I don't see very many dumps from the data node there, only 2, and neither is from the right time. Are you sure they're all there?

Every file contains multiple dumps (with a 10s sleep between them and the date each one was taken).

Yes, but I only see two such dates in the data node file.

Right (for the data node), because it's just before the pod crashed. Then a new one starts instead, without my jstack script :frowning:
What are we looking for? Is there an example of something wrong that I could search for?

Oh, the data nodes are crashing? That wasn't at all what I thought we were discussing. There would normally be logs about the exception that caused a crash in their logs. But if it was a SIGKILL then there won't be. That might be the OOM killer, in which case check the kernel logs with dmesg.

No no no - sorry for the confusion (I answered from mobile on my way home).

Please let me describe the setup - I hope it will be clear and something might come up:
2 master nodes, 12 data nodes (4 serving, 8 indexing - more about that below).
In our use case we create ~1,500 indices per day. Each index represents daily customer data. Once it is indexed, there are no additional writes. We add an alias with the customer ID, and the index is ready to serve (search queries only).
Due to this use case, we defined two different types of data nodes: data-worker nodes and data-server nodes.
More accurately, for every customer we do the following (a minimal Python sketch of this flow is included after the list):

  1. Create a fresh new index on the data-worker nodes.
  2. Index all docs (in bulks of 400). The size of each document varies between customers (indices).
  3. Relocate all shards to the data-server nodes and wait for completion.
  4. Force-merge segments down to 1 segment and wait for completion.
  5. Change the index replication factor from 0 to 1 and wait for completion.
  6. Flip the alias (remove the customer alias from yesterday's index and add it to the new index).
  7. Delete yesterday's index.
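
Roughly, this is what our Python client does per customer (a minimal sketch assuming the 8.x elasticsearch-py client; the endpoint, the "SERVER" zone value, the timeouts and the example shard count are placeholders, not our exact values):

from datetime import datetime
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://elasticsearch-aud-es-http:9200")  # placeholder endpoint

INDEX_SETTINGS = {
    "number_of_shards": 13,               # placeholder - calculated per customer, ~20-25GB/shard
    "number_of_replicas": 0,
    "refresh_interval": -1,
    "translog.durability": "async",
    "translog.flush_threshold_size": "2gb",
    "translog.sync_interval": "20s",
    "merge.policy.max_merge_at_once": 50,
    "merge.scheduler.max_thread_count": 20,
    "routing.allocation.include.zone": "WORKER",
}

def rebuild_customer_index(customer_id, docs, old_index):
    index = f"{customer_id}_{datetime.utcnow():%y-%m-%d_%H-%M-%S}"

    # 1. Create a fresh index on the data-worker nodes
    es.indices.create(index=index, settings=INDEX_SETTINGS)

    # 2. Bulk-index all documents (batches of 400)
    actions = ({"_index": index, "_source": doc} for doc in docs)
    helpers.bulk(es, actions, chunk_size=400)

    # 3. Relocate all shards to the data-server nodes and wait
    #    ("SERVER" is an assumed zone value for the serving tier)
    es.indices.put_settings(
        index=index,
        settings={"index.routing.allocation.include.zone": "SERVER"},
    )
    es.cluster.health(index=index, wait_for_no_relocating_shards=True, timeout="30m")

    # 4. Force-merge down to a single segment
    es.indices.forcemerge(index=index, max_num_segments=1)

    # 5. Add a replica and wait for green
    es.indices.put_settings(index=index, settings={"index.number_of_replicas": 1})
    es.cluster.health(index=index, wait_for_status="green", timeout="30m")

    # 6. Flip the customer alias from yesterday's index to the new one
    es.indices.update_aliases(actions=[
        {"remove": {"index": old_index, "alias": customer_id}},
        {"add": {"index": index, "alias": customer_id}},
    ])

    # 7. Delete yesterday's index
    es.indices.delete(index=old_index)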

Our serving (search query) rate is really low and all data-server nodes are stable, as well as the master nodes (two dedicated nodes).
On the other hand, our data-worker nodes are usually replaced on a daily basis.
We want to index our data as fast as possible, so we create every index with the following settings:

"number_of_shards": number_of_shards,  # calculate based on the data, average of 20-25GB per shard
"number_of_replicas": 0,
"refresh_interval": -1,
"translog.durability": "async",  # default request
"translog.flush_threshold_size": "2gb",  # default 512mb
"translog.sync_interval": "20s",  # default 5s
"merge.policy.max_merge_at_once": 50,  # default 10
"merge.scheduler.max_thread_count": 20,  # default 2
"routing.allocation.include.zone": "WORKER",

The variety of customers (indices) is huge - from a 1-shard index (a few documents, a few KB) up to a 20-shard index with 900M documents and 500GB.

On the data-worker nodes we also set the following (the dynamic ones can also be changed at runtime - see the sketch after the list):

cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.node_concurrent_recoveries: 16
indices.memory.index_buffer_size: 30%
indices.recovery.max_bytes_per_sec: 60mb
node.attr.zone: WORKER
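
As far as we understand, indices.memory.index_buffer_size and node.attr.zone are static node settings that stay in the node configuration, while the other three are dynamic and can be adjusted (or reverted while we experiment) via the cluster settings API. A minimal sketch with the Python client, assuming the same placeholder endpoint as above:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch-aud-es-http:9200")  # placeholder endpoint

# Dynamic settings - can be tuned back down without restarting nodes
es.cluster.put_settings(persistent={
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.node_concurrent_recoveries": 16,
    "indices.recovery.max_bytes_per_sec": "60mb",
})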

We added these configurations in order to speed up indexing, based on things we found on the internet - if you think something here is wrong, please let us know!
Another thing: monitoring is enabled in this cluster and the .monitoring-es-* indices are created (~5GB per day, somehow only on the data-worker nodes, even though I didn't configure that). I'm not sure what impact that has on the cluster - we saw the recommendation to disable it in a production environment, but we currently use it to rule out CPU/memory/GC issues.
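
If disabling it turns out to be worthwhile, it appears to be a single dynamic setting (a sketch with the Python client - we have not tried this yet):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch-aud-es-http:9200")  # placeholder endpoint

# Stop collection into the .monitoring-* indices cluster-wide
es.cluster.put_settings(persistent={"xpack.monitoring.collection.enabled": False})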

Back to our problem of followers check retry count exceeded - when it happens, a data-worker pod is replaced by a new one, so my dump script (which I triggered manually) stops and I only have the dumps from before the pod went down (because I save them to a persistent volume).
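
For reference, the dump script is essentially the following loop (a reconstruction, not the exact script; the jstack/pgrep paths and the output location on the persistent volume are assumptions):

import subprocess, time
from datetime import datetime

JSTACK = "/usr/share/elasticsearch/jdk/bin/jstack"      # assumed path in the official image
OUT = "/usr/share/elasticsearch/data/thread-dumps.txt"  # assumed file on the persistent volume

# PID of the Elasticsearch JVM inside the pod (assumed lookup)
pid = subprocess.check_output(["pgrep", "-f", "org.elasticsearch.bootstrap"]).split()[0].decode()

while True:
    # Take a thread dump and append it with a timestamp header
    dump = subprocess.run([JSTACK, pid], capture_output=True, text=True).stdout
    with open(OUT, "a") as f:
        f.write(f"==== {datetime.utcnow().isoformat()} ====\n{dump}\n")
    time.sleep(10)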

Can we try to approach this from "the other side"? Any suggestion/tip for an action we can take to see if it has some impact, like changing a configuration setting? (This cluster is not considered production yet, so we can still try things now.)

Thanks again

I still think I don't understand the cause-and-effect here. Does the node/pod stop first (which would explain the node-left logs) or does the node leave the cluster first (and then something else shuts it down)?

Hi @DavidTurner - sorry for disappearing, I tried some other directions (no luck yet).
Your last question confuses me - as far as I understand, the master waits for the data node to respond, starting with 10s and then 30s. Since the data node didn't respond after 30s, the master disconnects it, and that is what makes it restart. Is that right? Am I missing something?

The things that we tried are:

  • adding a dedicated coordinating node between the Client (Python bulk operations) and the data-wrk nodes.
  • Tuning the resources (CPU), because we see in Kibana that it is not zero (not sure what the impact is or whether it's related):