Publication of cluster state fails - followers check retry count exceeded

DavidTurner · May 18, 2023, 12:06pm

A disconnected message requires that the shutdown happens somewhat gracefully, but when a spot instance dies it may just abruptly drop off the network without an explicit disconnect.

Remember that the fundamental problem here seems to be that nodes are inexplicably shutting down, apparently without logging anything. That's usually the OOM killer, but spot instances would explain it too. This is the first mention of spot instances in this thread.

Itay_Bittan · May 18, 2023, 12:19pm

ok @DavidTurner - we'll give it a try.
We are going to switch the data-wrk nodes to on-demand for the next few days.
I will update if that makes any difference.

Thank you

Itay_Bittan · May 22, 2023, 9:36am

@DavidTurner that was it - spot instances.
It was tricky because in another cluster we used them and it was super stable (they were living for days).
After switching to on-demand, all of our data-wrk nodes are stable (4 days now).

Thanks a LOT!

DavidTurner · May 22, 2023, 9:48am

Makes sense, thanks for confirming

I think you're probably also not configuring the TCP retransmission timeout as the manual recommends. If you do that then you would get a disconnected message more reliably when a node vanishes, as well as the other benefits described in those docs. But yes, spot instances don't work very well with stateful services like Elasticsearch.
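As a rough illustration of checking that setting on Linux, here is a minimal sketch. It assumes the manual's guidance is to lower `net.ipv4.tcp_retries2` to 5 (the usual Linux default is 15), so verify the value against the docs for your version. The helper names are hypothetical and the script only reads the current value, it does not change it.

```python
from pathlib import Path

# Assumption: 5 is the value suggested by the Elasticsearch docs for
# net.ipv4.tcp_retries2; check the manual for your version.
RECOMMENDED_MAX = 5


def parse_tcp_retries2(text):
    """Parse the contents of /proc/sys/net/ipv4/tcp_retries2."""
    return int(text.strip())


def is_within_recommendation(value, limit=RECOMMENDED_MAX):
    """True if the retransmission count is at or below the assumed limit."""
    return value <= limit


def read_tcp_retries2(path="/proc/sys/net/ipv4/tcp_retries2"):
    """Return the current value, or None if the file is missing (non-Linux)."""
    p = Path(path)
    if not p.exists():
        return None
    return parse_tcp_retries2(p.read_text())


if __name__ == "__main__":
    current = read_tcp_retries2()
    if current is None:
        print("tcp_retries2 not found; this check only applies to Linux")
    elif is_within_recommendation(current):
        print(f"tcp_retries2={current}: within the assumed recommendation")
    else:
        print(f"tcp_retries2={current}: higher than {RECOMMENDED_MAX}, "
              "a vanished node may take a long time to be detected")
```

Changing the value is a separate step (typically via `sysctl` and a persistent sysctl config file), which the sketch deliberately leaves out.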

kkn87 · June 5, 2023, 3:23pm

We've experienced the same errors in our cluster and data nodes re-joining the cluster.
In our case the root cause of the issue was swap accidentally turned on.
Our root disks are slow AWS EBS disks, so it looks like the nodes were "freezing" during swap operations and fault detection kicked them out of the cluster.

Itay_Bittan · June 18, 2023, 9:49am

thanks @kkn87 - how did you find it? (I want to make sure mine is configured correctly)

DavidTurner · June 18, 2023, 10:26am

Heavy/slow swapping wouldn't lead to the same symptoms as described in this thread. It may cause nodes to leave the cluster but they would normally rejoin without restarting, having logged messages about slow GC or blocked threads.

Definitely check that swap is disabled, but I don't think it's related.

Itay_Bittan · June 19, 2023, 5:40am

thanks @DavidTurner - what's the best way to ensure that?

DavidTurner · June 19, 2023, 7:38am

"How can I check if swap is active from the command line?" on Unix & Linux Stack Exchange gives a good selection of ways to do that.
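One of those ways can be sketched in a few lines, assuming a Linux host where `/proc/meminfo` and `/proc/swaps` are readable. The function names here are hypothetical, and this only reports whether swap is configured, not how heavily it is being used.

```python
from pathlib import Path


def swap_total_kib(meminfo_text):
    """Return the SwapTotal value (in kB) from /proc/meminfo text, or 0."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    return 0


def active_swap_devices(swaps_text):
    """Return device/file names from /proc/swaps text, skipping the header."""
    lines = swaps_text.splitlines()
    return [line.split()[0] for line in lines[1:] if line.strip()]


def read_or_empty(path):
    """Read a file, returning an empty string if it cannot be read."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    total = swap_total_kib(read_or_empty("/proc/meminfo"))
    devices = active_swap_devices(read_or_empty("/proc/swaps"))
    if total == 0 and not devices:
        print("No swap configured (or not a Linux host)")
    else:
        print(f"Swap is configured: {total} kB total on {devices}")
```

An empty device list together with a `SwapTotal` of 0 suggests swap is off on that node.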

kkn87 · June 20, 2023, 4:31pm

I discovered heavy swapping during node outages using the atop tool (GitHub - Atoptool/atop: System and process monitor for Linux), but any monitoring tool can help.

