Publication of cluster state fails - followers check retry count exceeded

A `disconnected` message requires that the shutdown happens somewhat gracefully, but when a spot instance dies it may just abruptly drop off the network without an explicit disconnect.

Remember that the fundamental problem here seems to be that nodes are inexplicably shutting down, apparently without logging anything. That's usually the OOM killer, but spot instances would explain it too. This is the first mention of spot instances in this thread.

ok @DavidTurner - we'll give it a try.
We are going to switch the data-wrk nodes to on-demand for the upcoming days.
I will update if that makes any change.

Thank you

@DavidTurner that was it - spot instances :frowning:
It was tricky because we used them in another cluster and it was super stable (they were living for days).
After switching to on-demand all of our data-wrk nodes are stable (4 days now) :slight_smile:

Thanks a LOT!

Makes sense, thanks for confirming :slight_smile:

I think you're probably also not configuring the TCP retransmission timeout as the manual recommends. If you do that then you would get a disconnected message more reliably when a node vanishes, as well as the other benefits described in those docs. But yes, spot instances don't work very well with stateful services like Elasticsearch.
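
A minimal sketch of checking this setting on a Linux node. It assumes the value of 5 for `net.ipv4.tcp_retries2` that the Elasticsearch docs suggest (the usual kernel default is 15). The helper names are made up for illustration, so treat it as a starting point and check the manual for your version:

```python
from pathlib import Path

# Assumption: 5 is the value suggested in the Elasticsearch docs for Linux;
# the usual kernel default is 15.
RECOMMENDED_TCP_RETRIES2 = 5
SYSCTL_PATH = Path("/proc/sys/net/ipv4/tcp_retries2")


def parse_tcp_retries2(text: str) -> int:
    """Parse the contents of /proc/sys/net/ipv4/tcp_retries2."""
    return int(text.strip())


def is_within_recommendation(value: int, recommended: int = RECOMMENDED_TCP_RETRIES2) -> bool:
    """True if the retransmission count is at or below the recommended value."""
    return value <= recommended


if __name__ == "__main__":
    if SYSCTL_PATH.exists():
        current = parse_tcp_retries2(SYSCTL_PATH.read_text())
        if is_within_recommendation(current):
            print(f"tcp_retries2={current}: OK")
        else:
            print(f"tcp_retries2={current}: consider lowering to {RECOMMENDED_TCP_RETRIES2}")
    else:
        print("tcp_retries2 not found (not Linux?)")
```

This only reads the current value. Changing it is normally done with `sysctl` and needs root.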

We've seen the same errors in our cluster, along with data nodes re-joining the cluster.
In our case the root cause of the issue was swap being accidentally turned on.
Our root disks are slow AWS EBS disks, so it looks like nodes were "freezing" during swap operations and fault detection kicked them from the cluster.

thanks @kkn87 - how did you find it? (I want to make sure mine is configured correctly)

Heavy/slow swapping wouldn't lead to the same symptoms as described in this thread. It may cause nodes to leave the cluster but they would normally rejoin without restarting, having logged messages about slow GC or blocked threads.

Definitely check that swap is disabled, but I don't think it's related.

thanks @DavidTurner - what's the best way to ensure that? :slight_smile:

The Unix & Linux Stack Exchange question "How can I check if swap is active from the command line?" gives a good selection of ways to do that.
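
As one illustration, here is a small sketch that reads the `SwapTotal` line from `/proc/meminfo` on Linux. A value of 0 kB means no swap is configured. The function name is hypothetical, and this is just one of several equivalent checks:

```python
from pathlib import Path

MEMINFO_PATH = Path("/proc/meminfo")


def swap_total_kib(meminfo_text: str) -> int:
    """Return the SwapTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Line looks like: "SwapTotal:       2097148 kB"
            return int(line.split()[1])
    raise ValueError("SwapTotal not found")


if __name__ == "__main__":
    if MEMINFO_PATH.exists():
        total = swap_total_kib(MEMINFO_PATH.read_text())
        print("swap is disabled" if total == 0 else f"swap is enabled: {total} kB")
    else:
        print("/proc/meminfo not found (not Linux?)")
```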

I've discovered high swapping during node outages using the atop tool (Atoptool/atop on GitHub, a system and process monitor for Linux), but any monitoring tool can help.
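
If no monitoring tool is at hand, a rough sketch of the same idea is to sample the cumulative `pswpin` and `pswpout` counters in `/proc/vmstat` twice and compare them. This is not how atop itself works, only a quick manual check. The function names and the one-second interval are arbitrary choices for illustration:

```python
import time
from pathlib import Path

VMSTAT_PATH = Path("/proc/vmstat")
SAMPLE_INTERVAL_SECONDS = 1.0  # arbitrary; use a longer window in practice


def parse_swap_counters(vmstat_text: str) -> dict:
    """Extract cumulative pswpin/pswpout page counts from /proc/vmstat text."""
    counters = {"pswpin": 0, "pswpout": 0}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in counters:
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before: dict, after: dict) -> dict:
    """Pages swapped in and out between two samples."""
    return {key: after[key] - before[key] for key in before}


if __name__ == "__main__":
    if VMSTAT_PATH.exists():
        first = parse_swap_counters(VMSTAT_PATH.read_text())
        time.sleep(SAMPLE_INTERVAL_SECONDS)
        second = parse_swap_counters(VMSTAT_PATH.read_text())
        # Non-zero deltas mean pages were swapped during the interval.
        print(swap_delta(first, second))
    else:
        print("/proc/vmstat not found (not Linux?)")
```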
