A disconnected message requires that the shutdown happens somewhat gracefully, but when a spot instance dies it may just abruptly drop off the network without an explicit disconnect.
Remember that the fundamental problem here seems to be that nodes are inexplicably shutting down, apparently without logging anything. That's usually the OOM killer, but spot instances would explain it too. This is the first mention of spot instances in this thread.
ok @DavidTurner - we'll give it a try.
We are going to switch the data-wrk nodes to on-demand for the upcoming days.
I will update if that makes any change.
@DavidTurner that was it: spot instances.
It was tricky because in another cluster we used them and it was super stable (they were living for days).
After switching to on-demand, all of our data-wrk nodes are stable (4 days now).
I think you're probably also not configuring the TCP retransmission timeout as the manual recommends. If you do that then you would get a disconnected message more reliably when a node vanishes, as well as the other benefits described in those docs. But yes, spot instances don't work very well with stateful services like Elasticsearch.
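To illustrate why the retransmission limit matters, here is a rough Python sketch. It assumes the setting in question is the Linux `net.ipv4.tcp_retries2` sysctl, and it assumes the usual Linux retransmission bounds of 200 ms minimum and 120 s maximum. The function names are made up for this example. It only estimates the hypothetical time before the kernel gives up on a silent peer; real timeouts depend on the measured round-trip time.

```python
from pathlib import Path

# Assumed location of the sysctl on Linux; not present on other systems.
TCP_RETRIES2_PATH = Path("/proc/sys/net/ipv4/tcp_retries2")


def hypothetical_timeout_ms(retries, rto_min_ms=200, rto_max_ms=120_000):
    """Estimate how long TCP keeps retrying before dropping a dead connection.

    Each wait doubles, starting at rto_min_ms and capped at rto_max_ms.
    The initial wait plus one wait per retransmission is summed.
    """
    total = 0
    rto = rto_min_ms
    for _ in range(retries + 1):
        total += min(rto, rto_max_ms)
        rto *= 2
    return total


def read_tcp_retries2(path=TCP_RETRIES2_PATH):
    """Return the current retransmission limit, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    current = read_tcp_retries2()
    if current is None:
        print("could not read tcp_retries2 (not Linux, or no permission)")
    else:
        seconds = hypothetical_timeout_ms(current) / 1000
        print(f"tcp_retries2={current}, roughly {seconds:.1f}s before giving up")
```

With the common default of 15 retransmissions this estimate comes out at over 900 seconds, which is why a node that vanishes abruptly can go unnoticed at the TCP level for a long time. A lower limit shortens that window considerably.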
We've experienced the same errors in our cluster and data nodes re-joining the cluster.
In our case the root cause of the issue was swap accidentally turned on.
Our root disks are slow AWS EBS disks, so it looks like the nodes were "freezing" during swap operations and fault detection kicked them out of the cluster.
Heavy/slow swapping wouldn't lead to the same symptoms as described in this thread. It may cause nodes to leave the cluster but they would normally rejoin without restarting, having logged messages about slow GC or blocked threads.
Definitely check that swap is disabled, but I don't think it's related.
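A quick way to check for swap on a Linux host is to look at the `SwapTotal` line in `/proc/meminfo`. The following is a minimal sketch under that assumption; the helper name `swap_total_kb` is hypothetical, and it only reports whether swap space is configured, not whether the node is actively swapping.

```python
from pathlib import Path


def swap_total_kb(meminfo_text):
    """Return the SwapTotal value in kB from /proc/meminfo text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like "SwapTotal:       2097148 kB".
            return int(line.split()[1])
    return None


if __name__ == "__main__":
    try:
        total = swap_total_kb(Path("/proc/meminfo").read_text())
    except OSError:
        total = None
    if total is None:
        print("could not determine swap size (is this Linux?)")
    elif total == 0:
        print("no swap configured")
    else:
        print(f"swap is configured: {total} kB")
```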