I have been using Elasticsearch for several years now. Last year I tried to upgrade to 5.x but had heaps of issues even getting it to run in a Docker environment, so I stayed on 2.4 up until now and everything has been running smoothly.
A few weeks ago I updated the host OS of the Docker Swarm from Ubuntu 16.04 to Ubuntu 18.04. I didn't change anything with Elasticsearch, still using the same Docker images I've been using for over a year.
Since the OS update there have been frequent occasions where a seemingly random node in the 3-node cluster drops all its shards, sending the cluster to yellow. A few times per day two nodes drop at once and the cluster goes to red.
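(For context, "yellow" and "red" here are the statuses reported by the standard cluster health API; I'm watching it with roughly the calls below, where localhost:9200 stands in for whichever node I happen to query.)

# overall cluster status: green / yellow / red
curl -s 'http://localhost:9200/_cluster/health?pretty'

# which nodes are currently part of the cluster
curl -s 'http://localhost:9200/_cat/nodes?v'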
Many other services run in the same swarm and never have any difficulty with the network. The only issues I've had since the host OS upgrade have been with Elasticsearch.
Today I completed moving everything over to an Elasticsearch 6.3 cluster, hoping that the updates to Elasticsearch would fix whatever is causing the nodes to drop.
Same problem.
The nodes are in the same datacentre.
Keepalive settings on host OS:
net.ipv4.tcp_keepalive_time=180
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=5
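These are kernel parameters set via sysctl on each Docker host; the effective values can be confirmed with the standard sysctl tool, e.g.:

# print the running keepalive values on a host
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes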
I'm hoping someone who understands more about the internals of Elasticsearch will be able to work out what the issue is from the logs below. You'll see that at the time of these logs I had initiated a snapshot. I don't know whether that caused the drop, but ALL previous drops over the last few weeks have been unprovoked. After I got the cluster back to a decent state I ran the snapshot again and it completed just fine. You'll also notice that the last log entry before the error was an hour earlier, so there was no warning at all before the drop.
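For reference, the snapshot was kicked off with a plain create-snapshot request against one of the nodes, roughly the call below (localhost:9200 is a placeholder; the repository and snapshot names are the ones that appear in the logs):

# start the snapshot without waiting for it to finish
curl -s -XPUT 'http://localhost:9200/_snapshot/s3-backup/production-20180725063205?wait_for_completion=false'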
Jul 25 14:59:56 elasticsearch6-2 prod_elasticsearch6-2.1.z6cdi883dcrrquzyeczys3nu3 INFO [o.e.c.r.a.AllocationService] [elasticsearch6-2] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[suburbs1][11]] ...]).
Jul 25 15:33:17 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm WARN [o.e.m.j.JvmGcMonitorService] [elasticsearch6-3] [gc][2842] overhead, spent [866ms] collecting in the last [1.4s]
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm INFO [o.e.d.z.ZenDiscovery] [elasticsearch6-3] master_left [{elasticsearch6-2}{CARsLMAnQh-UdEpAxx_6Cw}{Hm5hXYqTQz2Wy7lMU9DMow}{10.0.2.248}{10.0.2.248:9300}], reason [no longer master]
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info org.elasticsearch.transport.RemoteTransportException: [elasticsearch6-2][10.0.2.249:9300][internal:discovery/zen/fd/master_ping]
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info Caused by: org.elasticsearch.cluster.NotMasterException: local node is not master
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm WARN [o.e.d.z.ZenDiscovery] [elasticsearch6-3] master left (reason = no longer master), current nodes: nodes:
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info {elasticsearch6-2}{CARsLMAnQh-UdEpAxx_6Cw}{Hm5hXYqTQz2Wy7lMU9DMow}{10.0.2.248}{10.0.2.248:9300}, master
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info {elasticsearch6-1}{IaWLCp9eTd-aHn5eewaM4Q}{U816MhOjTt-6vcWyCCZ2Ew}{10.0.2.253}{10.0.2.253:9300}
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info {elasticsearch6-3}{V0D1s_oIR5aHGox7LoHUGg}{EtQLhJhmTZi74_2DVo3QyA}{10.0.2.250}{10.0.2.250:9300}, local
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u WARN [r.suppressed] path: /_snapshot/s3-backup/production-20180725063205, params: {repository=s3-backup, wait_for_completion=false, snapshot=production-20180725063205}
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info org.elasticsearch.transport.RemoteTransportException: [elasticsearch6-2][10.0.2.249:9300][cluster:admin/snapshot/create]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info Caused by: org.elasticsearch.discovery.MasterNotDiscoveredException: FailedToCommitClusterStateException[timed out while waiting for enough masters to ack sent cluster state. [1] left]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:223) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:145) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:117) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.retry(TransportMasterNodeAction.java:208) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.access$500(TransportMasterNodeAction.java:108) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onFailure(TransportMasterNodeAction.java:165) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info at org.elasticsearch.action.admin.cluster.snapshots.create.TransportCreateSnapshotAction$1.onFailure(TransportCreateSnapshotAction.java:112) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info at org.elasticsearch.snapshots.SnapshotsService$1.onFailure(SnapshotsService.java:274) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.onFailure(MasterService.java:467) ~[elasticsearch-6.3.1.jar:6.3.1]