Masters leave for no apparent reason

I have been using Elasticsearch for several years now. Last year I tried to update to 5.x but had heaps of issues getting it to even run in a Docker environment. I stayed with 2.4 up until now and everything has been running smoothly.

A few weeks ago I updated the host OS of the Docker Swarm from Ubuntu 16.04 to Ubuntu 18.04. I didn't change anything with Elasticsearch, still using the same Docker images I've been using for over a year.

Since the OS update there have been frequent occasions where a seemingly random node in the 3-node cluster drops all its shards, sending the cluster to yellow. A few times per day two nodes drop at once and the cluster goes to red.

Many other services run in the same swarm and never have any difficulty with the network. The only issues I've had since the host OS upgrade have been with Elasticsearch.

Today I completed moving everything over to an Elasticsearch 6.3 cluster, hoping that the updates to Elasticsearch would fix whatever is causing the nodes to drop.

Same problem.

The nodes are in the same datacentre.

Keepalive settings on host OS:
net.ipv4.tcp_keepalive_time=180
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=5
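
For anyone wanting to replicate the setup: these are just plain kernel sysctls, nothing Elasticsearch-specific. A rough sketch of how I apply them persistently (the file name is my own choice):

# /etc/sysctl.d/99-tcp-keepalive.conf
net.ipv4.tcp_keepalive_time=180
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=5

# reload all sysctl config files without a reboot
sudo sysctl --system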

I'm hoping someone who understands more about the internals of Elasticsearch will be able to work out what the issue is from the logs below. You'll see that at the time of these logs I had initiated a snapshot. I don't know if that caused the drop, but ALL of the previous drops over the last few weeks have been unprovoked. After I got the cluster back to a decent state I ran the snapshot again and it completed just fine. You'll also notice that the log entry before the error was an hour earlier, so there was no warning before the drop.
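
For context, the snapshot was started with the standard snapshot create API, roughly like this (repository and snapshot names are taken from the logs below; the host is just a placeholder):

curl -XPUT 'http://localhost:9200/_snapshot/s3-backup/production-20180725063205?wait_for_completion=false'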

Jul 25 14:59:56 elasticsearch6-2 prod_elasticsearch6-2.1.z6cdi883dcrrquzyeczys3nu3 INFO [o.e.c.r.a.AllocationService] [elasticsearch6-2] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[suburbs1][11]] ...]).

Jul 25 15:33:17 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm WARN [o.e.m.j.JvmGcMonitorService] [elasticsearch6-3] [gc][2842] overhead, spent [866ms] collecting in the last [1.4s]

Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm INFO [o.e.d.z.ZenDiscovery] [elasticsearch6-3] master_left [{elasticsearch6-2}{CARsLMAnQh-UdEpAxx_6Cw}{Hm5hXYqTQz2Wy7lMU9DMow}{10.0.2.248}{10.0.2.248:9300}], reason [no longer master]
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info org.elasticsearch.transport.RemoteTransportException: [elasticsearch6-2][10.0.2.249:9300][internal:discovery/zen/fd/master_ping]
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info Caused by: org.elasticsearch.cluster.NotMasterException: local node is not master
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm WARN [o.e.d.z.ZenDiscovery] [elasticsearch6-3] master left (reason = no longer master), current nodes: nodes:
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info   {elasticsearch6-2}{CARsLMAnQh-UdEpAxx_6Cw}{Hm5hXYqTQz2Wy7lMU9DMow}{10.0.2.248}{10.0.2.248:9300}, master
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info   {elasticsearch6-1}{IaWLCp9eTd-aHn5eewaM4Q}{U816MhOjTt-6vcWyCCZ2Ew}{10.0.2.253}{10.0.2.253:9300}
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info   {elasticsearch6-3}{V0D1s_oIR5aHGox7LoHUGg}{EtQLhJhmTZi74_2DVo3QyA}{10.0.2.250}{10.0.2.250:9300}, local
Jul 25 16:32:45 elasticsearch6-3 prod_elasticsearch6-3.1.g4is7belylt9ji46fwpv3cihm info 
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u WARN [r.suppressed] path: /_snapshot/s3-backup/production-20180725063205, params: {repository=s3-backup, wait_for_completion=false, snapshot=production-20180725063205}
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info org.elasticsearch.transport.RemoteTransportException: [elasticsearch6-2][10.0.2.249:9300][cluster:admin/snapshot/create]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info Caused by: org.elasticsearch.discovery.MasterNotDiscoveredException: FailedToCommitClusterStateException[timed out while waiting for enough masters to ack sent cluster state. [1] left]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:223) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info 	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info 	at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:145) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info 	at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:117) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.retry(TransportMasterNodeAction.java:208) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.access$500(TransportMasterNodeAction.java:108) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info 	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.onFailure(TransportMasterNodeAction.java:165) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info 	at org.elasticsearch.action.admin.cluster.snapshots.create.TransportCreateSnapshotAction$1.onFailure(TransportCreateSnapshotAction.java:112) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info 	at org.elasticsearch.snapshots.SnapshotsService$1.onFailure(SnapshotsService.java:274) ~[elasticsearch-6.3.1.jar:6.3.1]
Jul 25 16:32:47 elasticsearch6-1 prod_elasticsearch6-1.1.bzj8mpxn3oipmw4hasa4qzd8u info 	at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.onFailure(MasterService.java:467) ~[elasticsearch-6.3.1.jar:6.3.1]

I had a similar issue and found it was one of two things. I changed both and it stopped. First, I increased the JVM heap allocation from 8GB to 10GB on a node with 16GB of RAM running Ubuntu 16.04 Server. Whatever level you're at, try bumping it up if you can and see whether the new version needs more memory. In my case there were definitely garbage collection issues causing Netty problems, so the increased heap mitigated them.
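
For reference, the heap change is just the standard Xms/Xmx pair, something like this in config/jvm.options, or via ES_JAVA_OPTS if you run in Docker like the OP does (10g is what worked for my 16GB node, not a universal value):

# config/jvm.options
-Xms10g
-Xmx10g

# or, for a Docker container
ES_JAVA_OPTS="-Xms10g -Xmx10g"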

Secondly, I had one node with far fewer resources than the others, so I was using settings in elasticsearch.yml to explicitly assign the data and master roles. I removed that underpowered node because it caused too many issues whenever it was elected master, and then removed all the other data/master role settings (see the sketch below). It works correctly on its own now. So if you set those manually as I did, you might try removing them and see if the system fixes itself by magic. Sometimes nobody can explain why it breaks, so I'll accept when nobody can explain why it works...
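
The settings I removed were just the explicit role flags in elasticsearch.yml, along these lines (the exact true/false values varied per node in my setup). Deleting them puts every node back on the defaults, where it is both master-eligible and a data node:

# elasticsearch.yml (per node; values below are illustrative)
node.master: true
node.data: false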

Thanks @michaelm14. It seems the issue I hit right after updating to Elasticsearch 6.3 might be unrelated to the frequent drops I was seeing on 2.4. I haven't seen the drops come back for about 12 hours now. I guess the cluster going to shizzle when I kicked off the snapshot the first time was an Elasticsearch bug.

Sometimes nobody can explain why it breaks, so I'll accept when nobody can explain why it works

Haha yeah, I know your pain. Elasticsearch is the only database I've ever managed that is so brittle.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.