Master election takes minutes

I recently upgraded my 3 master-eligible nodes from Ubuntu 16.04 to 18.04. The elected master changed twice during this procedure, and I noticed that these two master elections took about 1 minute and 4 minutes respectively, which is longer than I expected.

Does anyone know why it is taking this long? Previously it was only a few seconds.

My master configuration looks like this for master0 (the other two are similar):

node.name: master0
node.data: false
node.master: true
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host:
  - _site_
  - _local_
discovery.zen.ping.unicast.hosts: ["data3.example.com", "data4.example.com", "data5.example.com", "master0.example.com", "master1.example.com", "master2.example.com"]
cluster.initial_master_nodes:
  - master0
  - master1
  - master2
discovery.zen.minimum_master_nodes: 2
xpack.security.enabled: false
xpack.watcher.enabled: false
xpack.ml.enabled: false
xpack.monitoring.collection.enabled: true

Here is some additional information and my own observations:

  • I am using Elasticsearch 7.5.2. Of course one could suggest upgrading first, but I'm trying to understand what is going wrong before I do.
  • I still use discovery.zen.ping.unicast.hosts and discovery.zen.minimum_master_nodes from a previous Elasticsearch 6.7 upgrade. These settings are deprecated: the former is mapped to discovery.seed_hosts and the latter is ignored. It seems to me that they should not play a role here (see the first sketch after this list for the 7.x-native equivalent I have in mind).
  • The discovery.zen.ping.unicast.hosts list also contains a couple of data nodes. I understand that on Elasticsearch 7.x these are ignored for discovery, so this should not play a role either, I guess.
  • I left cluster.initial_master_nodes in, even though it is only required when the cluster has not been formed yet. I assumed it is ignored after that.
  • My master nodes use DHCP (not my call), but fortunately the DHCP server gives them a fixed IP address. Ubuntu and Debian use a trick for such hosts where the hostname resolves to 127.0.1.1 in /etc/hosts. Elasticsearch is not listening on that address, though, because I configured network.host: _local_. On the other hand, the documentation says that _local_ means "any loopback addresses on the system, for example 127.0.0.1", which you could interpret to include 127.0.1.1 too. Could this be an issue? The master1 log mentions something about discovery using 127.0.1.1:9300. Perhaps I should change the Elasticsearch config to use network.host: 0.0.0.0 instead? (See the second sketch after this list for an alternative I am considering.)
  • Master nodes are running on VMware with 4 vCPUs and 16 GB RAM. I don't know about the exact storage but assume that it is host attached. Data nodes are dedicated hardware with NVME SSDs.
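For reference, this is my understanding of what the 7.x-native equivalent of those discovery settings would look like once I clean them up (I have not applied this yet; the host names are just my existing masters):

discovery.seed_hosts:
  - master0.example.com
  - master1.example.com
  - master2.example.com
# discovery.zen.minimum_master_nodes is ignored on 7.x and can be dropped
# cluster.initial_master_nodes is only needed for the very first cluster bootstrap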
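And this is roughly the alternative I have in mind for the binding question, i.e. binding and publishing explicit addresses instead of relying on _local_ (again just a sketch, not something I have tried yet):

network.host:
  - _site_
  - 127.0.0.1                    # explicit loopback instead of _local_
network.publish_host: _site_     # advertise the site-local address so other nodes never see 127.0.1.1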

I have included logs of the 3 master nodes below. What I did was:

  • Stopped Elasticsearch on master2. Master0 remained master. I upgraded the OS of master2 at 13:54:46, rebooted and started Elasticsearch.
  • Stopped Elasticsearch on master0 at 14:29:07. Master1 appears to be master at 14:30:16. I upgraded the OS of master0, rebooted and started Elasticsearch.
  • Stopped Elasticsearch on master1 at 15:39:23. After a few minutes a new master had still not been elected, so I started master1 again at 15:43:11. Master2 appears to be master at 15:44:29. I then stopped Elasticsearch on master1 again, upgraded its OS, rebooted and started Elasticsearch.

Thanks for reading to the end!

[2021-05-03T14:49:06,641][WARN ][o.e.g.IncrementalClusterStateWriter] [master0] writing cluster state took [48023ms] which is above the warn threshold of [10s]; wrote metadata for [4361] indices and skipped [0] unchanged indices

This message indicates two problems:

  • you have over 4000 indices
  • writing a few kB of metadata for those indices took nearly a minute

I suggest upgrading to pick up #50907, which will streamline things a bit, but the fundamental problem seems to be that you have too many indices and that the disks on your master nodes are too slow.


Thanks, David. It's really appreciated. I had read your previous post but wasn't quite sure if it was a similar situation.

Do you still suggest increasing cluster.publish.timeout and/or cluster.join.timeout as a temporary workaround until we upgrade?
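In other words, would something along these lines in elasticsearch.yml on the master-eligible nodes be reasonable? (The values are just my guess, picked to comfortably exceed the ~1 and ~4 minute elections I saw.)

cluster.publish.timeout: 300s   # default is 30s
cluster.join.timeout: 300s      # default is 60s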

I was aware that we have a suboptimal number of shards for several of our indices. But I did not realise that the number of indices has this much of an impact on writing the cluster state to disk.

I am a bit surprised that writing a few kB per index would take this much time and that my disks are too slow, especially because the documentation says that "cluster state updates are typically published as diffs to the previous cluster state".

Yes, your situation sounds similar to that older post; increasing those timeouts might help a bit.

It's true that cluster state updates are typically published as diffs, but the first update after an election is not typical (it is more likely that it cannot use the diff mechanism). In any case the issue isn't in how the update is published, it's in how the new state is written to disk: to avoid some subtle failure cases we have to re-write everything after an election.
