Rolling restart master election takes 2min

Hi!
On my Elasticsearch 5.3.2 cluster it takes nearly 2min to elect a new master. My scenario is a rolling restart where the master node is restarted last. Once it is restarted, the cluster "panics" and searches for a new master, which takes about 2min.
The log repeats a MasterNotDiscoveredException until the new master is found.
I've added the log of my restart script below; it should make clear what is going on (I hope):

2017-10-16 15:54:13,874 INFO     Restarting loc2elastic2-test-work2-elk-awtest1 (5 of 5 nodes)
2017-10-16 15:54:13,976 INFO     Changing shard allocation to none resulted in: {"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"none"}}}}}
2017-10-16 15:54:13,976 INFO     Restarting ES Node: loc2elastic2-test-work2-elk-awtest1
2017-10-16 15:54:14,140 INFO     Connected (version 2.0, client OpenSSH_6.6.1p1)
2017-10-16 15:54:14,385 INFO     Authentication (publickey) successful!
2017-10-16 15:54:14,723 INFO     detected service name: elasticsearch-test-work2-elk-awtest1
2017-10-16 15:54:18,206 INFO      * Stopping Elasticsearch Server test-work2-elk-awtest1
...done.
* Starting Elasticsearch Server test-work2-elk-awtest1
[2017-10-16T13:54:17,743][WARN ][o.e.c.l.LogConfigurator  ] ignoring unsupported logging configuration file [/etc/elasticsearch/test-work2-elk-awtest1/logging.yml], logging is configured via [/etc/elasticsearch/test-work2-elk-awtest1/log4j2.properties]
...done.

2017-10-16 15:54:18,299 INFO     Server loc2elastic2-test-work2-elk-awtest1 has not joined the cluster yet. Waiting 5 more seconds.
2017-10-16 15:54:23,393 INFO     Server loc2elastic2-test-work2-elk-awtest1 has not joined the cluster yet. Waiting 5 more seconds.
2017-10-16 15:54:28,612 INFO     Server loc2elastic2-test-work2-elk-awtest1 has joined the cluster.
2017-10-16 15:54:58,710 INFO     Changing shard allocation to all resulted in: {"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
2017-10-16 15:54:58,710 INFO     Retrying...
2017-10-16 15:55:28,819 INFO     Changing shard allocation to all resulted in: {"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
2017-10-16 15:55:28,819 INFO     Retrying...
2017-10-16 15:55:58,913 INFO     Changing shard allocation to all resulted in: {"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
2017-10-16 15:55:58,913 INFO     Retrying...
2017-10-16 15:56:21,683 INFO     Changing shard allocation to all resulted in: {"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}
2017-10-16 15:56:21,879 INFO     Waiting for green, current cluster state is: yellow
2017-10-16 15:56:26,981 INFO     Waiting for green, current cluster state is: yellow
2017-10-16 15:56:32,162 INFO     Waiting for green, current cluster state is: green
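
For reference, the script is essentially issuing the equivalent of these API calls (a sketch of the same steps; substitute whichever node your tooling talks to):

# disable shard allocation before restarting a node
curl -s -XPUT "http://$( hostname -f ):9200/_cluster/settings" -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'

# ... restart the Elasticsearch service on the node, wait for it to rejoin ...

# re-enable shard allocation once the node has rejoined
curl -s -XPUT "http://$( hostname -f ):9200/_cluster/settings" -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'

# wait until the cluster reports green
curl -s "http://$( hostname -f ):9200/_cluster/health?pretty"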

Any idea if I have configured something wrong?

Take care,
Alex

Hi Alex,

What does your Elasticsearch config look like? Also, what does something like curl -s 'ES-HOST:9200/_cat/nodes' show?

It almost sounds like you only have one master-eligible node in the cluster, or maybe there are firewall rules blocking the other master-eligible nodes.
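
If you want to rule out the firewall theory, check that each node can reach the others on the transport port (9300 by default). Something like this (hostname is just a placeholder) should connect from every node to every other node:

nc -vz es-host1 9300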

-AB

P.S. This is what I see in one of my ES clusters:

# curl -s 'es-host0:9200/_cat/nodes'
10.0.10.95 31 99 4 2.02 2.08 2.07 di - es-host1-es-00
10.0.10.97 60 96 8 4.35 3.20 2.71 md * es-host0-es-03
10.0.10.97 25 96 8 4.35 3.20 2.71 di - es-host0-es-02
10.0.10.95 65 99 4 2.02 2.08 2.07 di - es-host1-es-01
10.0.10.95 64 99 3 2.02 2.08 2.07 di - es-host1-es-02
10.0.10.97 68 96 8 4.35 3.20 2.71 md - es-host0-es-04
10.0.10.97 44 96 7 4.35 3.20 2.71 di - es-host0-es-01
10.0.10.95 41 99 4 2.02 2.08 2.07 md - es-host1-es-03
10.0.10.95 34 99 4 2.02 2.08 2.07 md - es-host1-es-04
10.0.10.97 38 96 7 4.35 3.20 2.71 di - es-host0-es-00

The * marks the current master (second row, 9th column).
An m in the 8th column means the node is master-eligible.
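
If counting columns is fiddly, you can also ask the cat API for named headers explicitly, e.g.:

curl -s 'es-host0:9200/_cat/nodes?v&h=ip,node.role,master,name'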

Hi,

and thanks for your answer. All of my nodes are master-eligible:

root@loc2elastic0.elk.awtest1.work2.test.be-stg1:~# curl http://$( hostname -f ):9200/_cat/nodes
10.31.253.31 14 73 0 0.01 0.02 0.00 mdi - loc2elastic4-test-work2-elk-awtest1
10.31.253.17 13 72 0 0.00 0.00 0.00 mdi - loc2elastic2-test-work2-elk-awtest1
10.31.253.14 13 74 0 0.05 0.03 0.00 mdi - loc2elastic0-test-work2-elk-awtest1
10.31.253.18  8 70 0 0.00 0.00 0.00 mdi * loc2elastic3-test-work2-elk-awtest1
10.31.253.16 16 72 0 0.00 0.00 0.00 mdi - loc2elastic1-test-work2-elk-awtest1

The config looks like this (I removed all comments and anonymized the hostnames):

---
bootstrap.memory_lock: true
cluster.name: test-work2-elk-awtest1
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts:
- loc2elastic1.elk.awtest1.work2.test.xxx
- loc2elastic2.elk.awtest1.work2.test.xxx
- loc2elastic3.elk.awtest1.work2.test.xxx
- loc2elastic4.elk.awtest1.work2.test.xxx
discovery.zen.ping_timeout: 60s
http.host: 10.31.253.14
http.port: 9200
network.host: 10.31.253.14
network.publish_host: 10.31.253.14
node.data: true
node.ingest: true
node.master: true
node.name: loc2elastic0-test-work2-elk-awtest1
path.data: "/var/lib/elasticsearch/data/test-work2-elk-awtest1"
path.logs: "/var/log/elasticsearch/test-work2-elk-awtest1"
thread_pool.bulk.queue_size: 500
xpack.security.enabled: false

Does that help? The configuration comes mostly from an Elasticsearch 2 cluster; I think we adapted it to 5 where necessary.
There are no firewalls between the Elasticsearch nodes!
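
If it helps, here is how I would dump the settings the node actually runs with (a sketch, querying the local node via the http.host from the config above):

curl -s 'http://10.31.253.14:9200/_nodes/_local/settings?pretty'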

Thanks for your help,
Alex

Not sure if it's just a copy/paste error, but the host names in your discovery.zen.ping.unicast.hosts do not match the host names in your curl output.

Actually no.

The curl output has:
loc2elastic4-test-work2-elk-awtest1

the config has the FQDN:
loc2elastic1.elk.awtest1.work2.test.(datacenter).(company).io

So the node.name is not equal to the hostname. Is that a problem?
In discovery.zen.ping.unicast.hosts I have listed all the hostnames (not the names configured with node.name!).

It's just me mixing stuff up... They don't match in my setup either :stuck_out_tongue:

But the hostnames in your Elasticsearch config do have to resolve via DNS.
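
A quick check on each node, using one of the hostnames from your config (.xxx standing in for the real domain):

getent hosts loc2elastic1.elk.awtest1.work2.test.xxx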

The hostnames are all resolvable :slight_smile: and the cluster is green before the restart.
