Rolling restart master election takes 2min

Hi!
On my Elasticsearch 5.3.2 cluster it takes nearly 2min to elect a new master. My scenario is a rolling restart where the master node is restarted last. Once it is restarted, the cluster "panics" and searches for a new master, which takes about 2min.
The log repeats a MasterNotDiscoveredException until the new master is found.
I've added the log of my restart script below; it should make clear what is going on (I hope):

2017-10-16 15:54:13,874 INFO     Restarting loc2elastic2-test-work2-elk-awtest1 (5 of 5 nodes)
2017-10-16 15:54:13,976 INFO     Changing shard allocation to none resulted in: {"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"none"}}}}}
2017-10-16 15:54:13,976 INFO     Restarting ES Node: loc2elastic2-test-work2-elk-awtest1
2017-10-16 15:54:14,140 INFO     Connected (version 2.0, client OpenSSH_6.6.1p1)
2017-10-16 15:54:14,385 INFO     Authentication (publickey) successful!
2017-10-16 15:54:14,723 INFO     detected service name: elasticsearch-test-work2-elk-awtest1
2017-10-16 15:54:18,206 INFO      * Stopping Elasticsearch Server test-work2-elk-awtest1
...done.
* Starting Elasticsearch Server test-work2-elk-awtest1
[2017-10-16T13:54:17,743][WARN ][o.e.c.l.LogConfigurator  ] ignoring unsupported logging configuration file [/etc/elasticsearch/test-work2-elk-awtest1/logging.yml], logging is configured via [/etc/elasticsearch/test-work2-elk-awtest1/log4j2.properties]
...done.

2017-10-16 15:54:18,299 INFO     Server loc2elastic2-test-work2-elk-awtest1 has not joined the cluster yet. Waiting 5 more seconds.
2017-10-16 15:54:23,393 INFO     Server loc2elastic2-test-work2-elk-awtest1 has not joined the cluster yet. Waiting 5 more seconds.
2017-10-16 15:54:28,612 INFO     Server loc2elastic2-test-work2-elk-awtest1 has joined the cluster.
2017-10-16 15:54:58,710 INFO     Changing shard allocation to all resulted in: {"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
2017-10-16 15:54:58,710 INFO     Retrying...
2017-10-16 15:55:28,819 INFO     Changing shard allocation to all resulted in: {"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
2017-10-16 15:55:28,819 INFO     Retrying...
2017-10-16 15:55:58,913 INFO     Changing shard allocation to all resulted in: {"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
2017-10-16 15:55:58,913 INFO     Retrying...
2017-10-16 15:56:21,683 INFO     Changing shard allocation to all resulted in: {"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}
2017-10-16 15:56:21,879 INFO     Waiting for green, current cluster state is: yellow
2017-10-16 15:56:26,981 INFO     Waiting for green, current cluster state is: yellow
2017-10-16 15:56:32,162 INFO     Waiting for green, current cluster state is: green
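
For reference, the script is essentially issuing the equivalent of these API calls (a sketch of the same steps; substitute whichever node your tooling talks to):

# disable shard allocation before restarting a node
curl -s -XPUT "http://$( hostname -f ):9200/_cluster/settings" -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'

# ... restart the Elasticsearch service on the node, wait for it to rejoin ...

# re-enable shard allocation once the node has rejoined
curl -s -XPUT "http://$( hostname -f ):9200/_cluster/settings" -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'

# wait until the cluster reports green
curl -s "http://$( hostname -f ):9200/_cluster/health?pretty"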

Any idea if I have configured something wrong?

Take care,
Alex

Hi Alex,

What does your Elasticsearch config look like? Also, what does something like curl -s 'ES-HOST:9200/_cat/nodes' show?

It almost sounds like you only have one master-eligible node in the cluster, or maybe there are firewall rules blocking the other master-eligible nodes.
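
If you want to rule out the firewall theory, check that each node can reach the others on the transport port (9300 by default). Something like this (hostname is just a placeholder) should connect from every node to every other node:

nc -vz es-host1 9300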

-AB

P.S. This is what I see in one of my ES clusters:

# curl -s 'es-host0:9200/_cat/nodes'
10.0.10.95 31 99 4 2.02 2.08 2.07 di - es-host1-es-00
10.0.10.97 60 96 8 4.35 3.20 2.71 md * es-host0-es-03
10.0.10.97 25 96 8 4.35 3.20 2.71 di - es-host0-es-02
10.0.10.95 65 99 4 2.02 2.08 2.07 di - es-host1-es-01
10.0.10.95 64 99 3 2.02 2.08 2.07 di - es-host1-es-02
10.0.10.97 68 96 8 4.35 3.20 2.71 md - es-host0-es-04
10.0.10.97 44 96 7 4.35 3.20 2.71 di - es-host0-es-01
10.0.10.95 41 99 4 2.02 2.08 2.07 md - es-host1-es-03
10.0.10.95 34 99 4 2.02 2.08 2.07 md - es-host1-es-04
10.0.10.97 38 96 7 4.35 3.20 2.71 di - es-host0-es-00

The * marks the current master (second row, 9th column).
An m in the 8th column means the node is master-eligible.
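
If counting columns is fiddly, you can also ask the cat API for named headers explicitly, e.g.:

curl -s 'es-host0:9200/_cat/nodes?v&h=ip,node.role,master,name'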

Hi,

and thanks for your answer. All of my nodes are master-eligible:

root@loc2elastic0.elk.awtest1.work2.test.be-stg1:~# curl http://$( hostname -f ):9200/_cat/nodes
10.31.253.31 14 73 0 0.01 0.02 0.00 mdi - loc2elastic4-test-work2-elk-awtest1
10.31.253.17 13 72 0 0.00 0.00 0.00 mdi - loc2elastic2-test-work2-elk-awtest1
10.31.253.14 13 74 0 0.05 0.03 0.00 mdi - loc2elastic0-test-work2-elk-awtest1
10.31.253.18  8 70 0 0.00 0.00 0.00 mdi * loc2elastic3-test-work2-elk-awtest1
10.31.253.16 16 72 0 0.00 0.00 0.00 mdi - loc2elastic1-test-work2-elk-awtest1

The config looks like this (I removed all comments and anonymized the hostnames):

---
bootstrap.memory_lock: true
cluster.name: test-work2-elk-awtest1
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts:
- loc2elastic1.elk.awtest1.work2.test.xxx
- loc2elastic2.elk.awtest1.work2.test.xxx
- loc2elastic3.elk.awtest1.work2.test.xxx
- loc2elastic4.elk.awtest1.work2.test.xxx
discovery.zen.ping_timeout: 60s
http.host: 10.31.253.14
http.port: 9200
network.host: 10.31.253.14
network.publish_host: 10.31.253.14
node.data: true
node.ingest: true
node.master: true
node.name: loc2elastic0-test-work2-elk-awtest1
path.data: "/var/lib/elasticsearch/data/test-work2-elk-awtest1"
path.logs: "/var/log/elasticsearch/test-work2-elk-awtest1"
thread_pool.bulk.queue_size: 500
xpack.security.enabled: false

Does that help? The configuration comes mostly from an Elasticsearch 2 cluster; I think we adapted it to 5 where necessary.
There are no firewalls between the Elasticsearch nodes!
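
If it helps, here is how I would dump the settings the node actually runs with (a sketch, querying the local node via the http.host from the config above):

curl -s 'http://10.31.253.14:9200/_nodes/_local/settings?pretty'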

Thanks for your help,
Alex

Not sure if it's just a copy/paste error, but the host names in your discovery.zen.ping.unicast.hosts do not match the host names in your curl output.

Actually no.

The curl output has:
loc2elastic4-test-work2-elk-awtest1

the config has the FQDN:
loc2elastic1.elk.awtest1.work2.test.(datacenter).(company).io

So the node.name is not equal to the hostname. Is that a problem?
In discovery.zen.ping.unicast.hosts I have listed all the hostnames (not the names configured with node.name!).

It's just me mixing stuff up... They don't match in my setup either :stuck_out_tongue:

But the hostnames in your Elasticsearch config do have to resolve via DNS.
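
A quick check on each node, using one of the hostnames from your config (.xxx standing in for the real domain):

getent hosts loc2elastic1.elk.awtest1.work2.test.xxx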

The hostnames are all resolvable :slight_smile: and the cluster is green before the restart.
