We're in the process of reconfiguring our Elasticsearch clusters to have separate client and master nodes. While adding the master nodes was non-intrusive, the rolling restart of the data nodes after setting node.master to false turned out to be disruptive, causing a red cluster.
Our setup, outlined:
ES 1.7.5
4 client nodes
12 data nodes
3 master nodes
Our elasticsearch.yml has:
discovery.zen.ping.unicast.hosts: ["esmaster1", "esmaster2", "esmaster3", "esclient1", "esclient2", "esclient3", "esclient4", "esdata1", "esdata2", "esdata3", "esdata4", "esdata5", "esdata6", "esdata7", "esdata8", "esdata9", "esdata10", "esdata11", "esdata12"]
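For reference, the node-role settings implied by the setup above look roughly like this (a sketch of the relevant elasticsearch.yml lines per node type, not copied verbatim from our files):

    # master nodes (esmaster1-3): master-eligible, no data
    node.master: true
    node.data: false

    # client nodes (esclient1-4): neither master-eligible nor data
    node.master: false
    node.data: false

    # data nodes (esdata1-12): data only, after setting node.master to false
    node.master: false
    node.data: true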
Before the reconfig, discovery.zen.minimum_master_nodes was set to 6, i.e. int((number of master-eligible nodes / 2) + 1). After the reconfig, we set it to 2, based on the 3 dedicated master nodes.
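Concretely, that change amounts to something like the following (a sketch, assuming the setting is applied dynamically through the cluster settings API on localhost:9200; it is also kept in elasticsearch.yml):

    # lower minimum_master_nodes to match the 3 dedicated master nodes
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "persistent": { "discovery.zen.minimum_master_nodes": 2 }
    }'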
The rolling restart was done as follows for each node (a sketch of the corresponding API calls follows the list):
- stop all indexers
- disable shard allocation
- shut down Elasticsearch on the node via the API
- restart the Elasticsearch service
- re-enable shard allocation
- wait until cluster is green
- start indexers
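For completeness, a minimal sketch of the API calls behind the middle steps, assuming each node answers on localhost:9200 and Elasticsearch runs as a system service (the indexer start/stop commands are specific to our environment and left out):

    # disable shard allocation so shards don't start relocating
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.enable": "none" }
    }'

    # shut down Elasticsearch on this node via the 1.x shutdown API
    curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'

    # start the service again (with node.master: false in elasticsearch.yml)
    sudo service elasticsearch start

    # re-enable shard allocation
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.enable": "all" }
    }'

    # block until the cluster reports green
    curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30m'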
Once the indexers had caught up with real time, I would proceed with the next node. This went according to plan until the last data node was restarted, which happened to be the elected master at the time (I saved it for last; perhaps I shouldn't have?).
These graphs show exactly the timeframe during which we were impacted:
[continuing in a follow-up post as I have exceeded my 5000 character max]