Full cluster restart of a 400-node cluster

Since I'm able to stop writes, performing updates via a full cluster restart seems more attractive than a rolling restart. However, I'm running into problems getting the cluster out of red. On the current master, I see the following even after the data node has come back up and reports ready:

[2019-06-03T22:51:59,958][WARN ][o.e.c.NodeConnectionsService] [elasticsearch-foo-master-2] failed to connect to node {elasticsearch-foo-data-294}{a}{b}{192.168.122.56}{192.168.122.56:9300}{} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elasticsearch-foo-data-294][192.168.122.56:9300] connect_timeout[30s]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1316) ~[elasticsearch-6.7.1.jar:6.7.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-6.7.1.jar:6.7.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]

Things I've tried:

  • Set discovery.zen.publish_timeout: 180s
  • Set discovery.zen.commit_timeout: 180s
  • Set discovery.zen.master_election.ignore_non_master_pings: true
  • Restart the master, hoping that a newly elected one would be able to handle the updates
  • Keep the master down since a simple restart wouldn't trigger a re-election

This is 6.7.1, and there are 11 indices, each with 150 shards and 1 replica. Any settings I should be looking at? Loggers to crank up the verbosity on? Would it be better to restart nodes in smaller batches (e.g. 25-100)? Thanks!
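
For completeness, here's roughly how those settings are laid out in elasticsearch.yml (just a sketch of the relevant lines, with the values from the list above):

    # Zen discovery settings I've tried, set in elasticsearch.yml (6.x).
    discovery.zen.publish_timeout: 180s
    discovery.zen.commit_timeout: 180s
    discovery.zen.master_election.ignore_non_master_pings: true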

The exception you quote indicates that elasticsearch-foo-master-2 timed out after 30 seconds while trying to connect to elasticsearch-foo-data-294. That's normally an infrastructural problem - are you sure that these nodes have connectivity?

Is this literally the only message you're seeing, or are there others?

If it's not an infrastructural issue then I'd be interested in working out what exactly is going wrong with this restart. But if you're more concerned with getting the cluster back online, I'd recommend starting the masters first and letting them form a cluster, then starting the data nodes in smaller batches and allowing the cluster to settle after each batch has joined.

You might want to set gateway.recover_after_data_nodes on the master nodes to delay the state recovery process until all the data nodes have joined.
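
Something along these lines in the master-eligible nodes' elasticsearch.yml, for instance (400 here just matches your data node count; pick whatever threshold suits you):

    # Illustrative values only - set on the master-eligible nodes.
    gateway.recover_after_data_nodes: 400   # don't start state recovery until this many data nodes have joined
    gateway.expected_data_nodes: 400        # (optional) begin recovery as soon as all the data nodes are present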

Thanks for the response! I misspoke about "full cluster restart" - I'm only restarting all of the data nodes. I would roll the master nodes after the data nodes are updated, unless there's a good reason to upgrade the masters first.

This is in Kubernetes, so there are a lot of moving parts that can fail :slight_smile: Our network has been pretty reliable though, with plenty of headroom.

My next experiment will be upgrading in smaller batches (25, 50, or 100 nodes at a time). As for whether that was the only message: probably not, but I'm pretty sure there were no other messages of importance, and I didn't want to waste a lot of time redacting. I'll pay closer attention next time to be 100% sure.
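
Roughly, one way to do that is sketched below, assuming a StatefulSet named elasticsearch-foo-data (names and counts are illustrative): switch the update strategy to OnDelete, then delete pods in groups of 25-100 and let the cluster settle before the next group.

    # Illustrative sketch only - assumes a StatefulSet named elasticsearch-foo-data
    # with 400 replicas; everything not shown (template, volumes, etc.) is unchanged.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: elasticsearch-foo-data
    spec:
      replicas: 400
      updateStrategy:
        type: OnDelete   # pods only pick up the new spec when deleted, so they can
                         # be deleted in batches of 25-100 and the cluster allowed
                         # to settle before the next batch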

Ah, ok, I have a theory. Are you restarting the data nodes as new containers, with new IP addresses? If so, was [elasticsearch-foo-data-294][192.168.122.56:9300] the old IP address of that node? Are you letting Kubernetes restart the containers essentially at random or are you shutting all the nodes down, letting things settle, and then starting them all back up again?
