Full cluster restart of a 400-node cluster

Since I'm able to stop writes, performing updates via a full cluster restart seems more attractive than a rolling restart. However, I'm running into problems getting the cluster out of red. On the current master, I see the following even after the data node has come back up and reports ready:

[2019-06-03T22:51:59,958][WARN ][o.e.c.NodeConnectionsService] [elasticsearch-foo-master-2] failed to connect to node {elasticsearch-foo-data-294}{a}{b}{192.168.122.56}{192.168.122.56:9300}{} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elasticsearch-foo-data-294][192.168.122.56:9300] connect_timeout[30s]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1316) ~[elasticsearch-6.7.1.jar:6.7.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-6.7.1.jar:6.7.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]

Things I've tried:

  • Set discovery.zen.publish_timeout: 180s
  • Set discovery.zen.commit_timeout: 180s
  • Set discovery.zen.master_election.ignore_non_master_pings: true
  • Restart the master, hoping that a newly elected one would be able to handle the updates
  • Keep the master down since a simple restart wouldn't trigger a re-election

This is 6.7.1, and there are 11 indices, each with 150 shards and 1 replica. Any settings I should be looking at? Loggers to crank up the verbosity on? Would it be better to restart nodes in smaller batches (e.g. 25-100)? Thanks!
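
For completeness, here's roughly how those settings are laid out in elasticsearch.yml (just a sketch of the relevant lines, with the values from the list above):

    # Zen discovery settings I've tried, set in elasticsearch.yml (6.x).
    discovery.zen.publish_timeout: 180s
    discovery.zen.commit_timeout: 180s
    discovery.zen.master_election.ignore_non_master_pings: true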

The exception you quote indicates that elasticsearch-foo-master-2 timed out after 30 seconds while trying to connect to elasticsearch-foo-data-294. That's normally an infrastructural problem - are you sure that these nodes have connectivity?

Is this literally the only message you're seeing, or are there others?

If it's not an infrastructural issue then I'd be interested in working out what exactly is going wrong with this restart. But if you're more concerned with getting the cluster back online, I'd recommend starting the masters first and letting them form a cluster, then starting the data nodes in smaller batches and allowing the cluster to settle after each batch has joined.

You might want to set gateway.recover_after_data_nodes on the master nodes to delay the state recovery process until all the data nodes have joined.
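
Something along these lines in the master-eligible nodes' elasticsearch.yml, for instance (400 here just matches your data node count; pick whatever threshold suits you):

    # Illustrative values only - set on the master-eligible nodes.
    gateway.recover_after_data_nodes: 400   # don't start state recovery until this many data nodes have joined
    gateway.expected_data_nodes: 400        # (optional) begin recovery as soon as all the data nodes are present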

Thanks for the response! I misspoke about "full cluster restart" - I'm only restarting all of the data nodes. I would roll the master nodes after the data nodes are updated, unless there's a good reason to upgrade the masters first.

This is in Kubernetes, so there are a lot of moving parts that can fail :slight_smile: Our network has been pretty reliable though, with plenty of headroom.

My next experiment will be upgrading in smaller batches (25, 50, or 100 nodes at a time). As for whether that was the only message: probably not, but I'm pretty sure there were no other messages of importance, and I didn't want to waste a lot of time redacting. I'll pay closer attention next time to be 100% sure.
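
Roughly, one way to do that is sketched below, assuming a StatefulSet named elasticsearch-foo-data (names and counts are illustrative): switch the update strategy to OnDelete, then delete pods in groups of 25-100 and let the cluster settle before the next group.

    # Illustrative sketch only - assumes a StatefulSet named elasticsearch-foo-data
    # with 400 replicas; everything not shown (template, volumes, etc.) is unchanged.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: elasticsearch-foo-data
    spec:
      replicas: 400
      updateStrategy:
        type: OnDelete   # pods only pick up the new spec when deleted, so they can
                         # be deleted in batches of 25-100 and the cluster allowed
                         # to settle before the next batch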

Ah, ok, I have a theory. Are you restarting the data nodes as new containers, with new IP addresses? If so, was [elasticsearch-foo-data-294][192.168.122.56:9300] the old IP address of that node? Are you letting Kubernetes restart the containers essentially at random or are you shutting all the nodes down, letting things settle, and then starting them all back up again?
