Multi-node failure

Hi all,
For operational continuity in a multi-node cluster I have two questions in my head:
The cluster contains 14 nodes, 7 nodes on site A and 7 nodes on site B, ES version 6.4.

What if a) one site goes down unexpectedly and can be restored after, let's say, a power failure? In this case the cluster routing allocation is set to "all" the whole time. Does Elasticsearch recover by itself?
And b) one site goes down for planned work and the routing allocation is set to "none" in advance?

Of course, all the shards have at least one replica and are well distributed over the cluster.
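
(For reference, the routing setting I mean in a) and b) is the cluster-wide allocation switch. A rough sketch of how it can be toggled via the cluster settings API, assuming a node reachable on localhost:9200:)

```sh
# Disable shard allocation before planned maintenance (scenario b)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# Re-enable allocation once the site is back (the default, as in scenario a)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```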

Some days ago we had situation b) because of a relocation of site B to site C. After the relocation and booting the nodes on site C, all the nodes that should have been reallocating unassigned shards were throttled according to the recovery settings (simultaneous recoveries were set to 8). The only thing we could do to get to a green state was to reduce the affected shards' replicas to 0.
What is the correct way to relocate 50% of all the nodes?
Buy new nodes, place them on site C, add the site C nodes to the cluster, and then shut down the nodes on site B?
Or shut down all nodes on site B without disabling routing?

Thank you very much for your inputs!
A

In a correctly configured cluster you need 3 zones to achieve high availability, as a strict majority of master-eligible nodes needs to be available. With 14 nodes split evenly across two zones, your cluster should go into read-only mode if one full site is lost, unless you have more master-eligible nodes in the zone that stayed up. Make sure minimum_master_nodes is set correctly, as otherwise you could experience data loss without even realising it. If all your nodes are master eligible it should be set to 8, as that is the majority of 14. I would recommend adding a dedicated master node as a tie breaker in a third zone.
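
(For a 6.x cluster where all 14 nodes are master eligible, that would look roughly like this in elasticsearch.yml; just a sketch, and with 3 dedicated master nodes the majority would be 2 instead:)

```yaml
# elasticsearch.yml on every master-eligible node (ES 6.x / Zen discovery)
# majority of 14 master-eligible nodes = 8
discovery.zen.minimum_master_nodes: 8
```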

Hi @Christian_Dahlqvist
You are absolutely right about the master nodes! And of course, we have 3 master nodes running on completely different infrastructure.
For my two scenarios a) and b) above we can say that all 3 master nodes stay reachable.

Are you using shard allocation awareness to make sure shards are distributed across the zones?

Yes.
I've set node.attr.rack_id: site_a and node.attr.rack_id: site_b
according to the site on which the node is running.
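
(For completeness, the attribute alone isn't enough; the allocator also has to be told to use it. A minimal sketch of the relevant elasticsearch.yml settings as I understand our setup:)

```yaml
# On nodes in site A
node.attr.rack_id: site_a
# On nodes in site B
node.attr.rack_id: site_b

# On all nodes: make the allocator aware of the attribute
cluster.routing.allocation.awareness.attributes: rack_id
```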

If you have allocation awareness set up, the replica shards will not get reassigned while zone B is down, so indices will be yellow. Once the zone comes back up the shards will need to be brought up to speed. How long this takes depends on the shard count, data volume, recovery concurrency, and whether the full shards need to be replaced or not. This needs to happen even if you bring up zone C before migrating off zone B, but at least this way you always have a replica in place, which is less risky.
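
(Keeping replicas unassigned while a whole zone is down, rather than cramming all copies onto the surviving zone, is the behaviour of forced awareness; a sketch of that setting, assuming the rack_id values from above:)

```yaml
# Forced awareness: never allocate more shard copies per zone than the
# listed values allow, even while the other zone is completely down
cluster.routing.allocation.awareness.attributes: rack_id
cluster.routing.allocation.awareness.force.rack_id.values: site_a,site_b
```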

Yes, this was my thought as well.
There were approximately 2000 shards to recover, with at least a 1G interconnect. The recovery concurrency was set to 8 for incoming and outgoing, but this placed the recovering nodes into a throttled state, which ended up in a deadlock without any progress of recovery.
None of the nodes performed any recovery tasks anymore because of the throttled state, which confused me!
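
(For reference, these are the concurrency settings I had raised to 8; a sketch of the call, and the values shown are what we used rather than a recommendation:)

```sh
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 8,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 8
  }
}'
```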

I also think that bringing up zone C first is less risky and a good solution. But the CFO thinks differently... :slight_smile:

That doesn't make sense to me; throttling will not lead to a deadlock here. Adjusting the concurrent recoveries setting may lead to a deadlock, although you tend to need to set it higher than 8 to hit that issue. I recommend leaving it at the default anyway. 6.4 is long past the end of its supported life, so you should upgrade for sure. Newer versions have seen improvements in recovery performance.

The manual for more recent versions covers your questions in some depth. Two zones of data nodes are sufficient, but you do need three for your master-eligible nodes.

I think I nailed it down.
The issue was network related. Between DC A and DC B there was an Ethernet link with an MTU of 9000 bytes, and an MTU of 9000 was set on all nodes as well.
Between DC A and DC C, however, the link only supported an MTU of 1500 bytes.
The shard replication failed because all nodes tried to send the data with an MTU of 9000 bytes!
So I changed the MTU of all nodes in DC C down to 1500 bytes, knowing that it is a downgrade. However, for our amount of data this should be sufficient.
After setting the MTU to 1500 bytes on the nodes in DC C, all the replicas were successfully created.
So far, Elasticsearch has worked as expected.
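
(For anyone curious, the change on the DC C nodes was essentially just the interface MTU; the interface name below is only an example:)

```sh
# Drop the interface MTU to match the 1500-byte path between DC A and DC C
ip link set dev eth0 mtu 1500

# Verify
ip link show dev eth0 | grep -o 'mtu [0-9]*'
```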

Thank you guys for your replies!

Thanks for circling back @cadirol, that makes much more sense. Shard recoveries send data at quite a high rate, so you're more likely to exceed something like an MTU limit during recovery.

Can I ask how you found this, as I'd never have thought to check MTU? I guess a good sniff would work, or maybe there's a NIC stat, but likely that'd show up on the link gear, not the node.

Hi @Steve_Mushero
You are right, on the ES side you will never get a log message concerning the network MTU!
My way was: I had an index with 3 replicas. With two availability zones you get the primary and one replica on DC A and the two other replicas on DC B. The shard recovery worked fine for this particular shard on DC A, but not to DC B! So I knew there must be something in between. And as I am a Linux/Elasticsearch/network engineer, I have access to all related devices.
The next step was to check network connectivity on the affected Cisco Nexus switches; with "show interface xy" you will see counters called "jumbo" and "giants". And that was it. I checked the network config on the ES node and found the MTU setting of 9000 bytes.
A ping test between the nodes with "ping -s 3000 " confirmed the MTU issue.
Then I set the MTU on all nodes in DC B to 1500 bytes, and everything worked as expected.
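
(If someone wants to reproduce the check: a sketch of the ping test with the don't-fragment bit set, so packets larger than the path MTU fail instead of being silently fragmented; the host name is only an example:)

```sh
# A 3000-byte payload exceeds a 1500-byte MTU but fits into jumbo frames.
# With -M do the packet must not be fragmented, so a 1500-byte path rejects it.
ping -M do -s 3000 -c 3 es-node-dc-b.example.com
```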

It would be a nice feature if Elasticsearch could detect an MTU issue! But I don't know whether it is worth posting a feature request?!

Anyway, happy Elasticsearching, and never forget the MTU of your network!


I don't think we can do that; userspace applications like Elasticsearch don't get to see this sort of detail. All that Elasticsearch sees is that some node-to-node messages don't get through and eventually the connection drops. I wouldn't call this a "deadlock" though; you would see shard recoveries failing with a NodeDisconnectedException or similar and giving up after a few retries. Fortunately a NodeDisconnectedException is a clear indication that there's something wrong on the network, and with that clue it shouldn't take long to find the culprit.

We recently added docs that would help you to configure Elasticsearch to fail faster in this situation at least.
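
(In short, those docs are about lowering the Linux TCP retransmission limit so dead connections are noticed sooner; roughly along these lines, assuming a Linux host where you can set sysctls:)

```sh
# Fewer TCP retransmissions before giving up on a dead connection
sysctl -w net.ipv4.tcp_retries2=5

# Persist across reboots
echo "net.ipv4.tcp_retries2=5" > /etc/sysctl.d/99-elasticsearch.conf
```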

Incidentally, there's nothing inherently wrong with having a mix of MTUs on your network unless it also does something silly like blocking ICMP fragmentation-needed messages, or you've otherwise hobbled PMTUD. Probably worth investigating that more deeply; it'd be better to use jumbo frames where you can.

Thank you for this hint! If I understand right, when a NodeDisconnectedException occurs in the logs, the node has vanished from the cluster inventory, right? I think that during our journey of restoring the cluster, all nodes were connected to the cluster!

The TCP retransmission timeout would be worth a test! Thank you!

No, that's not right; you can get a NodeDisconnectedException without the node leaving the cluster if, for instance, the disconnection is between two data nodes and doesn't involve the master.

Oh, thank you for this hint! Yes, something like a NodeDisconnectedException would point to a network issue!
