Multi-node failure

Hi all,
For operational continuity in a multi-node cluster I have two questions in my head:
The cluster contains 14 nodes, 7 nodes on site A and 7 nodes on site B, ES version 6.4.

What if a) one site goes down unexpectedly and can be restored after, let's say, a power failure? In this case the cluster routing allocation is set to "all" the whole time. Does Elasticsearch recover by itself?
And b) one site goes down for planned work and the routing allocation is set to "none" in advance?

Of course, all the shards have at least one replica and are well distributed over the cluster.
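
(For reference, the routing setting I mean in a) and b) is the cluster-wide allocation switch. A rough sketch of how it can be toggled via the cluster settings API, assuming a node reachable on localhost:9200:)

```sh
# Disable shard allocation before planned maintenance (scenario b)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# Re-enable allocation once the site is back (the default, as in scenario a)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```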

Some days ago we had situation b) because of a relocation of site B to site C. After the relocation and booting the nodes on site C, all the nodes that should have been reallocating unassigned shards were throttled according to the recovery settings (simultaneous recoveries were set to 8). The only thing we could do to get to a green state was to reduce the affected shards' replicas to 0.
What is the correct way to relocate 50% of all the nodes?
Buy new nodes, place them on site C, add the site C nodes to the cluster, and then shut down the nodes on site B?
Or shut down all nodes on site B without disabling routing?

Thank you very much for your inputs!
A

In a correctly configured cluster you need 3 zones to achieve high availability, as a strict majority of master-eligible nodes needs to be available. With 14 nodes split evenly across two zones, your cluster should go into read-only mode if one full site is lost, unless you have more master-eligible nodes in the zone that stayed up. Make sure minimum_master_nodes is set correctly, as otherwise you could experience data loss without even realising it. If all your nodes are master eligible it should be set to 8, as that is the majority of 14. I would recommend adding a dedicated master node as a tie breaker in a third zone.
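
(For a 6.x cluster where all 14 nodes are master eligible, that would look roughly like this in elasticsearch.yml; just a sketch, and with 3 dedicated master nodes the majority would be 2 instead:)

```yaml
# elasticsearch.yml on every master-eligible node (ES 6.x / Zen discovery)
# majority of 14 master-eligible nodes = 8
discovery.zen.minimum_master_nodes: 8
```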

Hi @Christian_Dahlqvist
You are absolutely right about the master nodes! And of course, we have 3 master nodes running on completely different infrastructure.
For my two scenarios a) and b) above we can say that all 3 master nodes stay reachable.

Are you using shard allocation awareness to make sure shards are distributed across the zones?

Yes.
I've set node.attr.rack_id: site_a and node.attr.rack_id: site_b
according to the site on which the node is running.
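
(For completeness, the attribute alone isn't enough; the allocator also has to be told to use it. A minimal sketch of the relevant elasticsearch.yml settings as I understand our setup:)

```yaml
# On nodes in site A
node.attr.rack_id: site_a
# On nodes in site B
node.attr.rack_id: site_b

# On all nodes: make the allocator aware of the attribute
cluster.routing.allocation.awareness.attributes: rack_id
```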

If you have allocation awareness set up, the replica shards will not get reassigned while zone B is down, so indices will be yellow. Once the zone comes back up the shards will need to be brought up to speed. How long this takes depends on the shard count, data volume, recovery concurrency, and whether the full shards need to be replaced or not. This needs to happen even if you bring up zone C before migrating off zone B, but at least this way you always have a replica in place, which is less risky.
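
(Keeping replicas unassigned while a whole zone is down, rather than cramming all copies onto the surviving zone, is the behaviour of forced awareness; a sketch of that setting, assuming the rack_id values from above:)

```yaml
# Forced awareness: never allocate more shard copies per zone than the
# listed values allow, even while the other zone is completely down
cluster.routing.allocation.awareness.attributes: rack_id
cluster.routing.allocation.awareness.force.rack_id.values: site_a,site_b
```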

Yes, this was my thought as well.
There were approximately 2000 shards to recover, with at least a 1G interconnect. The recovery concurrency was set to 8 for incoming and outgoing, but this placed the recovering nodes into a throttled state, which ended up in a deadlock without any progress of recovery.
None of the nodes performed any recovery tasks anymore because of the throttled state, which confused me!
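
(For reference, these are the concurrency settings I had raised to 8; a sketch of the call, and the values shown are what we used rather than a recommendation:)

```sh
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 8,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 8
  }
}'
```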

I also think that bringing up zone C first is less risky and a good solution. But the CFO thinks differently... :slight_smile:

That doesn't make sense to me; throttling will not lead to a deadlock here. Adjusting the concurrent recoveries setting may lead to a deadlock, although you tend to need to set it higher than 8 to hit that issue. I recommend leaving it at the default anyway. 6.4 is long past the end of its supported life, so you should upgrade for sure. Newer versions have seen improvements in recovery performance.

The manual for more recent versions covers your questions in some depth. Two zones of data nodes are sufficient, but you do need three for your master-eligible nodes.

I think I nailed it down.
The issue was network related. Between DC A and DC B there was an Ethernet link with an MTU of 9000 bytes, and an MTU of 9000 was set on all nodes as well.
Between DC A and DC C, however, the link only supported an MTU of 1500 bytes.
The shard replication failed because all nodes tried to send the data with an MTU of 9000 bytes!
So I changed the MTU of all nodes in DC C down to 1500 bytes, knowing that it is a downgrade. However, for our amount of data this should be sufficient.
After setting the MTU to 1500 bytes on the nodes in DC C, all the replicas were successfully created.
So far, Elasticsearch has worked as expected.
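
(For anyone curious, the change on the DC C nodes was essentially just the interface MTU; the interface name below is only an example:)

```sh
# Drop the interface MTU to match the 1500-byte path between DC A and DC C
ip link set dev eth0 mtu 1500

# Verify
ip link show dev eth0 | grep -o 'mtu [0-9]*'
```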

Thank you guys for your replies!

Thanks for circling back @cadirol, that makes much more sense. Shard recoveries send data at quite a high rate, so you're more likely to exceed something like an MTU limit during recovery.

Can I ask how you found this, as I'd never have thought to check MTU? I guess a good sniff would work, or maybe there's a NIC stat, but likely that'd show up on the link gear, not the node.

Hi @Steve_Mushero
You are right, on the ES side you will never get a log message concerning the network MTU!
My way was: I had an index with 3 replicas. With two availability zones you get the primary and one replica on DC A and the two other replicas on DC B. The shard recovery worked fine for this particular shard on DC A, but not to DC B! So I knew there must be something in between. And as I am a Linux/Elasticsearch/network engineer, I have access to all related devices.
The next step was to check network connectivity on the affected Cisco Nexus switches; with "show interface xy" you will see counters called "jumbo" and "giants". And that was it. I checked the network config on the ES node and found the MTU setting of 9000 bytes.
A ping test between the nodes with "ping -s 3000 " confirmed the MTU issue.
Then I set the MTU on all nodes in DC B to 1500 bytes, and everything worked as expected.
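
(If someone wants to reproduce the check: a sketch of the ping test with the don't-fragment bit set, so packets larger than the path MTU fail instead of being silently fragmented; the host name is only an example:)

```sh
# A 3000-byte payload exceeds a 1500-byte MTU but fits into jumbo frames.
# With -M do the packet must not be fragmented, so a 1500-byte path rejects it.
ping -M do -s 3000 -c 3 es-node-dc-b.example.com
```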

It would be a nice feature if Elasticsearch could detect an MTU issue! But I don't know whether it is worth posting a feature request?!

Anyway, happy Elasticsearching, and never forget the MTU of your network!


I don't think we can do that; userspace applications like Elasticsearch don't get to see this sort of detail. All that Elasticsearch sees is that some node-to-node messages don't get through and eventually the connection drops. I wouldn't call this a "deadlock" though; you would see shard recoveries failing with a NodeDisconnectedException or similar and giving up after a few retries. Fortunately a NodeDisconnectedException is a clear indication that there's something wrong on the network, and with that clue it shouldn't take long to find the culprit.

We recently added docs that would help you to configure Elasticsearch to fail faster in this situation at least.
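
(In short, those docs are about lowering the Linux TCP retransmission limit so dead connections are noticed sooner; roughly along these lines, assuming a Linux host where you can set sysctls:)

```sh
# Fewer TCP retransmissions before giving up on a dead connection
sysctl -w net.ipv4.tcp_retries2=5

# Persist across reboots
echo "net.ipv4.tcp_retries2=5" > /etc/sysctl.d/99-elasticsearch.conf
```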

Incidentally, there's nothing inherently wrong with having a mix of MTUs on your network unless it also does something silly like blocking ICMP fragmentation-needed messages, or you've otherwise hobbled PMTUD. Probably worth investigating that more deeply; it'd be better to use jumbo frames where you can.

Thank you for this hint! If I understand right, when a NodeDisconnectedException occurs in the logs, the node has vanished from the cluster inventory, right? I think that during our journey of restoring the cluster, all nodes were connected to the cluster!

The TCP retransmission timeout would be worth a test! Thank you!

No, that's not right; you can get a NodeDisconnectedException without the node leaving the cluster if, for instance, the disconnection is between two data nodes and doesn't involve the master.

Oh, thank you for this hint! Yes, something like a NodeDisconnectedException would point to a network issue!
