We have a process (Patch Orchestration in Azure Service Fabric) that patches our VMs. Since I expect node reboots during this, I have set a node_left delay long enough to cover the reboot time (~10 min). This generally works well: the relocation waits for the allotted time, the node usually comes back up, and the shards are quickly initialized.

We recently had a scenario where a single node was having memory issues (resolved by manually rebooting the node once we tracked it down). Due to heavy GC cycles, the node was (I assume) having network drops, usually only for 1 to 4 minutes, yet the system was still kicking into relocation. This put us into almost constant relocation, because the cycle would repeat shortly after each relocation finished, and all the shard movement and initialization caused excess load on the system. I am trying to figure out why the relocation was not waiting (I resolved the issue with the reboot before thinking to check the _cat/shards API for the unassigned reason).
I see references to a number of other unassigned reasons (e.g., ALLOCATION_FAILED, REPLICA_ADDED). I am wondering whether the node being unreachable (due to the network drops) is being classified as something other than NODE_LEFT, so that the delay I have configured is not used. However, I have not seen any way to set a relocation delay for anything other than NODE_LEFT. Is there a global way to delay relocation?
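For reference, this is roughly what I have configured, and what I intend to run the next time it happens; a minimal sketch using the Kibana Dev Tools syntax (the 10m value is just our expected reboot window):

```
# Apply the node_left delay to all existing indices (new indices would also
# need it via an index template); 10m covers our expected reboot time.
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}

# Check why shards are unassigned the next time the node drops.
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason,unassigned.details
```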
The recovery delay only delays recoveries to different nodes. If the original node comes back, Elasticsearch will immediately start allocating shards back to it.
I have some telemetry, and I realize I was wrong: it is respecting the node_left delay. I have four identical clusters (in four different regions), and I am updating a setting that takes a node down for 4-5 minutes. On all four clusters, no relocation happens during that window. On three of the clusters, when the node comes back, the shards reinitialize within a couple of minutes. On the fourth cluster, it waits, but when the node comes back it starts pulling at least one shard from another node (i.e., relocating over the wire). As I said, all four clusters are identical (deployed from the same template), yet this only ever happens in one of them.
Is there any reason why a specific cluster would pull a shard from another node rather than just initializing the local copy?
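The next time I catch it in the act, I plan to ask the cluster directly why it chose that allocation; a minimal sketch (the index name and shard number are placeholders for the shard I see relocating):

```
# Explain the current allocation of a specific shard copy.
# "my-index" and shard 0 are placeholders.
GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": false
}
```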
More data, more clarity, more confusion. It looks like it is not just that one region that is affected. Things generally work correctly, but occasionally one shard ends up needing to be relocated.
In this chart, the red lines are the count of unassigned shards (12) when a node goes down. When the node comes back, most of the shards recover immediately (these are shards without much activity). The ones with ongoing indexing need to initialize (the orange lines), which usually only takes a couple of minutes. The long trailing orange lines are where an initialization is happening from another node (as evidenced by source_node and target_node in _cat/recovery).
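For reference, this is the _cat/recovery query I am using to tell the two cases apart; as far as I can tell, the type column shows peer when a shard is being copied over the wire and existing_store (store on older versions) when the local copy is reused:

```
# type: "peer" = copied from another node, "existing_store"/"store" = local copy reused.
GET _cat/recovery?v&active_only=true&h=index,shard,type,stage,source_node,target_node,bytes_percent,translog_ops_percent
```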
Is it possible that, since the node is back, the node_left delay is moot, but there are more recoveries than can happen in parallel (or something along those lines), so one of them ends up being a relocation?
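If parallelism is the issue, I assume the relevant limits would be the recovery and rebalance concurrency settings (cluster.routing.allocation.node_concurrent_recoveries and cluster.routing.allocation.cluster_concurrent_rebalance both default to 2, as far as I know). This is how I am checking whether anything has been overridden on that cluster:

```
# Show effective cluster settings, including defaults; look for the
# node_concurrent_recoveries and cluster_concurrent_rebalance values.
GET _cluster/settings?include_defaults=true&flat_settings=true
```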
Hmm I think it's possible Elasticsearch decides that it needs to move some shards around to rebalance the cluster when the node comes back, although I see that this is undesirable since the cluster should have been balanced before the node left. We're working on some changes to the shard allocator which should address this sort of corner case I think.
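In the meantime, if the extra movement during patch windows is causing trouble, one option is to temporarily disable rebalancing around the window and restore the default afterwards. This only stops rebalancing moves; recovery of unassigned shards is unaffected. For example:

```
# Disable shard rebalancing for the duration of the patch window...
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.rebalance.enable": "none"
  }
}

# ...and restore the default afterwards.
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.rebalance.enable": null
  }
}
```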