I'm having trouble getting recovery to use sync markers to recover replica shards from local copies during planned full cluster restarts. Recovery from "sealed" indices works for roughly half of the replicas; the other half get copied over the network for no apparent reason. I'm currently running v1.6.2.
My restart flow is as follows (a rough sketch of it as API calls follows the list):

1. Stop all indexing.
2. Execute a synced flush on all indices.
3. Disable shard allocation.
4. Perform a rolling restart of all data nodes.
5. Re-enable shard allocation after ensuring all data nodes have rejoined the cluster.
6. Once the cluster is fully recovered, resume indexing.
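For reference, this is roughly what the flow looks like as calls against the REST API from Python with `requests`. The host, timeout, and the fact that I use a transient setting are placeholders/assumptions, not necessarily how my actual tooling is wired up:

```python
# Sketch of the restart flow above against the ES 1.x REST API.
import requests

ES = "http://localhost:9200"  # placeholder host

def set_allocation(enabled):
    """Toggle cluster.routing.allocation.enable between 'all' and 'none'."""
    body = {"transient": {"cluster.routing.allocation.enable": "all" if enabled else "none"}}
    requests.put(ES + "/_cluster/settings", json=body).raise_for_status()

def synced_flush_all():
    """Write sync markers (sync_id) to every shard of every index."""
    requests.post(ES + "/_flush/synced").raise_for_status()

def wait_for_status(status="green", timeout="30m"):
    """Block until the cluster reaches the given health status."""
    r = requests.get(ES + "/_cluster/health",
                     params={"wait_for_status": status, "timeout": timeout})
    r.raise_for_status()
    return r.json()

# 1-2. Indexing is stopped on the application side, then the indices are sealed.
synced_flush_all()
# 3. Disable shard allocation before taking nodes down.
set_allocation(False)
# 4. Rolling restart of all data nodes happens out of band here.
# 5. Once all data nodes have rejoined, re-enable allocation.
set_allocation(True)
# 6. Wait for full recovery before resuming indexing.
print(wait_for_status("green"))
```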
My suspicion is that as unassigned shards get reallocated, the allocator is unaware that some nodes already have local copies of those shards and assigns them to some other node, preventing local recovery. Watching the cluster state corroborates this -- some nodes end up with far more shards than others.
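This is roughly how I've been watching the skew, counting started shards per node from the cat API (host is a placeholder; the column selection via `h=` is just to keep the parsing simple):

```python
# Count assigned shards per node to spot uneven allocation after the restart.
import collections
import requests

ES = "http://localhost:9200"  # placeholder host

def shards_per_node():
    # Ask _cat/shards for just the state and node columns.
    text = requests.get(ES + "/_cat/shards", params={"h": "state,node"}).text
    counts = collections.Counter()
    for line in text.splitlines():
        state, _, node = line.partition(" ")
        if state == "STARTED":
            counts[node.strip()] += 1
    return counts

print(shards_per_node())
```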
Say we have a cluster with data nodes A, B, and C and an index a with 3 shards and 1 replica (we'll name shards a1p for primary shard 1 of index a, a0r for replica shard 0 of index a). At steady state, allocation for these shards will be { A: [a0p, a1r], B: [a1p, a2r], C: [a2p, a0r] }. After a rolling restart of A, B, and C in that order, but before re-enabling shard allocation, the allocation ends up being { A: [a0p, a1p], B: [a2p], unallocated: [a0r, a1r, a2r] }. After enabling shard allocation, a0r gets allocated on B, but of course B doesn't have that shard. B ends up copying it over the network from A. a0r could have been allocated to C (which would clearly be preferable), but for whatever reason it isn't.
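For what it's worth, here's how I've been tallying which shard copies came back from local data versus being rebuilt from a peer, by bucketing the `type` column of the cat recovery output. Treat the column name and its interpretation as my reading of the output on this version, not gospel:

```python
# Tally recovery types after the restart to see how many shard copies were
# rebuilt from another node over the network instead of from the local store.
import collections
import requests

ES = "http://localhost:9200"  # placeholder host

resp = requests.get(ES + "/_cat/recovery", params={"v": ""})
resp.raise_for_status()
lines = resp.text.splitlines()

# The first line is the header (because of ?v); locate the `type` column
# instead of hard-coding its position.
header = lines[0].split()
type_idx = header.index("type")

counts = collections.Counter()
for line in lines[1:]:
    fields = line.split()
    if len(fields) > type_idx:
        counts[fields[type_idx]] += 1

print(counts)
```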
Does this make sense at all, and if so, is this a known issue? Synced flushes definitely speed up recovery, and it'd be great if we reused all local shard copies when reallocating.