After I restart a node, or a node disconnects from the cluster, all of its data is copied from remote nodes. Why does it not recover from local data?

Elasticsearch version 5.2.2

Only a few indices receive writes; the rest have the write block (`index.blocks.write`) set.

After a node leaves, it takes 12+ hours for the cluster to become green again.

In 5.x and prior, recovery is file-based. A replica shard is recovered from a primary by comparing the files that it has on disk to the files that the primary has on disk. If the replica and primary had identical files on disk when the replica went offline, and the only changes to the primary while the replica was offline were additions of new segment files, then only those new files are copied over. However, this is rarely the case: a replica and a primary shard are independent Lucene indices, which means they flush segments and merge segments independently, so their files on disk are rarely the same. As a result, recovery is almost always a full recovery in which all segment files are copied from the primary to the replica shard.
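As a rough sketch (illustrative Python, not Elasticsearch internals), file-based recovery amounts to a diff over segment files, and a single merge on the primary can invalidate everything the replica holds:

```python
# Sketch (not Elasticsearch code): why file-based recovery usually copies
# everything. Segment file sets rarely match between primary and replica,
# because each Lucene index flushes and merges independently.

def files_to_copy(primary_files: dict, replica_files: dict) -> set:
    """Return the segment files the replica must fetch: every file whose
    name or checksum differs from the primary's copy."""
    return {
        name for name, checksum in primary_files.items()
        if replica_files.get(name) != checksum
    }

# Identical on-disk state: only the new segment is copied.
primary = {"_0.cfs": "aa11", "_1.cfs": "bb22"}
replica = {"_0.cfs": "aa11"}
print(files_to_copy(primary, replica))  # {'_1.cfs'}

# Independent merging: the primary merged _0 and _1 into _2, so nothing
# the replica holds matches, and the whole shard is copied over.
merged_primary = {"_2.cfs": "cc33"}
print(files_to_copy(merged_primary, replica))  # {'_2.cfs'}
```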

In 1.6, synced flush was added to help address this problem for idle shards. If a shard has not seen a write for an inactivity period (which defaults to five minutes), then the primary and replica shards write an identical marker (a sync ID) to disk. This marker is destroyed by any subsequent write to the index. When a replica falls offline and later recovers from the primary, the presence of matching markers short-circuits file-based recovery even though the files on disk might differ. This is good, but it does not help with actively-written indices.
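The short-circuit can be sketched the same way (illustrative names, not the real implementation): a file copy is skipped only when both copies carry the same non-empty marker:

```python
# Sketch (not Elasticsearch code): how a synced-flush marker short-circuits
# file-based recovery. If primary and replica carry the same sync marker,
# they are known to hold the same documents even if their segment files differ.

def needs_file_copy(primary_sync_id, replica_sync_id) -> bool:
    """File-based recovery is skipped only when both sides hold the same,
    non-empty marker; any write since the flush clears the marker."""
    return primary_sync_id is None or primary_sync_id != replica_sync_id

print(needs_file_copy("abc123", "abc123"))  # False: idle shard, copy skipped
print(needs_file_copy("abc123", "xyz789"))  # True: markers differ
print(needs_file_copy(None, None))          # True: markers destroyed by writes
```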

Synthesizing the above then, in 5.x and prior, for a large shard we are often looking at full recoveries (even in cases where there might have been no writes!) which means copying the entire shard across the network. If there are many large shards, this can be time-consuming indeed.

In 6.0 we have added operation-based recovery. Now, when a replica falls offline, it will reach out to the primary and try to recover only the operations that it missed while it was offline. Whether or not this can succeed depends on how long the replica was offline and which operations are still present in the translog of the primary shard. We have increased the default retention policy on the translog to increase the likelihood of such a recovery occurring. If an operation-based recovery cannot succeed, the recovery falls back to a file-based recovery. We think that this will have a material impact on this problem.
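A sketch of that decision (illustrative Python; the checkpoint and the translog map below are simplifications of the real sequence-number machinery, whose retention is governed by settings such as `index.translog.retention.size` and `index.translog.retention.age`):

```python
# Sketch (not Elasticsearch code): the 6.0 choice between operation-based
# and file-based recovery, driven by what the primary's translog retains.

def recover(replica_checkpoint: int, translog: dict):
    """translog maps sequence number -> operation retained on the primary.
    If every op after the replica's checkpoint is still retained, replay
    just those ops; otherwise fall back to copying segment files."""
    highest = max(translog, default=replica_checkpoint)
    missing = range(replica_checkpoint + 1, highest + 1)
    if all(seq in translog for seq in missing):
        return ("ops", [translog[seq] for seq in missing])
    return ("files", None)

translog = {3: "index doc3", 4: "index doc4", 5: "delete doc1"}
# Replica was briefly offline: replay only the missed operations.
print(recover(2, translog))  # ('ops', ['index doc3', 'index doc4', 'delete doc1'])
# Replica was offline too long: ops 1-2 were trimmed, so fall back to files.
print(recover(0, translog))  # ('files', None)
```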

