Moving shards is slow - Solution

I would like to add my solution to Moving shards is slow

The solution has two parts:

  • Ensure the primaries are balanced over all the nodes so the recovery load is well distributed. This was done with an external shard balancer: It swaps primary shards with replicas (using a series of moves) until they are evenly distributed. This increases the total shard movement, but it decreases recovery time significantly when recovery is needed (see the first sketch after this list).
  • Recover only one shard per node at a time (and only one primary per node at a time). Something terrible happens when you ask a node to recover or move more than one shard at a time. Maybe it is a classic case of "it takes twice as long to do twice as much", but it seems much worse than that. When shards must recover, we track which pair of nodes (source and destination) is involved and ensure neither node is already busy with another shard recovery. By holding back recoveries until both nodes are free, the overall recovery time is reduced significantly. This check is also done by the same external shard balancer (see the second sketch after this list).
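
For the first point, a primary and its replica cannot trade places in a single reroute command, so one way to swap them is a series of moves through a third node. This is only a minimal sketch, assuming a plain Python client over the REST API; the host, index, and node names are placeholders, not my actual balancer code:

```python
import requests

ES = "http://localhost:9200"  # hypothetical cluster address


def reroute_move(index, shard, from_node, to_node):
    """Ask the master to relocate one shard copy between nodes."""
    body = {"commands": [{"move": {
        "index": index, "shard": shard,
        "from_node": from_node, "to_node": to_node,
    }}]}
    requests.post(ES + "/_cluster/reroute", json=body).raise_for_status()


def wait_until_relocation_done():
    """Block until the cluster reports no relocating shards."""
    requests.get(ES + "/_cluster/health", params={
        "wait_for_no_relocating_shards": "true",
        "timeout": "60m",
    })


def swap_primary_with_replica(index, shard, node_a, node_b, node_c):
    """Swap the primary on node_a with its replica on node_b, using node_c as scratch space."""
    reroute_move(index, shard, node_b, node_c)   # replica:  b -> c
    wait_until_relocation_done()
    reroute_move(index, shard, node_a, node_b)   # primary:  a -> b
    wait_until_relocation_done()
    reroute_move(index, shard, node_c, node_a)   # replica:  c -> a
    wait_until_relocation_done()
```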
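
For the second point, here is a minimal sketch of the "only move when both ends are idle" check, again as an illustration rather than the actual balancer code. It reads the active recoveries and only issues a move when neither the source nor the destination node is already busy:

```python
import requests

ES = "http://localhost:9200"  # hypothetical cluster address


def busy_nodes():
    """Return the set of node names involved in an unfinished recovery."""
    recoveries = requests.get(ES + "/_cat/recovery",
                              params={"format": "json"}).json()
    nodes = set()
    for r in recoveries:
        if r["stage"] == "done":
            continue
        nodes.add(r["source_node"])
        nodes.add(r["target_node"])
    # snapshot/local recoveries report no peer source node
    return nodes - {"", "n/a"}


def maybe_move(index, shard, from_node, to_node):
    """Issue the move only if both ends of the transfer are idle."""
    if {from_node, to_node} & busy_nodes():
        return False  # skip for now; try again on the next balancer pass
    body = {"commands": [{"move": {
        "index": index, "shard": shard,
        "from_node": from_node, "to_node": to_node,
    }}]}
    requests.post(ES + "/_cluster/reroute", json=body).raise_for_status()
    return True
```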

Using these techniques, shard recovery was reduced from a week (or never, in some pathological shard/node cases) to a couple of days. The cluster went from constantly yellow to almost always green. Life is good now.

You can also turn off ingestion during recovery: It noticeably improves recovery time, but that is common knowledge.
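
Stopping ingestion is normally done on the application side; one alternative from the cluster side is a temporary write block on the indices involved. The `_all` wildcard below is an assumption, adjust it to the real index pattern:

```python
import requests

ES = "http://localhost:9200"  # hypothetical cluster address

# Block writes (indexing) while recovery runs; reads still work.
requests.put(ES + "/_all/_settings", json={"index.blocks.write": True}).raise_for_status()

# ... wait for the cluster to go green ...

# Re-enable writes afterwards.
requests.put(ES + "/_all/_settings", json={"index.blocks.write": False}).raise_for_status()
```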

Note: The cluster is about 40 EC2 nodes, 4K shards (2K primary shards), median shard size is about 20 GB, and the primaries total about 35 TB on disk.

What version are you running?

With things like (sync) flush it shouldn't take a week to handle that many shards.

The original post was version 6.1.2. My current version is 6.5.4.

During big recoveries, like upgrading half the nodes, the shards would start out recovering quickly and then, after an hour or so, slow to a crawl: Down to kilobytes per second. I could not get the recovery rate to go any faster, and the math indicated it would take months at that rate. I think I covered this in my other post. Older versions of ES did not have this problem; with version 6 my cluster was always yellow: There were always some replicas stuck initializing.
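
For reference, these are the stock recovery-throttle settings that usually get checked first; they did not fix the crawl described above, but this is where they live (values here are illustrative, not recommendations):

```python
import requests

ES = "http://localhost:9200"  # hypothetical cluster address

# The standard knobs for recovery speed and concurrency.
requests.put(ES + "/_cluster/settings", json={"transient": {
    "indices.recovery.max_bytes_per_sec": "200mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 2,
}}).raise_for_status()
```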

For the longest time I thought the slowness was caused by EC2 network limits between machines: Tests transferring files of random bytes between machines looked good, but maybe AWS has a hidden network credit system that allows bursty transfers but clamps down on long-term high-volume network activity between machines. A few months ago the lack of shard recovery became dire; I had to figure out what the problem was, and somehow measure the properties of this "AWS network clamp" so I could work around it.

So I opened multiple windows to all my nodes, stopped all the ingestion, installed network monitoring, and watched the network over several days while shards got allocated: The AWS network clamp never appeared. The nodes were transferring data, recovering at just kilobytes per second, yet there was still network capacity for file transfers between nodes. So I tried manually requesting shard allocation, one at a time: It was fast. I concluded there must be some pathology when I ask for more than one shard to allocate at a time.
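
"Manually requesting shard allocation" here means a reroute command for a single unassigned replica. A minimal sketch of what "one at a time" looked like; the index, shard, and node names are placeholders:

```python
import requests

ES = "http://localhost:9200"  # hypothetical cluster address

# Ask the master to start recovering one specific unassigned replica on one node.
body = {"commands": [{"allocate_replica": {
    "index": "logs-2019.02",   # placeholder index name
    "shard": 3,                # placeholder shard number
    "node": "node-17",         # placeholder destination node
}}]}
requests.post(ES + "/_cluster/reroute", json=body).raise_for_status()
```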

Due to my cluster's particular configuration, all the primaries end up on just a few machines. During recovery, those nodes are doing all the work, and if I ask the cluster to allocate two replicas whose primaries are on the same node, then recovery goes extra slow; it takes a day or more, or the allocation simply cancels after some number of hours: The replicas move from initializing back to unassigned.

I added code to my shard balancer to account for this limitation, ensuring a node is involved in recovering just one shard at a time. Replication is now fast. My cluster is always green.
