Moving shards is slow - Solution

klahnakoski · April 16, 2020, 9:52pm

I would like to add my solution to Moving shards is slow

The solution has two parts:

Ensure the primaries are balanced over all the nodes so the recovery load is well distributed. This was done with an external shard balancer: It swaps primary shards with replicas (using a series of moves) until they are evenly distributed. This increases the total shard movement, but decreases recovery time significantly when it is needed.
Recover only one-shard-per-node-at-a-time, only one-primary-per-node-at-a-time. Something terrible happens when you ask a node to recover/move more than one shard at a time. Maybe it is a classic "it takes twice as long to do twice as much", but it seems much worse than that. When shards must recover, we track which pair of nodes (source node and destination node) is involved and ensure those nodes are not already busy with another shard recovery. By blocking recoveries until the nodes are free, the overall recovery time is reduced significantly. This technique is also done with the same external shard balancer.

Using this technique, shard recovery can be reduced from a week (or never, in some pathological shard/node cases), to a couple of days. The cluster went from constantly yellow, to almost always green. Life is good now.

You can also turn off ingestion during recovery: It noticeably improves recovery time, but that is common knowledge.

Note: The cluster is about 40 EC2 nodes, 4K shards (2K primary shards), median shard size is about 20Gb, total primaries sum to 35Tb on disk.

warkolm · April 16, 2020, 10:02pm

What version are you running?

With things like (sync) flush it shouldn't take a week to handle that many shards.

klahnakoski · April 17, 2020, 11:53am

The original post was version 6.1.2. My current version is 6.5.4.

During big recoveries, like upgrading half the nodes, the shards would start out recovering quickly and then an hour, or so, slow to a crawl: Down to kilobytes-per-second. I could not get my recovery rate to be any faster, the math would indicate it would take months at this rate. I think I covered this in my other post. The older versions of ES did not have this problem; with version 6 my cluster was always yellow: There were always some replicas that were constantly initializing.

For the longest time I thought the slowness was caused by EC2 network limits between machines: Testing random bytes file transfer between machines looked good, but maybe AWS has a hidden network credit system that allowed bursty transfers, but clamped down on long-term high-volume network activity between machines. A few months ago the lack-of-shard-recovery become dire; I had to figure out what the problem was, and somehow measure properties of this "AWS network clamp" so I could work around it.

So I opened multiple windows to all my nodes, stop all the ingestion, installed network monitoring and watched the network over days while nodes got allocated: The AWS network clamp never appeared. The nodes were transferring data, recovering at just kilobytes per second, but there was still network capacity to do file transfers between nodes. So I try manually requesting shard allocation, one-at-a-time: It was fast. I concluded there must be some pathology when I ask for more than one shard to allocate at a time.

Due to my cluster's particular configuration, all the primaries end up on just a few machines. During recovery, those nodes are doing all the work, and if I ask the cluster to allocate two replicas that have primaries on the same node, then recovery goes extra slow; it take a day, or more, or the allocation just cancels after some number of hours: The replicas move from initializing to unassigned.

I added code to my shard balancer to consider this limitation: Ensuring a node is involved in recovering just one shard at a time. Replication is fast. My cluster is always green.

system · May 15, 2020, 11:54am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Moving shards is slow Elasticsearch	15	5285	May 10, 2018
Shard re-allocation taking a very long time Elasticsearch	16	7531	April 15, 2019
Increasing shard relocation speed Elasticsearch	7	28697	July 5, 2017
Is it me or is ES 1.6.0 node startup/recovery slower then before? Elasticsearch	15	1078	July 6, 2017
Shard replication/recovery going slow Elasticsearch	2	1090	August 17, 2018

Moving shards is slow - Solution

Related topics