Shard Relocation Speed

I'm trying to get a good grasp on how long I can expect a replica shard to take to transfer between nodes.

We're running ES 6.2 on Windows, and I ran a few tests.

First I used the "/_cluster/reroute" API to get a good baseline for how long it takes to move a shard from one node to another. I picked an index that was no longer being written to, selected a replica shard on a particular node, and rerouted it to another node (making sure no other shards were moving around at the same time). When I did this, 8GB shards consistently took about 8 minutes. I used the recovery API to monitor the relocation, and here's a snapshot of it, mid-transfer:

"indexname": {
    "shards": [
      {
        "id": 61,
        "type": "PEER",
        "stage": "INDEX",
        "primary": false,
        "start_time_in_millis": 1528993584252,
        "total_time_in_millis": 384957,
        "source": {
          "id": "IeDiGU9LStCvP0rqzwhLJg",
          "host": "shost",
          "transport_address": "1.1.1.1:9300",
          "ip": "1.1.1.1",
          "name": "shost"
        },
        "target": {
          "id": "vH5lZXG7T_SargmNTQc6xQ",
          "host": "thost",
          "transport_address": "2.2.2.2:9300",
          "ip": "2.2.2.2",
          "name": "thost"
        },
        "index": {
          "size": {
            "total_in_bytes": 8268932538,
            "reused_in_bytes": 0,
            "recovered_in_bytes": 6892110768,
            "percent": "83.3%"
          },
          "files": {
            "total": 118,
            "reused": 0,
            "recovered": 117,
            "percent": "99.2%"
          },
          "total_time_in_millis": 384939,
          "source_throttle_time_in_millis": 0,
          "target_throttle_time_in_millis": 0
        },
        "translog": {
          "recovered": 0,
          "total": 0,
          "percent": "100.0%",
          "total_on_start": 0,
          "total_time_in_millis": 0
        },
        "verify_index": {
          "check_index_time_in_millis": 0,
          "total_time_in_millis": 0
        }
      }
    ]
}
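
For reference, the reroute command was along these lines (console syntax, using the index, shard, and node names from the snapshot above):

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "indexname",
        "shard": 61,
        "from_node": "shost",
        "to_node": "thost"
      }
    }
  ]
}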

I'm not seeing any throttling, and everything appears to go smoothly (though slowly). Nearly all of the time is spent on the last few files, but that makes sense, since they are the largest. When I move a smaller shard, it stays close to that same ~1GB-per-minute rate.

However, I know that our hardware can handle much faster data transfers (fast network, SSD drives). To get an idea of the fastest our hardware could manage, I manually copied data between the machines, and that same 8GB took under 30 seconds.

I do expect there to be some overhead with ES, but this seems extreme.

To get to the point where we are throttled by the "indices.recovery.max_bytes_per_sec" setting, I have to bring it down below the speeds we are already seeing, i.e. below ~20MB/s. Anything higher than that has no effect, so something else is slowing things down. That setting is the most obvious first step, but I think I have eliminated it as a factor.
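
For completeness, here's how I've been adjusting it (the "500mb" value is just an example):

PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "500mb"
  }
}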

I'm hoping there are some ES settings I just haven't found yet. I've searched around for some, but most of what turns up has since been deprecated. Any suggestions would be appreciated.

As we are built today, overall cluster performance is... okay (though it would be vastly improved if we could make better use of our hardware). We expect groups of nodes to go down on a semi-regular basis, and when they come back it takes about 30 minutes to transfer back their share of shards, which is not quite ideal. What's more, our shards are small on average compared to what I read about other people's clusters. I would really like to decrease the shard count in favor of bigger shards, but I'm worried about what would happen to our stability if individual shards started taking 20 minutes to move between nodes (I know about increasing the concurrency settings, shown below, but that will not improve the minimum recovery time for a single shard).
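
For reference, by concurrency settings I mean the allocation recovery settings, something like this (values purely illustrative):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 4
  }
}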

As a secondary question: does the "detailed" parameter for the _recovery API just not work on Windows? It has no effect for me.
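
The call I'm making is simply:

GET /_recovery?detailed=true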

I can't answer your OP but I can say that it looks like ?detailed has no effect on a GET _recovery query on any platform. This is #28910.

Even though the APIs are claiming that no throttling is happening, it really feels like there is some throttling in place. Throughput climbs right up to 25MB/s (disk I/O) but never above it.
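
To be clear, the throttle counters I'm going by are "source_throttle_time_in_millis" and "target_throttle_time_in_millis" in the recovery API output, which stay at zero throughout:

GET /_recovery?active_only=true&human=true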

25MB/s also happens to be the default for the since-deprecated "indices.store.throttle.max_bytes_per_sec" setting. Does anyone know enough about that change to tell me whether there could be some correlation there?

Edit: Actually, I was looking at network read/write stats that were topping out at 25MB/s. It still looks like throttling, but that store setting would be unrelated. I haven't found any network throttling setting in ES, though.
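
For anyone wanting to rule out a leftover throttle setting, the explicit transient/persistent settings on the cluster can be listed with:

GET /_cluster/settings?flat_settings=true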

One more thing that might be worth mentioning: this 25MB/s network limit appears to be per request. If I send two shards from two different nodes to a single node, the target node takes on twice the network traffic and disk input that it did when only one shard was being sent its way.
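
That test was just a reroute with two move commands, along these lines (the second shard number and source node here are hypothetical):

POST /_cluster/reroute
{
  "commands": [
    { "move": { "index": "indexname", "shard": 61, "from_node": "shost",  "to_node": "thost" } },
    { "move": { "index": "indexname", "shard": 62, "from_node": "shost2", "to_node": "thost" } }
  ]
}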

If I can't find a way around this limitation, then for the sake of rebalancing, our ideal cluster would have a high number of small shards, possibly keeping every shard to around 1GB, even though that would mean adding tens of thousands of shards to our already five-digit shard count... which is actually the opposite of the direction I wanted to take the cluster. I was hoping to reduce the number of primaries and increase the replica count.

Have you tried increasing the "indices.recovery.max_bytes_per_sec" setting and checked whether that helps? We cannot reproduce this behavior in our test environment. On a 10 Gbit network, we got 96MB/s throughput for moving a single shard from one node to another.

For moving a single shard, that setting has no effect, simply because our throughput is so low to start with that I would have to make the setting even lower to see a change (that is to say: we are seeing ~20MB/s, and if I set max_bytes_per_sec to anything lower than that, I do see throttling happen).

Granted, there is still value in having "indices.recovery.max_bytes_per_sec" set much higher than that, because it is only individual actions/threads that appear to be locked to ~20MB/s. When multiple shards are being sent to a node at once (from multiple nodes), we have no trouble reaching 200MB/s (or higher) of incoming traffic on a single node, so long as "indices.recovery.max_bytes_per_sec" is set high enough to allow it.

On a 10 Gbit network, we got 96MB/s throughput for moving a single shard from one node to another.

That's pretty interesting... is your test environment running on Windows? 96MB/s is several times faster than what I am seeing, so I doubt the gap comes down to a difference in how we are collecting our performance metrics.

Are there any other factors that we could reasonably assume might be contributing to our poor performance? (I would be more than happy to test out a few different scenarios.) Doc size, segment count (which I'm sure has some effect, though I've already tried force merging and saw no noticeable change), cluster size, templates, shard size... overall shard count? Or any other configuration knob we could turn, even just as a long shot.
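
The force merge I tried was along these lines, run while the index was idle:

POST /indexname/_forcemerge?max_num_segments=1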

Unfortunately, this is the single factor pushing us toward smaller shards. In every other respect we would prefer fewer, larger shards.

Increasing individual shard transfer speeds up to what you are seeing would make a world of difference for us. As we are now, recoveries are a bit scary: the high watermark measures are almost never fast enough to prevent the flood-stage watermark from being hit, rebalancing one of our big shards while it is being written to can take hours, and actions like shrink or reindex are pretty much off the table, even during off-hours.

Are there any other factors that we could reasonably assume might be contributing to our poor performance? (I would be more than happy to test out a few different scenarios.) Doc size, segment count (which I'm sure has some effect, though I've already tried force merging and saw no noticeable change), cluster size, templates, shard size... overall shard count? Or any other configuration knob we could turn, even just as a long shot.

Segment count and size are the only things that could plausibly affect this. Segment files are sent sequentially over the wire (files of 512KB or less go in a single request; larger files are chunked into 512KB pieces, and we start with the smallest file first). I cannot see any other throttling being done by the system besides indices.recovery.max_bytes_per_sec. The implementation mostly reads the files from disk, sends them over the wire, checksums them, and writes them to disk on the target.
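
If you want to see how a shard breaks down into segments (and therefore into the files sent during recovery), the cat segments API shows per-segment sizes:

GET /_cat/segments/indexname?v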

is your test environment running on Windows?

No, we've not tested this on Windows. Can you perhaps check if you can reproduce your findings in a different environment?
