Shards Taking a Long Time to Move Between Nodes - Cloud [7.1.1]

We are running a 17-node cluster with 10 dedicated hot nodes, 3 master nodes, and 4 dedicated warm nodes.

We have an index with 5 primary shards and a replication factor of 1 that writes only to the hot nodes, as dictated by an index template.

We have implemented a lifecycle policy so that when the primary shards reach 50GB of storage, the index rolls over and is then shrunk, reducing the shard count to 1 primary shard and 1 replica. The shrunken index lives on the warm nodes.

Before it is shrunk, the index is moved onto the warm nodes, one copy of each shard is reallocated to a single node, and then the shrink process begins.
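To give a concrete picture, the policy looks roughly like the following sketch (the policy name is illustrative and the JSON is reconstructed from the description above rather than copied from our cluster):

PUT _ilm/policy/hot-warm-shrink-policy
{
  "policy" : {
    "phases" : {
      "hot" : {
        "actions" : {
          "rollover" : { "max_size" : "50gb" }
        }
      },
      "warm" : {
        "actions" : {
          "allocate" : { "require" : { "box_type" : "warm" } },
          "shrink" : { "number_of_shards" : 1 }
        }
      }
    }
  }
}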

This all works, but the issue is that we have a high ingest rate (~20k documents per second) on the hot nodes, and the recovery speed is only 40Mb/s per recovery happening on a node, observed by running nload -u M on the nodes. This causes the cluster to struggle, as it doesn't finish the previous steps before the next shrink index is created and that index is moved to the warm nodes.

An example would be:

warm-0 -> warm-1
warm-2 -> warm-1

warm-1 would have an incoming rate of ~80Mb/s

warm-3 -> warm-4

warm-4 would have an incoming rate of ~40Mb/s

Both of these are the same size of VM. The first example suggests that a VM can handle at least 80Mb/s incoming - however, this is unattainable on warm-4.

Our cluster configuration settings are as follows:

{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "cluster_concurrent_rebalance" : "2",
          "node_concurrent_recoveries" : "2"
        }
      }
    },
    "indices" : {
      "recovery" : {
        "max_bytes_per_sec" : "1024mb",
        "max_concurrent_file_chunks" : "5"
      }
    }
  }
}

Our elasticsearch.yml is as follows:

cluster.name: "cluster"
node.name: "warm-2"
path.logs: /var/log/elasticsearch
path.data: /datadisks/disk1/elasticsearch/data
discovery.seed_hosts: ["master-0","master-1","master-2"]
cluster.initial_master_nodes: ["master-0","master-1","master-2"]
node.master: false
node.data: true
network.host: [_site_, _local_]
node.max_local_storage_nodes: 1
node.attr.fault_domain: 0
node.attr.update_domain: 12
node.attr.box_type: warm
cluster.routing.allocation.awareness.attributes: fault_domain,update_domain
xpack.license.self_generated.type: basic
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/elastic-certificates.p12

An image of one of the nodes pushing a shard into another:

Hi @ljefferies and welcome!

40MB/s (I assume you mean megabytes not megabits) is a suspicious number because that's the default for the rate limit on recoveries.

Does the indices recovery API report that any throttling is taking place on these recoveries? We're looking for nonzero values for source_throttle_time_in_millis or target_throttle_time_in_millis.
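For example, something like this (the index name is a placeholder) will list each recovery for that index along with those throttle fields:

GET my-index-000001/_recovery?human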

Hi

Yeah, I meant megabytes.

The indices recovery API on one of the indices currently being moved shows:

"total_time" : "218ms",
"total_time_in_millis" : 218,
"source_throttle_time" : "-1",
"source_throttle_time_in_millis" : 0,
"target_throttle_time" : "-1",
"target_throttle_time_in_millis" : 0

I have tried to set indices.recovery.max_bytes_per_sec to -1.

This was after seeing this code in the Elasticsearch repository:

this.maxBytesPerSec = INDICES_RECOVERY_MAX_BYTES_PER_SEC_SETTING.get(settings);
if (maxBytesPerSec.getBytes() <= 0) {
    rateLimiter = null;
} else {
    rateLimiter = new SimpleRateLimiter(maxBytesPerSec.getMbFrac());
}
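For reference, the change was applied dynamically along these lines:

PUT _cluster/settings
{
  "persistent" : {
    "indices.recovery.max_bytes_per_sec" : "-1"
  }
}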

Ok, it's not the rate-limiter that's limiting the traffic. I wonder if it's CPU-bound. What does the CPU usage look like on both the sending and receiving node? Are you using TLS? Are there any other settings in elasticsearch.yml, particularly any to do with networking (e.g. compression?). Can you share the output of GET _nodes/hot_threads?threads=999999?

Does changing indices.recovery.max_concurrent_file_chunks have any effect on the recovery speed?
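For example, something like this would adjust it dynamically (the value here is just an example):

PUT _cluster/settings
{
  "persistent" : {
    "indices.recovery.max_concurrent_file_chunks" : "2"
  }
}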

The CPU on warm-3 peaked at ~16%.


This particular node was moving 1 shard and receiving 1 shard.

We are using TLS, configured as shown in the elasticsearch.yml in the initial post.

Those are all the settings we use on all of our nodes, only changing the node name or attributes when required.

No compression settings have been activated; I checked this through the GET /_cluster/settings?include_defaults API.
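For reference, the check was along these lines, searching the defaults section of the response for transport.compress (which defaults to false):

GET _cluster/settings?include_defaults=true&flat_settings=true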

I checked the GET _nodes/hot_threads?threads=999999

It is too big to send to you on this topic - do you have an email address?

Part of it is this, if it helps:

2.8% (14.2ms out of 500ms) cpu usage by thread 'elasticsearch[warm-2][generic][T#4]'

Changing indices.recovery.max_concurrent_file_chunks doesn't make any difference to the speed at which it recovers.

eth0 still remains at 80MB/s when configured with "indices.recovery.max_concurrent_file_chunks": "2".

Sure, I'm david.turner@elastic.co. Let me know if anything needs redacting (node names/IP addresses/etc).

Is there anything in particular that you want me to send to you?

I'm unable to send you the entire log.

Hmm, tricky to say without seeing it 🙂 Basically we're looking for the bottleneck.

I was particularly interested to see if there were any threads that were very busy doing network-related or recovery-related things, because this'd tell us that we're CPU-bound and would also tell us whether the bound is on the sending or receiving side.

I'd also check something like netstat to see if there's a queue building up at the TCP level.

Some further thoughts after discussions with colleagues:

When you say you've "implemented a lifecycle policy" do you mean you're using ILM?

How long is it taking you to build this index up to 5 shards * 50GB == 250GB? Even at 40MB/s it should take a little under 2h to relocate this much data.

Why do you only have 5 shards when you have 10 hot nodes? Do you have 1 hot shard per hot node?

When you're relocating the shards onto a single warm node for shrinking, is this happening directly from the hot nodes in a single step or is it relocating all the shards across all the warm nodes and then moving them again for shrinking? I note that you're asking about recoveries between warm nodes, which suggests that it's not a direct transfer.

When each shrink action doesn't have time to complete before the next one starts, why does the cluster struggle? Can you clarify what you mean here?

Yeah, we have created an ILM policy with a 50GB rollover.

10 shards in total (5 primaries and 5 replicas), with one shard copy on each of the hot nodes.

So what happens is that when the index reaches 50GB of primary storage on the hot nodes, it then reallocates all 10 shards to the warm nodes.

So roughly 10 shards of ~10GB each get moved to the warm nodes.

Once all the shards are off the hot nodes and onto the warm nodes, it then begins to allocate one copy of each shard (either primary or replica) onto a single node, so data-node-warm-0 could be given 5 shards. Now the shrink to 1 primary shard and 1 replica begins, which is also done at 40MB/s. This takes roughly 25 mins.
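For reference, this allocate-then-shrink step is roughly equivalent to the manual shrink procedure, i.e. something like the following (index and node names are illustrative):

PUT my-index-000001/_settings
{
  "index.routing.allocation.require._name" : "data-node-warm-0",
  "index.blocks.write" : true
}

POST my-index-000001/_shrink/shrunk-my-index-000001
{
  "settings" : {
    "index.number_of_shards" : 1,
    "index.number_of_replicas" : 1
  }
}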

It can struggle, as the ingest rate is ~20k/s on the hot nodes, so it can potentially get into a situation where the next hot index gets reallocated to the warm nodes before the warm nodes have finished dealing with the reallocation of the shards already inside the warm nodes.

As a side note, it also tries to rebalance the shards already on the warm nodes: when it begins to move all the shards to one node for shrinking, it creates an imbalance and then tries to reallocate some of the other shards, which are now 50GB apiece as each is a single shard. With our 40MB/s reallocation speed, this also takes 25 mins.

I hope that made sense.

Thanks

Yes, I'm still trying to understand what the "struggle" is here. What you describe seems fine to me.

Mainly, 40MB/s seems too slow to reallocate the shards.

This could cause problems where the previous shrink index hasn't finished before the next index is moved to the warm nodes, so more and more shards slowly pile up on the warm nodes because the reallocation and shrinking take too long.

I asked some colleagues about ILM's behaviour here and learned that it's ok to have these reallocate-shrink-reallocate processes happening concurrently and that this is not in itself a sign of a struggling cluster. Is the backlog really growing unboundedly rather than reaching an equilibrium? I'd like to understand a bit more about the timing of things too - each 50GB index needs to be reallocated twice and each takes ~25 minutes, but how often are you ingesting another 50GB?

I'm still interested in where the bottleneck is. The latest figures I have for examples of recovery speeds are in #36981, which introduced the max_concurrent_file_chunks setting. With this set to 4, TLS enabled, and compression disabled, those benchmarks were many times faster than what you're seeing in your cluster. There's a brief description of the machines used in that benchmark there. How do they compare to your setup?

Have you measured the latency and bandwidth between these nodes via other means? If the process were latency-bound I'd expect changing the number of concurrent chunks to have an impact, but you report it isn't. Still I think this is worth checking.

It can be anywhere between 15 minutes and 1 hour before it fills up 50GB and needs to be rolled over. If we could achieve the same speed as in #36981 then we'd be able to transfer it all in ~10 minutes.

The setup used is similar to theirs; the only differences are that we use 16GB of RAM and have a 3.5Gbps connection.

This is measuring the bandwidth between 2 warm nodes.

I have done a test this morning to see how fast the disk can write, and it shows that it can handle up to 190MBps, still significantly higher than the ~40MBps that we are currently getting.

This is also much closer to the values that were seen in #36981.

sudo sync; sudo dd if=/dev/zero of=/datadisks/disk1/tempfile bs=250M count=20; sudo sync
20+0 records in
20+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 27.4052 s, 191 MB/s

I'm not sure this is a robust benchmark - you wrote 5GB of data on a machine with 16GB of RAM, but the time reported didn't include the subsequent sync, so you might not have been counting the time taken to flush any cached data to disk. Still, I suspect the slow bit isn't writing to disk, because you've seen nodes with 80MBps incoming.

We might get a further clue from looking at the pattern of timing of the transport messages. I should warn you that during a recovery there's quite a lot of traffic, so these instructions will generate quite voluminous logs. You can enable transport logging of recovery messages like this:

PUT _cluster/settings
{
  "persistent" : {
    "transport.tracer.include": "internal:index/shard/recovery/*",
    "logger.org.elasticsearch.transport.TransportService.tracer": "TRACE"
  }
}

After a few seconds, you can disable them again like this:

PUT _cluster/settings
{
  "persistent" : {
    "transport.tracer.include": null,
    "logger.org.elasticsearch.transport.TransportService.tracer": null
  }
}

Then in the logs on the node holding the primary you should see messages like this:

[2019-06-25T10:48:15,475][TRACE][o.e.t.T.tracer           ] [node-0] [491][internal:index/shard/recovery/file_chunk] sent to [{node-2}{7BBRwJBYRvu2pvarFIG8dg}{2kRuSx2OTZm_7O0zbwryVg}{127.0.0.1}{127.0.0.1:9302}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true}] (timeout: [15m])
[2019-06-25T10:48:15,483][TRACE][o.e.t.T.tracer           ] [node-0] [491][internal:index/shard/recovery/file_chunk] received response from [{node-2}{7BBRwJBYRvu2pvarFIG8dg}{2kRuSx2OTZm_7O0zbwryVg}{127.0.0.1}{127.0.0.1:9302}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true}]

On the node holding the recovering replica they'll show up like this:

[2019-06-25T10:48:15,475][TRACE][o.e.t.T.tracer           ] [node-2] [491][internal:index/shard/recovery/file_chunk] received request
[2019-06-25T10:48:15,483][TRACE][o.e.t.T.tracer           ] [node-2] [491][internal:index/shard/recovery/file_chunk] sent response

Note the correlation in message ID (i.e. [491] in the example above). It'd be good to analyse the flow of these messages and their timestamps to see if the slow bit is on the sending side or the receiving side or the network in between.

I agree the 5GB test was not robust; however, I initially ran it with 50GB and it performed pretty much exactly the same. I think it was 193MBps.

I have looked at the logs and there is a 2ms delay between one node sending a chunk and the other node receiving it.

[2019-06-25T10:04:21,650][TRACE][o.e.t.T.tracer           ] [warm-2] [8899779][internal:index/shard/recovery/file_chunk] sent to [{warm-3}{xyz}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=13, ml.max_open_jobs=20, fault_domain=1, xpack.installed=true, box_type=warm}] (timeout: [15m])
[2019-06-25T10:04:21,652][TRACE][o.e.t.T.tracer           ] [warm-3] [8899779][internal:index/shard/recovery/file_chunk] received request

It would seem that this is unlikely to be the issue?

Sure, but there's a lot more information in these logs than that one data point. Is the receiver spending a long time between receiving the chunk and acking it? Is the sender spending a long time between receiving an ack for one chunk and sending the next? Basically, where is the time going? These chunks should be 512kB in size, to give you a sense of how this ties to the recovery speed.
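For a rough sense of scale (assuming the default 512kB chunk size and a single recovery stream): 40MB/s ÷ 512kB per chunk ≈ 80 chunks per second, i.e. one full send-ack round trip roughly every 12.5ms.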

What JDK are you using, by the way?

So the first 10 records are:

[2019-06-25T10:04:21,650][TRACE][o.e.t.T.tracer           ] [warm-2] [8899779][internal:index/shard/recovery/file_chunk] sent to [{warm-3}{xyz}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=13, ml.max_open_jobs=20, fault_domain=1, xpack.installed=true, box_type=warm}] (timeout: [15m])
[2019-06-25T10:04:21,654][TRACE][o.e.t.T.tracer           ] [warm-2] [8899779][internal:index/shard/recovery/file_chunk] received response from [{warm-3}{xyz}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=13, ml.max_open_jobs=20, fault_domain=1, xpack.installed=true, box_type=warm}]
[2019-06-25T10:04:21,662][TRACE][o.e.t.T.tracer           ] [warm-2] [8899780][internal:index/shard/recovery/file_chunk] sent to [{warm-3}{xyz}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=13, ml.max_open_jobs=20, fault_domain=1, xpack.installed=true, box_type=warm}] (timeout: [15m])
[2019-06-25T10:04:21,665][TRACE][o.e.t.T.tracer           ] [warm-2] [8899780][internal:index/shard/recovery/file_chunk] received response from [{warm-3}{xyz}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=13, ml.max_open_jobs=20, fault_domain=1, xpack.installed=true, box_type=warm}]
[2019-06-25T10:04:21,673][TRACE][o.e.t.T.tracer           ] [warm-2] [8899781][internal:index/shard/recovery/file_chunk] sent to [{warm-3}{xyz}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=13, ml.max_open_jobs=20, fault_domain=1, xpack.installed=true, box_type=warm}] (timeout: [15m])
[2019-06-25T10:04:21,676][TRACE][o.e.t.T.tracer           ] [warm-2] [8899781][internal:index/shard/recovery/file_chunk] received response from [{warm-3}{xyz}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=13, ml.max_open_jobs=20, fault_domain=1, xpack.installed=true, box_type=warm}]
[2019-06-25T10:04:21,683][TRACE][o.e.t.T.tracer           ] [warm-2] [8899782][internal:index/shard/recovery/file_chunk] sent to [{warm-3}{xyz}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=13, ml.max_open_jobs=20, fault_domain=1, xpack.installed=true, box_type=warm}] (timeout: [15m])
[2019-06-25T10:04:21,686][TRACE][o.e.t.T.tracer           ] [warm-2] [8899782][internal:index/shard/recovery/file_chunk] received response from [{warm-3}{xyz}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=13, ml.max_open_jobs=20, fault_domain=1, xpack.installed=true, box_type=warm}]
[2019-06-25T10:04:21,692][TRACE][o.e.t.T.tracer           ] [warm-2] [8899783][internal:index/shard/recovery/file_chunk] sent to [{warm-3}{xyz}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=13, ml.max_open_jobs=20, fault_domain=1, xpack.installed=true, box_type=warm}] (timeout: [15m])
[2019-06-25T10:04:21,695][TRACE][o.e.t.T.tracer           ] [warm-2] [8899783][internal:index/shard/recovery/file_chunk] received response from [{warm-3}{xyz}{***.***.***.***}{***.***.***.***:9300}{ml.machine_memory=16794214400, update_domain=13, ml.max_open_jobs=20, fault_domain=1, xpack.installed=true, box_type=warm}]

This looks like there is a 10-12ms delay before the next chunk is sent. So it sends roughly 100 chunks of 256KB a second?

That would translate to about 25MBps, which seems closer to what we are achieving.