Elasticsearch 6.3.0 shard recovery is slow

Moving from a 2.4.6 cluster to a 6.3.0 cluster, we have noticed that shard recovery is a lot slower. On the 2.4.6 cluster, shard recovery takes ~3 minutes; on the 6.3.0 cluster it takes ~9-11 minutes. The clusters are identical in topology, JVM, OS, data, and index sizes. Why are the shard recovery times so different between the versions?

Settings on 2.4.6
cluster.routing.allocation.node_initial_primaries_recoveries: 5
cluster.routing.allocation.node_concurrent_recoveries: 5
indices.recovery.max_bytes_per_sec: 400mb
indices.recovery.concurrent_streams: 5

Settings on 6.3.0
cluster.routing.allocation.node_initial_primaries_recoveries: 5
cluster.routing.allocation.node_concurrent_incoming_recoveries: 5
cluster.routing.allocation.node_concurrent_outgoing_recoveries: 5
cluster.routing.allocation.node_concurrent_recoveries: 5
cluster.routing.allocation.cluster_concurrent_rebalance: 5
indices.recovery.max_bytes_per_sec: 400mb

Index Information
1 index 5 shards 1 replica
Shard size is ~9 GB per shard

I have taken a look at the _recovery API and nothing stands out.
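For reference, the stats below came from a call along these lines (host and index name are placeholders for ours):

curl -s 'localhost:9200/my-index/_recovery?human&detailed=true'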

"index": {
	"size": {
		"total": "8.9gb",
		"total_in_bytes": 9587011676,
		"reused": "0b",
		"reused_in_bytes": 0,
		"recovered": "8.9gb",
		"recovered_in_bytes": 9587011676,
		"percent": "100.0%"
	},
	"files": {
		"total": 107,
		"reused": 0,
		"recovered": 107,
		"percent": "100.0%"
	},
	"total_time": "9.3m",
	"total_time_in_millis": 560295,
	"source_throttle_time": "0s",
	"source_throttle_time_in_millis": 0,
	"target_throttle_time": "0s",
	"target_throttle_time_in_millis": 0
},
"translog": {
	"recovered": 33530,
	"total": 33530,
	"percent": "100.0%",
	"total_on_start": 33512,
	"total_time": "2.3s",
	"total_time_in_millis": 2335
},
"verify_index": {
	"check_index_time": "0s",
	"check_index_time_in_millis": 0,
	"total_time": "0s",
	"total_time_in_millis": 0
}

@ctluce Did you recover brand new replicas? Do you have recovery stats on 2.4?

I'm looking at these 3 cases for shard recovery.

  1. Node joining cluster
  2. Node leaving cluster
  3. Index where replicas are increased from 0 to 1

In each of these cases, the shard recovery times are about the same within a given cluster version (~3 minutes on 2.x, ~9-11 minutes on 6.x).
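For reference, I watched the per-shard recovery stage and bytes recovered during each test with the cat recovery API, roughly like this (host is a placeholder):

curl -s 'localhost:9200/_cat/recovery?v'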

2.4.6 _recovery (picked longest time)

"index": {
	"size": {
		"total": "8.4gb",
		"total_in_bytes": 9097095776,
		"reused": "0b",
		"reused_in_bytes": 0,
		"recovered": "8.4gb",
		"recovered_in_bytes": 9097095776,
		"percent": "100.0%"
	},
	"files": {
		"total": 107,
		"reused": 0,
		"recovered": 107,
		"percent": "100.0%"
	},
	"total_time": "4.1m",
	"total_time_in_millis": 251234,
	"source_throttle_time": "2.8s",
	"source_throttle_time_in_millis": 2813,
	"target_throttle_time": "1.6s",
	"target_throttle_time_in_millis": 1645
},
"translog": {
	"recovered": 2,
	"total": 2,
	"percent": "100.0%",
	"total_on_start": 0,
	"total_time": "863ms",
	"total_time_in_millis": 863
},
"verify_index": {
	"check_index_time": "0s",
	"check_index_time_in_millis": 0,
	"total_time": "0s",
	"total_time_in_millis": 0
}

@ctluce Thanks for the stats.

I have noticed that network throughput during shard recovery differs between the cluster versions: 2.x has a higher throughput than 6.x. With the same settings and hardware on both clusters, I would expect it to be similar. On the 6.x cluster, shard recovery averages ~1 GB/minute, which lines up with what the recovery API is showing. ~1 GB recovered per minute is ~16-17 MB/sec, which doesn't line up with indices.recovery.max_bytes_per_sec: 400mb.
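As a quick sanity check against the _recovery output above: 9,587,011,676 bytes recovered over 560,295 ms works out to roughly 17 MB/s (about 16 MiB/s), which matches the ~1 GB/minute observation.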

2.4.6 network during shard recovery bytes/sec

6.3.0 network during shard recovery bytes/sec

After looking through the Elasticsearch documentation again, I'm not seeing any additional settings I'm missing that would affect recovery speed. I have tried setting indices.recovery.max_bytes_per_sec to 1gb: no change in network throughput, and no throttling reported. Looking at RecoverySettings, setting it to a negative byte value should leave recovery without any rate limiter at all. If I understand that code right, with no rate limiter it should just send data without ever trying to throttle the recovery. No matter what I set that setting to, the network throughput doesn't change and recovery runs at the same speed.
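For completeness, this is roughly how I've been applying the setting at runtime (host and value are placeholders for what I tested):

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "1gb"
  }
}'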

I have also upgraded the cluster to 6.3.1 on the chance something might change; there was no difference in shard recovery speed.

One of the differences between 2.x and 6.x is that 2.x sends file chunks in parallel (configured by concurrent_streams), making use of the 2 connections that are established between each pair of nodes for recovery. Could there be a per-connection bandwidth limit in your network? Note that in internal testing with indices recovery throttling disabled, we achieved a throughput of 96 MB/s (on a 10 Gbit network).
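One quick way to check for a per-connection cap would be to compare a single iperf3 stream against several parallel streams between two of the nodes, e.g. (target host is a placeholder):

iperf3 -c target-host -P 1 -t 30
iperf3 -c target-host -P 5 -t 30

If the aggregate bandwidth of the parallel run is much higher than the single stream, a per-connection limit is likely in play.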


Thanks for the information, I'm looking into that now.

Some details about how the cluster is set up: it's a 6-node cluster with 3 data nodes running on AWS. The instance class is c4.2xlarge, split across 3 AZs in a region. Enhanced networking is configured on the instances.

Based on the comment above about internal testing and the 10 Gbit network, I did another test with an instance class that should hopefully be more comparable to yours. I changed the instance class to c5.2xlarge and performed tests with recovery throttling disabled. Shard recovery time was the same as I was seeing on the c4.2xlarge instances.

I did the following test to make sure that the instance class supports a higher network throughput than what I'm seeing during shard recovery. The test was performed between 2 instances in different AZs.

[scrubbed@scrubbed scrubbed]# iperf3 -c scrubbed -P 1 -i 1 -t 60 -V
iperf 3.0.12
Linux scrubbed 4.14.55-62.37.amzn1.x86_64 #1 SMP Fri Jul 20 00:44:08 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Time: Thu, 16 Aug 2018 19:13:19 GMT
Connecting to host scrubbed, port 5201
      Cookie: scrubbed
      TCP MSS: 8949 (default)
[  4] local scrubbed port 41886 connected to scrubbed port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 60 second test
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   591 MBytes  4.95 Gbits/sec    1    839 KBytes       
[  4]   1.00-2.00   sec   589 MBytes  4.94 Gbits/sec   28    830 KBytes       
[  4]   2.00-3.00   sec   586 MBytes  4.92 Gbits/sec   13    787 KBytes       
[  4]   3.00-4.00   sec   591 MBytes  4.96 Gbits/sec   14    411 KBytes       
[  4]   4.00-5.00   sec   585 MBytes  4.91 Gbits/sec   49    804 KBytes       
[  4]   5.00-6.00   sec   584 MBytes  4.90 Gbits/sec   45   1.01 MBytes       
[  4]   6.00-7.00   sec   542 MBytes  4.55 Gbits/sec   85    629 KBytes       
[  4]   7.00-8.00   sec   535 MBytes  4.49 Gbits/sec   78    559 KBytes       
[  4]   8.00-9.00   sec   555 MBytes  4.66 Gbits/sec  124    524 KBytes       
[  4]   9.00-10.00  sec   550 MBytes  4.61 Gbits/sec  124    787 KBytes       
[  4]  10.00-11.00  sec   586 MBytes  4.92 Gbits/sec   52    795 KBytes       
[  4]  11.00-12.00  sec   582 MBytes  4.89 Gbits/sec   42    926 KBytes       
[  4]  12.00-13.00  sec   562 MBytes  4.72 Gbits/sec   55    419 KBytes       
[  4]  13.00-14.00  sec   562 MBytes  4.72 Gbits/sec   63    288 KBytes       
[  4]  14.00-15.00  sec   582 MBytes  4.89 Gbits/sec   71   1.05 MBytes       
[  4]  15.00-16.00  sec   536 MBytes  4.50 Gbits/sec   81    253 KBytes       
[  4]  16.00-17.00  sec   586 MBytes  4.92 Gbits/sec   46   1.01 MBytes       
[  4]  17.00-18.00  sec   552 MBytes  4.63 Gbits/sec   59    988 KBytes       
[  4]  18.00-19.00  sec   579 MBytes  4.85 Gbits/sec   64    446 KBytes       
[  4]  19.00-20.00  sec   551 MBytes  4.62 Gbits/sec   75   1.46 MBytes       
[  4]  20.00-21.00  sec   555 MBytes  4.66 Gbits/sec   80    236 KBytes       
[  4]  21.00-22.00  sec   568 MBytes  4.76 Gbits/sec   61    332 KBytes       
[  4]  22.00-23.00  sec   548 MBytes  4.59 Gbits/sec   68    472 KBytes       
[  4]  23.00-24.00  sec   532 MBytes  4.47 Gbits/sec   80    315 KBytes       
[  4]  24.00-25.00  sec   589 MBytes  4.94 Gbits/sec   47    865 KBytes       
[  4]  25.00-26.00  sec   546 MBytes  4.58 Gbits/sec   68   1.14 MBytes       
[  4]  26.00-27.00  sec   522 MBytes  4.38 Gbits/sec   92   1.51 MBytes       
[  4]  27.00-28.00  sec   555 MBytes  4.66 Gbits/sec   81    271 KBytes       
[  4]  28.00-29.00  sec   516 MBytes  4.33 Gbits/sec  127   1022 KBytes       
[  4]  29.00-30.00  sec   530 MBytes  4.45 Gbits/sec   80   1.48 MBytes       
[  4]  30.00-31.00  sec   579 MBytes  4.85 Gbits/sec   36    393 KBytes       
[  4]  31.00-32.00  sec   590 MBytes  4.95 Gbits/sec   30    428 KBytes       
[  4]  32.00-33.00  sec   594 MBytes  4.98 Gbits/sec   24    970 KBytes       
[  4]  33.00-34.00  sec   566 MBytes  4.75 Gbits/sec   89   1.43 MBytes       
[  4]  34.00-35.00  sec   551 MBytes  4.62 Gbits/sec   64    454 KBytes       
[  4]  35.00-36.00  sec   576 MBytes  4.83 Gbits/sec   40   1.46 MBytes       
[  4]  36.00-37.00  sec   591 MBytes  4.96 Gbits/sec   24    909 KBytes       
[  4]  37.00-38.00  sec   540 MBytes  4.53 Gbits/sec   68   1005 KBytes       
[  4]  38.00-39.00  sec   591 MBytes  4.96 Gbits/sec   23   1.02 MBytes       
[  4]  39.00-40.00  sec   591 MBytes  4.96 Gbits/sec   11    804 KBytes       
[  4]  40.00-41.00  sec   591 MBytes  4.96 Gbits/sec   30    428 KBytes       
[  4]  41.00-42.00  sec   559 MBytes  4.69 Gbits/sec  100    454 KBytes       
[  4]  42.00-43.00  sec   500 MBytes  4.19 Gbits/sec  108    297 KBytes       
[  4]  43.00-44.00  sec   554 MBytes  4.65 Gbits/sec   84    149 KBytes       
[  4]  44.00-45.00  sec   558 MBytes  4.68 Gbits/sec   81    778 KBytes       
[  4]  45.00-46.00  sec   480 MBytes  4.03 Gbits/sec  130    210 KBytes       
[  4]  46.00-47.00  sec   461 MBytes  3.87 Gbits/sec  126    489 KBytes       
[  4]  47.00-48.00  sec   471 MBytes  3.95 Gbits/sec  129    717 KBytes       
[  4]  48.00-49.00  sec   536 MBytes  4.50 Gbits/sec   86    437 KBytes       
[  4]  49.00-50.00  sec   559 MBytes  4.69 Gbits/sec   68   1.43 MBytes       
[  4]  50.00-51.00  sec   590 MBytes  4.95 Gbits/sec   31    760 KBytes       
[  4]  51.00-52.00  sec   588 MBytes  4.93 Gbits/sec   29   1.48 MBytes       
[  4]  52.00-53.00  sec   591 MBytes  4.96 Gbits/sec   44    918 KBytes       
[  4]  53.00-54.00  sec   575 MBytes  4.82 Gbits/sec   62    620 KBytes       
[  4]  54.00-55.00  sec   554 MBytes  4.65 Gbits/sec   90   1.48 MBytes       
[  4]  55.00-56.00  sec   549 MBytes  4.60 Gbits/sec   98    577 KBytes       
[  4]  56.00-57.00  sec   581 MBytes  4.88 Gbits/sec   23    856 KBytes       
[  4]  57.00-58.00  sec   550 MBytes  4.61 Gbits/sec   49   1.20 MBytes       
[  4]  58.00-59.00  sec   564 MBytes  4.73 Gbits/sec   55   1.46 MBytes       
[  4]  59.00-60.00  sec   548 MBytes  4.59 Gbits/sec   73    568 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.00  sec  32.8 GBytes  4.69 Gbits/sec  3858             sender
[  4]   0.00-60.00  sec  32.8 GBytes  4.69 Gbits/sec                  receiver
CPU Utilization: local/sender 5.2% (0.1%u/5.1%s), remote/receiver 14.5% (1.0%u/13.5%s)

iperf Done.

I know this test is not a 1-to-1 match with what Elasticsearch does during shard recovery, but I do think it shows that there is no bottleneck between the instances and no per-connection network bandwidth limit.

Can you provide the settings that were enabled/configured during the test that achieved the much higher shard recovery throughput, or any other information you think might help?

  • JVM settings
  • OS settings/changes
  • Elasticsearch settings
  • Shard size
  • Hardware specs of the test that achieved 96 MB/s

Looking at instance metrics (network throughput, disk read/write speeds, latency), nothing stands out as a bottleneck.

Can I provide any additional information (settings/logs/etc) to help figure out what is going on?

Pinging @danielmitterdorfer here, as he ran the benchmark.

Hi,

We did this benchmark on a three-node Elasticsearch cluster running version 6.3.0; each node had a 4GB heap. You can find the detailed hardware and software specs at https://elasticsearch-benchmarks.elastic.co/. The index size was roughly 76GB and we achieved a median replication throughput of 97.5 MB/s (minimum 78 MB/s, maximum 101 MB/s). I have provided the original benchmark at https://gist.github.com/danielmitterdorfer/69359c521a85bfeb7f8cbc46cfb8072c.

Daniel

Thanks for all the info. After some more experimentation I have made some progress on this.

The metrics for our gp2 EBS volume never showed that we were fully utilizing its throughput, or even putting much load on it. On a wild guess I changed our EBS volume to io1, and shard recovery sped up again. It is a bit concerning that, to get the same behavior on a 6.x cluster as on a 2.x cluster, we have to upgrade hardware. It will also cost significantly more money to attain the same performance we had on previous versions.
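For reference, the volume change was done in place with the AWS CLI, roughly like this (volume ID and IOPS are placeholders for ours):

aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type io1 --iops 5000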

I believe the issue we are having is that the concurrency around shard recovery has changed in 6.x, and without the ability to configure concurrent_streams we are no longer able to fully utilize the hardware.

Any other suggestions/settings to try to see if recovery can be improved without needing to change out hardware? The other option is to move to instance storage instead of using EBS.

I'm sure this is a long shot but figure it's worth asking: can the shard recovery settings/concurrency model available in previous versions be added back into current/future versions, so that commodity hardware can be fully utilized without needing to upgrade it to get the same performance?

I'm currently migrating from a 1.7.x cluster to 6.3.2 and ran into this exact same problem. No matter what I set indices.recovery.max_bytes_per_sec to, the max throughput was about 11 MB/s. This is a large cluster with 10 Gb Ethernet, flash storage, and plenty of RAM and CPU, so I was equally stumped. As it happens, I noticed that our previous elasticsearch.yml file had transport.tcp.compress set to true. I didn't really pay attention to this, but I should have. Changing it to false (to disable compression) suddenly brought the bandwidth and recovery performance in line with the indices.recovery.max_bytes_per_sec setting. Is it possible your nodes currently have this set?
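In other words, check whether elasticsearch.yml still carries the old setting, e.g. (path assumes a default package install):

grep -n 'transport.tcp.compress' /etc/elasticsearch/elasticsearch.yml

and if it does, set it to false or remove it entirely (it defaults to false):

transport.tcp.compress: false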


Setting transport.tcp.compress to false fixed the issue; lesson learned: re-evaluate previous cluster settings on new versions. Shard recovery time/speed now matches what I configure in indices.recovery.max_bytes_per_sec.

Thanks for all the replies and help figuring out this issue.
