[Solved] Improving Snapshot Recovery Speed

Hi all,

I am working with a cluster on AWS that is 18TB in size and growing daily. Right now I create daily indices, which are backed up to S3. I am looking at consolidating the data from several servers onto just a few larger ones, and was wondering whether others have had similar experiences.

During the startup of an entirely new cluster, I am making the following changes (taken from the log):

updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
updating [indices.recovery.max_bytes_per_sec] from [40mb] to [2gb]
updating [indices.recovery.concurrent_streams] from [3] to [40]
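
For anyone wanting to reproduce these, here is a minimal sketch of the request body I would send to `PUT /_cluster/settings`. The function name is made up for illustration, and applying the values as `transient` settings is an assumption (the log lines do not show whether they were transient or persistent).

```python
import json


def recovery_settings_body(initial_primaries=50,
                           max_bytes_per_sec="2gb",
                           concurrent_streams=40):
    """Build a cluster settings update body (hypothetical helper).

    Assumption: the values are applied as transient settings, so they
    would be lost on a full cluster restart.
    """
    return {
        "transient": {
            "cluster.routing.allocation.node_initial_primaries_recoveries": initial_primaries,
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
            "indices.recovery.concurrent_streams": concurrent_streams,
        }
    }


if __name__ == "__main__":
    # Print the JSON to send as the body of PUT /_cluster/settings.
    print(json.dumps(recovery_settings_body(), indent=2))
```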

From the _snapshot API:

"daily_backup" : {
  "type" : "s3",
  "settings" : {
    "bucket" : "my-bucket",
    "protocol" : "https",
    "base_path" : "daily",
    "max_restore_bytes_per_sec" : "4096mb",
    "max_snapshot_bytes_per_sec" : "200mb"
  }
}

I realize that there are probably diminishing returns from setting ever larger numbers, but I am only seeing about 1GB/min of restoration per machine from the S3 backups. The data is being written to RAID 0 non-EBS drives. Am I missing something that would help speed this up? The instances I am testing with (d2.2xlarge) must be able to pull data from S3 more quickly than that. I will look at setting max_num_segments = 1 (and redoing all of my backups) in the hope that this helps restoration as well as day-to-day performance. Otherwise I would love to hear suggestions. If more information would be helpful, I am happy to oblige.
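
To put that rate in perspective, here is a rough back-of-the-envelope sketch of how long a full restore would take. The node counts are hypothetical (I have not settled on a cluster size), it assumes 1TB = 1024GB, and it assumes throughput scales linearly with the number of nodes, which may not hold in practice.

```python
def restore_days(total_tb, gb_per_min_per_node, nodes):
    """Estimate full-restore time in days.

    Assumptions: 1 TB = 1024 GB, and every node restores in parallel
    at the same sustained rate.
    """
    total_gb = total_tb * 1024
    minutes = total_gb / (gb_per_min_per_node * nodes)
    return minutes / (60 * 24)


if __name__ == "__main__":
    # 18 TB at the observed 1 GB/min per machine, for a few
    # hypothetical cluster sizes.
    for nodes in (1, 3, 6):
        print(f"{nodes} node(s): {restore_days(18, 1, nodes):.1f} days")
```

At 1GB/min on a single node, 18TB works out to 12.8 days, which is why the per-machine rate matters so much here.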

Note: I made a few changes to the snapshot restoration API that allow me to trigger multiple snapshot restorations at once. https://github.com/elastic/elasticsearch/pull/12258 (I never touch Java, so please don't judge what I did there too harshly.)

It is always the simpler things, isn't it? I have my ES cluster behind a NAT in a private subnet on AWS. Changing the NAT's instance type to one that supports "high" network performance has at least quadrupled the speed. It looks like that was the only bottleneck.

Maybe try removing restore throttling altogether on the restore (set max_restore_bytes_per_sec to 0)?

This recent issue, https://github.com/elastic/elasticsearch/pull/13828, means that ES throttles restores much more than you requested.

If you do see a speedup, please report back!
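
A minimal sketch of what that could look like: re-register the repository via `PUT /_snapshot/daily_backup` with the same settings but `max_restore_bytes_per_sec` set to 0. The helper name is made up for illustration, the settings are copied from the post above, and passing the value as the string "0" is an assumption about how the setting is parsed.

```python
import json

# Repository settings as shown in the original post.
CURRENT_SETTINGS = {
    "bucket": "my-bucket",
    "protocol": "https",
    "base_path": "daily",
    "max_restore_bytes_per_sec": "4096mb",
    "max_snapshot_bytes_per_sec": "200mb",
}


def unthrottled_repo_body(current_settings):
    """Return a repository body with restore throttling set to 0.

    Hypothetical helper: copies the settings so the input is not
    modified, and overrides only max_restore_bytes_per_sec.
    """
    settings = dict(current_settings)
    settings["max_restore_bytes_per_sec"] = "0"  # assumption: "0" is accepted
    return {"type": "s3", "settings": settings}


if __name__ == "__main__":
    # Print the JSON to send as the body of PUT /_snapshot/daily_backup.
    print(json.dumps(unthrottled_repo_body(CURRENT_SETTINGS), indent=2))
```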

Does that setting interact with indices.recovery.max_bytes_per_sec? Should I set both to zero?

I think recovery throttling is not affected by the above bug (only restoring a snapshot), so you shouldn't need to set indices.recovery.max_bytes_per_sec to 0 (unless you separately want to!).