Hi all,
I have a cluster that I am working with on AWS that is 18TB in size and growing daily. Right now I create daily indices which are backed up to S3. I am looking at consolidating the data from several servers onto just a few larger ones and was wondering if others have had similar experiences.
During the startup of an entirely new cluster, I am making the following changes (taken from the log):
updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
updating [indices.recovery.max_bytes_per_sec] from [40mb] to [2gb]
updating [indices.recovery.concurrent_streams] from [3] to [40]
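For reference, these are applied in one call to the cluster settings API; this is roughly what I run (transient settings, localhost assumed):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.node_initial_primaries_recoveries" : 50,
    "indices.recovery.max_bytes_per_sec" : "2gb",
    "indices.recovery.concurrent_streams" : 40
  }
}'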
From the _snapshot API:
"daily_backup" : {
"type" : "s3",
"settings" : {
"bucket" : "my-bucket",
"protocol" : "https",
"base_path" : "daily",
"max_restore_bytes_per_sec" : "4096mb",
"max_snapshot_bytes_per_sec" : "200mb"
}
}
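For anyone who wants to reproduce this, the repository registration that produces the settings above looks roughly like the following (credentials/region settings omitted). As far as I know, re-running the PUT with a higher max_restore_bytes_per_sec updates the throttle on the existing repository without touching the snapshots themselves.

curl -XPUT 'http://localhost:9200/_snapshot/daily_backup' -d '{
  "type" : "s3",
  "settings" : {
    "bucket" : "my-bucket",
    "protocol" : "https",
    "base_path" : "daily",
    "max_restore_bytes_per_sec" : "4096mb",
    "max_snapshot_bytes_per_sec" : "200mb"
  }
}'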
I realize that there are probably diminishing returns when setting larger values, but I am only seeing about 1GB/min of restore throughput per machine from the S3 backups. The data is being written to RAID0 instance-store (non-EBS) drives. Am I possibly missing something that would help speed this up? The instances I am testing with (d2.2xlarge) should be able to pull data from S3 faster than that.

I will also take a look at optimizing indices down to max_num_segments = 1 (and redoing all of my backups) in the hope that fewer segments helps restore speed as well as day-to-day performance. Otherwise I would love to hear suggestions. If more information would be helpful, I am happy to oblige.
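In case it is useful, this is a sketch of how I am kicking off and watching a single restore while testing; the snapshot and index names are just placeholders, and the segment-count experiment uses the _optimize endpoint (as it is called on the 1.x/2.0 versions I am assuming here):

# start a restore without waiting for it to finish
curl -XPOST 'http://localhost:9200/_snapshot/daily_backup/snapshot-2015.07.01/_restore' -d '{
  "indices" : "logs-2015.07.01"
}'

# watch per-shard recovery rates while it runs
curl 'http://localhost:9200/_cat/recovery?v'

# merge an index down to one segment before re-snapshotting it
curl -XPOST 'http://localhost:9200/logs-2015.07.01/_optimize?max_num_segments=1'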
Note: I made a few changes to the snapshot restore API that allow me to trigger multiple simultaneous snapshot restorations at once: https://github.com/elastic/elasticsearch/pull/12258 (I never touch Java, so please don't judge what I did there too harshly.)