To hopefully add to David's suggestion:
How many snapshots are in that repo?
How many objects are in the bucket/prefix (the one used by this repo)?
How many indices/shards are in that cluster?
Do you force merge to N segments before snapshotting, or do you use force merge at all?
It would be interesting to see whether you'd get the same results on a newer version, given the version bump to the AWS SDK and the changes to the underlying code that makes the S3 repository work. I believe a lot of work has gone into this since 5.4, and the AWS SDK version went up as well.
You don't have a dev cluster with 93TB of data by any chance?
5.4.0 is old though; do you have a plan to move up to 5.6, then 6.8?
Did you try setting a huge (maybe really huge) number in max_retries? Maybe 20 is simply not enough to successfully build the file (blob) list with such a huge snapshot in the same repo (bucket). Did you also try without throttled retries?
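To get a sense of what bumping max_retries actually buys you: assuming a full-jitter exponential backoff with a 500 ms base and a 20 s cap (roughly the v1 AWS SDK's defaults for throttled retries; those exact numbers are an assumption here), the worst-case cumulative wait grows like this:

```python
def max_backoff_seconds(retries, base=0.5, cap=20.0):
    """Worst-case cumulative sleep for capped exponential backoff.

    With full jitter the SDK sleeps a random fraction of each window,
    so the average wait is roughly half of this upper bound.
    """
    return sum(min(cap, base * 2 ** attempt) for attempt in range(retries))

print(max_backoff_seconds(20))  # 20 retries: at most ~311.5 s of waiting
print(max_backoff_seconds(50))  # 50 retries: the cap dominates, ~911.5 s
```

Once the cap kicks in, extra retries only stretch the window linearly, but a much bigger budget may still be enough to ride out a transient throttling episode.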
But yeah, based on my current knowledge of S3 app architecture, this should be considered an application-side bug or misconfiguration, in the sense that you should be able to manage PBs in a single S3 bucket with billions of objects, and the client code should always be able to retry until it gets done, as long as the errors are transient rather than fatal: throttling, 5XX, back pressure, etc.
Even if you reduce your object count by a factor of 10 (or any other factor), someone else could have a snapshot that's simply bigger by that same factor, which would bring them back to the same level.
It's not an S3 problem to have billions of small objects in a bucket/prefix; it just costs more money than storing the same data in fewer objects. It's use-case dependent.
If the s3-repository plugin doesn't retry until it succeeds, with correct exponential backoff policies, when doing atomic operations, then you're in a pickle, or will be at some point in the future. Even more so if you can't set or further control the AWS SDK S3 client used under the hood to customize its retry/throttling behavior for your big use case.
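For illustration, "retry until it succeeds with exponential backoff" amounts to something like this sketch (the SlowDown exception and all parameters are stand-ins for illustration, not the plugin's actual code):

```python
import random
import time

class SlowDown(Exception):
    """Stand-in for a transient S3 throttling error (503 SlowDown)."""

def call_with_backoff(fn, max_attempts=10, base=0.5, cap=20.0, sleep=time.sleep):
    """Retry fn() on throttling, sleeping a full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except SlowDown:
            if attempt == max_attempts - 1:
                raise  # budget truly exhausted: only now is giving up OK
            # full jitter: sleep a random slice of the exponential window
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The point is that a transient SlowDown never bubbles up to the caller until the retry budget is genuinely exhausted.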
Looking at the code, this exception would appear (I'm not an expert on that) to come from the code that builds a representation of what is in the bucket, not the part that actually puts the chunks/blobs, which makes sense to me, since 3500 req/s × 100 MB/req = 350 GB/s. I'm not saying you can't theoretically push that into S3 and make 3500 PUTs of 100 MB per second, but you definitely can't do that with 40 i3.4xlarge instances capped at 1.25 GB/s of network throughput each... IF you are being throttled while PUTing the chunks, it's not at 3500 req/s of 100 MB each, for SURE. But I believe S3 can return this error at a much lower rate while it adjusts the S3 backend to scale up to your required request rate for a prefix.
Which would also point to the application side not tolerating the throttling correctly and fatally giving up instead of adjusting. Profiling the request rate would indeed shed light on this part.
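The back-of-the-envelope math above is easy to check (the 3,500 PUT/s per-prefix baseline is S3's documented figure; the instance count and per-node network cap come from this thread):

```python
# S3's documented per-prefix PUT baseline vs. this cluster's network ceiling.
PUTS_PER_SEC = 3_500          # S3 per-prefix PUT/COPY/POST/DELETE baseline
CHUNK_MB = 100                # snapshot chunk size assumed in the thread
NODES = 40                    # i3.4xlarge instances
NODE_NET_GB_S = 1.25          # approximate per-node network cap

s3_put_ceiling_gb_s = PUTS_PER_SEC * CHUNK_MB / 1000   # 350.0 GB/s
cluster_net_cap_gb_s = NODES * NODE_NET_GB_S           # 50.0 GB/s

print(s3_put_ceiling_gb_s, cluster_net_cap_gb_s)
```

So the cluster tops out at a seventh of the bandwidth the per-prefix PUT ceiling represents; if throttling is happening, it isn't on full-size chunk PUTs at 3,500 req/s.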
Maybe it's Listing/GETing/HEADing to sync its local representation of the bucket, and that's when the throttling triggers enough retries to make the plugin give up.
Maybe @DavidTurner knows for sure whether part of the operation actually consists of building/syncing such a local representation of the bucket content; most S3 apps need to do that, but I'm not familiar enough with the s3-repository logic to know for sure.
For your specific question, to help your investigation of the request rate produced by the plugin:
- How Do I Configure Request Metrics for an S3 Bucket?
- If you have many unrelated things in that same bucket: How Do I Configure a Request Metrics Filter?
- Concept and metric description
- Concept continued and config
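As a sketch of what enabling those filtered request metrics looks like programmatically (the bucket name, metric Id, and prefix below are hypothetical; adapt them to your repo's base_path):

```python
def request_metrics_config(metrics_id, prefix):
    """Build the MetricsConfiguration payload for boto3's
    put_bucket_metrics_configuration call."""
    return {"Id": metrics_id, "Filter": {"Prefix": prefix}}

# With boto3 installed and credentials configured, the call would look like:
#
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_metrics_configuration(
#       Bucket="my-snapshot-bucket",   # hypothetical bucket name
#       Id="es-snapshots",
#       MetricsConfiguration=request_metrics_config("es-snapshots",
#                                                   "elasticsearch/"),
#   )
```

After that, per-request CloudWatch metrics (AllRequests, PutRequests, GetRequests, 5xxErrors, etc.) show up for just that prefix.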
Logs approach (then count/aggregate/analyse those logs to derive request rates):
- Server Access Logging delivered to S3
- Cloudtrail (with S3 data event logging enabled)
So for CloudTrail it would mean enabling S3 data events, then sending CloudTrail to an S3 bucket, and then either downloading that for analysis, stuffing it into something like ES, or querying it with Athena or something else, to derive the metrics you're interested in.
You could also send CloudTrail to CloudWatch Logs and derive metrics from the logs via either a CloudWatch Logs metric filter or CloudWatch Logs Insights.
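If you go the download-and-analyse route, deriving a request rate from the CloudTrail files is just counting eventName per eventTime second; a minimal sketch over the {"Records": [...]} JSON that CloudTrail delivers:

```python
import json
from collections import Counter

def cloudtrail_request_rates(log_file_body):
    """Count API calls per (second, eventName) from one CloudTrail
    log file body (the {"Records": [...]} JSON delivered to S3)."""
    counts = Counter()
    for rec in json.loads(log_file_body)["Records"]:
        counts[(rec["eventTime"], rec["eventName"])] += 1
    return counts
```

Run that across all the delivered files for the snapshot window and the hot seconds (and which operations they hit) fall right out.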
For server access logging, it means enabling it for the bucket and then downloading the logs / stuffing them into a tool for analysis, like ES, Athena, or something else.
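For the server access logs, a stripped-down parser is enough to get per-second counts by operation (this regex assumes the standard field order: owner, bucket, [time], remote IP, requester, request id, operation, ...):

```python
import re
from collections import Counter

# Grab the bracketed timestamp, skip remote IP / requester / request id,
# then capture the operation field (e.g. REST.PUT.OBJECT).
LINE_RE = re.compile(r"\[([^\]]+)\]\s+\S+\s+\S+\s+\S+\s+(\S+)")

def access_log_rates(lines):
    """Count requests per (second, operation) from S3 server access logs."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts
```

Since the access-log timestamp has one-second granularity, grouping on it directly gives you the requests-per-second profile without any extra bucketing.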
I presented the options in the order I recommend you try them (from easiest to hardest).
I would also recommend you ask AWS support how you should obtain the numbers they are asking for... Maybe they can get it from their backend without you doing anything, or maybe they have a suggestion I didn't mention/know about?
Other possibilities could involve routing all the requests from the s3-repository plugin through a proxy like Squid with access logs enabled, and then doing the same log analysis as above, but on those logs...
Maybe ES can be configured to log at debug level in a way that makes this plugin configure the AWS SDK it uses to log, into an ES log file on each node, every request it makes to S3. I don't know if that's possible or how to do it if it is; it would require someone else to enlighten you on that. Then gather and analyse as above.