Snapshot to AWS S3 fails (ES 1.5.2, AWS cloud plugin 2.5.1)

Hey,
I'm running daily backups of my ES cluster (using AWS Data Pipeline).
Every now and then, my backup fails due to an AmazonS3Exception with status code 500 (see below).
When it fails, my scheduler (i.e., Data Pipeline) automatically deletes the PARTIAL snapshot and re-creates the snapshot.
However, each retry takes hours, and I'm trying to shorten this process.

I've contacted AWS support, and they're saying that:

  1. They need the request ID AND the extended request ID, whereas the AWS cloud plugin logs only the request ID.
  2. 5xx errors are actually to be expected as part of normal interaction with the S3 service.

Has anyone encountered such errors?
Alternatively, is there a way to get the extended request ID from the AWS cloud plugin?
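For anyone digging into this at the code level: the AWS SDK for Java does expose the extended request ID on the exception via AmazonS3Exception.getExtendedRequestId(); the plugin just doesn't include it in the message it logs. A rough sketch of what a patched catch block could look like (client, request and logger here are placeholders, not the plugin's actual code):

try {
    client.putObject(request);
} catch (com.amazonaws.services.s3.model.AmazonS3Exception e) {
    // getRequestId() is what the plugin already logs;
    // getExtendedRequestId() is the second identifier AWS support asks for.
    logger.warn("S3 upload failed: requestId=" + e.getRequestId()
            + ", extendedRequestId=" + e.getExtendedRequestId());
    throw e;
}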

Thanks!

The output from the snapshot curl:
{
  "snapshots": [ {
    "snapshot": "mysnapshot-08-05-2016",
    "indices": [ "aaa", "bbb", "ccc", "ddd" ],
    "state": "PARTIAL",
    "start_time": "2016-05-08T10:22:26.980Z",
    "start_time_in_millis": 1462702946980,
    "end_time": "2016-05-08T13:36:45.543Z",
    "end_time_in_millis": 1462714605543,
    "duration_in_millis": 11658563,
    "failures": [ {
      "node_id": "xxxyyyzzz",
      "index": "aaa",
      "reason": "IndexShardSnapshotFailedException[[aaa][18] We encountered an internal error. Please try again. (Service: Amazon S3; Status Code: 500; Error Code: InternalError; Request ID: 3EA6454E835F5977)]; nested: AmazonS3Exception[We encountered an internal error. Please try again. (Service: Amazon S3; Status Code: 500; Error Code: InternalError; Request ID: 3EA6454E835F5977)];",
      "shard_id": 18,
      "status": "INTERNAL_SERVER_ERROR"
    } ],
    "shards": {
      "total": XX,
      "failed": 1,
      "successful": YY
    }
  } ]
}

My repo settings are:
{
  "my-repo": {
    "type": "s3",
    "settings": {
      "bucket": "my-repo",
      "max_restore_bytes_per_sec": "8000mb",
      "max_snapshot_bytes_per_sec": "8000mb"
    }
  }
}

If anyone's interested, it seems the reason for these failures is an excessive PUT request rate to S3.
Since the default chunk_size is 100mb and buffer_size is 5mb (see https://github.com/elastic/elasticsearch-cloud-aws/tree/es-1.5), each file is broken into many parts (uploaded using the AWS multipart upload API), which in turn results in an excessive request rate.
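To put rough numbers on it (assuming each chunk larger than buffer_size is uploaded via the multipart API in buffer_size-sized parts):

A 10gb file with chunk_size 100mb and buffer_size 5mb:
    100 chunks x (100mb / 5mb) = 2,000 part-upload PUTs (plus initiate/complete calls per chunk)
The same file with chunk_size 1gb and buffer_size 100mb:
    10 chunks x (1gb / 100mb) = 100 part-upload PUTs

So the same data generates roughly 20x fewer S3 requests.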
After discussing it with AWS support and reviewing a similar issue on GitHub (https://github.com/elastic/elasticsearch/issues/17244), I've changed my repo settings to:
{
  "my-repo": {
    "type": "s3",
    "settings": {
      "bucket": "my-repo",
      "chunk_size": "1gb",
      "max_restore_bytes_per_sec": "8000mb",
      "max_retries": "30",
      "buffer_size": "100mb",
      "max_snapshot_bytes_per_sec": "8000mb"
    }
  }
}
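For anyone who wants to apply the same change, re-registering the repository is a single PUT (a sketch; the host and repository name here are from my setup):

curl -XPUT 'http://localhost:9200/_snapshot/my-repo' -d '{
  "type": "s3",
  "settings": {
    "bucket": "my-repo",
    "chunk_size": "1gb",
    "buffer_size": "100mb",
    "max_retries": "30",
    "max_restore_bytes_per_sec": "8000mb",
    "max_snapshot_bytes_per_sec": "8000mb"
  }
}'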

Increasing the buffer_size and chunk_size seems to solve the problem (at least for now).

I've been attempting to tweak settings for various snapshots but with limited success...

These are my settings:

{
  "type": "s3",
  "settings": {
    "bucket": "monitoring",
    "protocol": "http",
    "chunk_size": "1gb",
    "max_restore_bytes_per_sec": "8000mb",
    "max_retries": "30",
    "buffer_size": "100mb",
    "max_snapshot_bytes_per_sec": "8000mb"
  }
}

I'm successfully making snapshots of indices up to ~3.5Gb (the biggest so far is 3.31Gb; 3.75Gb fails) using the above settings.

The S3 service is an internal S3-compliant service, so we can potentially tweak some settings there as well, but as yet we don't have access to that.

I'm going to need to snapshot indices up to about 8Gb in size, so I wondered if anyone could recommend settings which might work for this.

If we can't get this working, we will probably have to use shared-filesystem-based snapshots, which would be far from ideal.

Hey,
As you can see from my comment above, the settings you're using are similar to mine (except the protocol, which we don't set; I think the default is https).
With those settings, I'm able to take snapshots of indices of up to a few TBs each.

Just to verify:

  1. What errors are you seeing?
  2. Are you using the same versions mentioned above (ES 1.5.2 and plugin 2.5.1)?
  3. What's the average shard size in those indices?
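For question 3, shard sizes can be pulled straight from the cat API (a sketch; replace my-index with one of your index names):

curl 'http://localhost:9200/_cat/shards/my-index?v'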

Hi,

  1. Yes, I'm using ES 1.5.2 and plugin 2.5.1.

  2. We have 5 shards per index, and indices are between 2Gb and 10Gb, so shards are roughly 400Mb to 2Gb (I guess?).

  3. When the snapshot fails, it always gives us an error message like the one listed below.

I am kicking these snapshots off through kopf; it shows IN_PROGRESS for a while and then goes to PARTIAL.

[2016-10-07 14:28:31,387][WARN ][snapshots ] [es-pa002] [[ethan-pa-raw-2016-05-01][1]] [s3:ethan-pa-raw-2016-05-01] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: [ethan-pa-raw-2016-05-01][1] Invalid Argument (Service: Amazon S3; Status Code: 400; Error Code: InvalidArgument; Request ID: 626dd09c-9356-4f3a-b389-a6f6e66526c7)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.snapshot(BlobStoreIndexShardRepository.java:150)
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.snapshot(IndexShardSnapshotAndRestoreService.java:85)
at org.elasticsearch.snapshots.SnapshotsService$5.run(SnapshotsService.java:817)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Invalid Argument (Service: Amazon S3; Status Code: 400; Error Code: InvalidArgument; Request ID: 626dd09c-9356-4f3a-b389-a6f6e66526c7), S3 Extended Request ID: null
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1127)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:743)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:462)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:297)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3672)
at com.amazonaws.services.s3.AmazonS3Client.doUploadPart(AmazonS3Client.java:2808)
at com.amazonaws.services.s3.AmazonS3Client.uploadPart(AmazonS3Client.java:2793)
at org.elasticsearch.cloud.aws.blobstore.DefaultS3OutputStream.doUploadMultipart(DefaultS3OutputStream.java:215)
at org.elasticsearch.cloud.aws.blobstore.DefaultS3OutputStream.uploadMultipart(DefaultS3OutputStream.java:187)
at org.elasticsearch.cloud.aws.blobstore.DefaultS3OutputStream.flush(DefaultS3OutputStream.java:83)
at org.elasticsearch.cloud.aws.blobstore.S3OutputStream.flushBuffer(S3OutputStream.java:71)
at org.elasticsearch.cloud.aws.blobstore.S3OutputStream.write(S3OutputStream.java:79)
at java.io.OutputStream.write(Unknown Source)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$SnapshotContext.snapshotFile(BlobStoreIndexShardRepository.java:557)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$SnapshotContext.snapshot(BlobStoreIndexShardRepository.java:500)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.snapshot(BlobStoreIndexShardRepository.java:140)
... 5 more
[2016-10-07 14:28:36,217][INFO ][snapshots ] [es-pa002] snapshot [s3:ethan-pa-raw-2016-05-01] is done

I see.
That's not a lot of data, so in theory taking a snapshot shouldn't be a problem.
However, what you've mentioned looks a bit different from the problem we had, since we got "Status Code 500; Error Code InternalError", and you're getting "Status Code: 400; Error Code: InvalidArgument".
Have you tried contacting AWS support?

Hi,

I think my problem relates back to the earlier comment you made regarding the extended request ID:

  1. They need the request ID AND the extended request ID, whereas the AWS cloud plugin logs only the request ID.

Has anyone encountered such errors?
Alternatively, is there a way to get the extended request ID from the AWS cloud plugin?

We are not using an actual S3 service but an internal S3 equivalent provided by CleverSafe.

Do you think that maybe the earlier version of the AWS Cloud plugin just does not support this extended request ID mechanism, and that it is required when files get bigger than some trivial size?

I'm not really sure who I can ask for this level of detailed analysis, so I just wondered if anyone else had encountered the issue.

I've also fired off a request to CleverSafe to see if they can shed any light on the issue.

Hey,
The plugin, at least in the version we're using, doesn't display the extended ID, so AWS support were unable to locate the specific request in their logs, but they were able to give me an idea of the possible root cause.
I'm not sure what the possible root causes are for the error you're getting (status code 400), but perhaps your service provider (CleverSafe) can help.
Sorry I can't help you more...

Maybe you need to use a specific signer? See https://www.elastic.co/guide/en/elasticsearch/plugins/current/cloud-aws-usage.html#cloud-aws-usage-signer

IIRC this option does not work for old versions, so maybe you have to upgrade to 1.7 first.
Just run a test from another test cluster.
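For reference, the signer is set in elasticsearch.yml on each node, e.g. (a sketch; S3SignerType is the older v2-style signer, and the docs page above lists which values your plugin version accepts):

cloud.aws.signer: S3SignerType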

We're upgrading the cluster to ES1.7 + AWS2.7.1 to see if that improves things for now...

The plan is to be at ES2.4 soon enough, but we need to back up the cluster to S3 first, hence the 1.5 -> 1.7 step for now.

We have upgraded some of the nodes to 1.7, but the AWS Cloud plugin still fails; maybe when they're all at ES1.7 it will be better, as all will then have AWS2.7.1.

In terms of signing, we only use the HTTP (not HTTPS) protocol, so that should not be an issue, should it?

The signer setting relates to AWS API versions.
Basically, it depends on whether your provider is compatible with the latest AWS API or is still using an old one.

OK thanks, our provider is IBM CleverSafe, so I'm sure they're using the latest AWS API spec...

Do you know if all nodes that are involved in a snapshot (I'm assuming all nodes which contain shards involved in the snapshot) need to be at the appropriate level?

I assume that the snapshot is initiated by a particular node but the plugin handles this on each node as required.

I am hoping that as soon as all nodes are at ES1.7 and the whole cluster has AWS2.7.1, the snapshots will work using the extended request ID mechanism, which I think is perhaps the issue with the AWS2.5.1 plugin.
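In case it's useful, a quick way to confirm what each node is actually running mid-upgrade (a sketch; both cat endpoints exist in 1.x):

curl 'http://localhost:9200/_cat/nodes?v&h=name,version'
curl 'http://localhost:9200/_cat/plugins?v'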

All nodes which have a primary shard you want to back up are involved.

I don't know anything about the extended ID.

We've now got ES1.7 and AWS2.7.1 across all 7 nodes but still get this error on some of them:

IndexShardSnapshotFailedException[[ethan-pa-raw-2016-05-01][3] Failed to perform snapshot (index files)]; nested: IOException[Unable to upload multipart request [null] for object indices/ethan-pa-raw-2016-05-01/3/__4e due to AmazonS3Exception: Invalid Argument (Service: Amazon S3; Status Code: 400; Error Code: InvalidArgument; Request ID: 090c17c9-552c-4814-80ff-80ed0fa20619)];

This error occurred for 3 of the 5 shards.

Maybe your CleverSafe service does not support multipart upload?

Maybe increase buffer_size to 5gb so it will always use a single upload?

Read https://github.com/elastic/elasticsearch-cloud-aws/tree/es-1.7#s3-repository
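Something like this, as a sketch (your settings from above with only buffer_size changed; the idea is that anything smaller than buffer_size is uploaded with a single PUT instead of the multipart API):

{
  "type": "s3",
  "settings": {
    "bucket": "monitoring",
    "protocol": "http",
    "chunk_size": "1gb",
    "buffer_size": "5gb",
    "max_retries": "30",
    "max_restore_bytes_per_sec": "8000mb",
    "max_snapshot_bytes_per_sec": "8000mb"
  }
}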

It rejects the attempt to set buffer_size to 5gb as too big.

It was previously set to 100mb, which I would have thought would have meant a multipart upload even for the 3Gb indices?

So try just under 5gb. Like 4gb?

If it was set to 100mb and you had files bigger than 100mb then I guess multipart upload was used.