Hey,
I'm running daily backups of my ES cluster (using AWS Data Pipeline).
Every now and then, my backup fails due to AmazonS3Exception status code 500 (see below).
When it fails, my scheduler (i.e., Data Pipeline) automatically deletes the PARTIAL snapshot and re-creates it.
However, each retry takes hours, and I'm trying to shorten this process.
I've contacted AWS support, and they're saying that:
They need the request ID AND the extended request ID, whereas the AWS cloud plugin logs only the request ID.
5xx errors are actually to be expected as part of normal interaction with the S3 service.
Has anyone encountered such errors?
Alternatively, is there a way to get the extended request ID from the AWS cloud plugin?
Thanks!
The output from the snapshot curl:
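(For reference, that output comes from the standard snapshot info call; localhost here is just a placeholder for one of the cluster nodes:)

curl -XGET 'http://localhost:9200/_snapshot/my-repo/mysnapshot-08-05-2016?pretty'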
"snapshots": [ {
   "snapshot": "mysnapshot-08-05-2016",
   "indices": [ "aaa", "bbb", "ccc", "ddd" ],
   "state": "PARTIAL",
   "start_time": "2016-05-08T10:22:26.980Z",
   "start_time_in_millis": 1462702946980,
   "end_time": "2016-05-08T13:36:45.543Z",
   "end_time_in_millis": 1462714605543,
   "duration_in_millis": 11658563,
   "failures": [ {
      "node_id": "xxxyyyzzz",
      "index": "aaa",
      "reason": "IndexShardSnapshotFailedException[[aaa][18] We encountered an internal error. Please try again. (Service: Amazon S3; Status Code: 500; Error Code: InternalError; Request ID: 3EA6454E835F5977)]; nested: AmazonS3Exception[We encountered an internal error. Please try again. (Service: Amazon S3; Status Code: 500; Error Code: InternalError; Request ID: 3EA6454E835F5977)];",
      "shard_id": 18,
      "status": "INTERNAL_SERVER_ERROR"
   } ],
   "shards": {
      "total": XX,
      "failed": 1,
      "successful": YY
   }
} ]
If anyone's interested, it seems like the reason for these failures was an excessive PUT request rate to S3.
Since the default chunk_size is 100mb and the default buffer_size is 5mb (see https://github.com/elastic/elasticsearch-cloud-aws/tree/es-1.5), each file is broken into many parts (uploaded using the AWS multipart upload API), which in turn results in an excessive request rate.
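To put rough numbers on it (illustrative sizes only, and on the assumption that the plugin uploads one multipart part per buffer_size worth of data):

50 GB of shard data / 5 MB per part = ~10,000 PUT (upload part) requests
50 GB of shard data / 100 MB per part = ~500 PUT (upload part) requests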
After discussing it with AWS support and reviewing a similar issue in GitHub (https://github.com/elastic/elasticsearch/issues/17244), I've changed my repo settings to:
{
"my-repo": {
"type": "s3",
"settings": {
"bucket": “my-repo",
"chunk_size": "1gb",
"max_restore_bytes_per_sec": "8000mb",
"max_retries": "30",
"buffer_size": "100mb",
"max_snapshot_bytes_per_sec": "8000mb"
}
}
}
Increasing the buffer_size and chunk_size seems to have solved the problem (at least for now).
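For anyone who wants to apply the same change, this is roughly the call to (re-)register the repository with those settings (localhost is a placeholder for one of the cluster nodes; as far as I know, PUT-ing an existing repository simply updates its settings):

curl -XPUT 'http://localhost:9200/_snapshot/my-repo' -d '{
  "type": "s3",
  "settings": {
    "bucket": "my-repo",
    "chunk_size": "1gb",
    "buffer_size": "100mb",
    "max_retries": "30",
    "max_snapshot_bytes_per_sec": "8000mb",
    "max_restore_bytes_per_sec": "8000mb"
  }
}'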
Hey,
As you can see from my comment above, the settings you use are similar to mine (except for the protocol, which we don't set; I think the default is https).
With those settings, I'm able to take snapshots of indices of up to a few TBs each.
Just to verify:
What errors are you seeing?
Are you using the same versions mentioned above (ES 1.5.2 and plugin 2.5.1)?
We have 5 shards per index, and indices are between 2 GB and 10 GB, so each shard is roughly 400 MB to 2 GB (I guess?).
When the snapshot fails, it always gives us an error message like the one listed below.
I am kicking these snapshots off through kopf, and it shows IN_PROGRESS for a while and then goes to PARTIAL.
[2016-10-07 14:28:31,387][WARN ][snapshots ] [es-pa002] [[ethan-pa-raw-2016-05-01][1]] [s3:ethan-pa-raw-2016-05-01] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: [ethan-pa-raw-2016-05-01][1] Invalid Argument (Service: Amazon S3; Status Code: 400; Error Code: InvalidArgument; Request ID: 626dd09c-9356-4f3a-b389-a6f6e66526c7)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.snapshot(BlobStoreIndexShardRepository.java:150)
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.snapshot(IndexShardSnapshotAndRestoreService.java:85)
at org.elasticsearch.snapshots.SnapshotsService$5.run(SnapshotsService.java:817)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Invalid Argument (Service: Amazon S3; Status Code: 400; Error Code: InvalidArgument; Request ID: 626dd09c-9356-4f3a-b389-a6f6e66526c7), S3 Extended Request ID: null
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1127)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:743)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:462)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:297)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3672)
at com.amazonaws.services.s3.AmazonS3Client.doUploadPart(AmazonS3Client.java:2808)
at com.amazonaws.services.s3.AmazonS3Client.uploadPart(AmazonS3Client.java:2793)
at org.elasticsearch.cloud.aws.blobstore.DefaultS3OutputStream.doUploadMultipart(DefaultS3OutputStream.java:215)
at org.elasticsearch.cloud.aws.blobstore.DefaultS3OutputStream.uploadMultipart(DefaultS3OutputStream.java:187)
at org.elasticsearch.cloud.aws.blobstore.DefaultS3OutputStream.flush(DefaultS3OutputStream.java:83)
at org.elasticsearch.cloud.aws.blobstore.S3OutputStream.flushBuffer(S3OutputStream.java:71)
at org.elasticsearch.cloud.aws.blobstore.S3OutputStream.write(S3OutputStream.java:79)
at java.io.OutputStream.write(Unknown Source)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$SnapshotContext.snapshotFile(BlobStoreIndexShardRepository.java:557)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$SnapshotContext.snapshot(BlobStoreIndexShardRepository.java:500)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.snapshot(BlobStoreIndexShardRepository.java:140)
... 5 more
[2016-10-07 14:28:36,217][INFO ][snapshots ] [es-pa002] snapshot [s3:ethan-pa-raw-2016-05-01] is done
I see.
That's not a lot of data so taking a snapshot shouldn't be a problem in theory.
However, what you've mentioned looks a bit different from the problem we had, since we got "Status Code: 500; Error Code: InternalError", and you're getting "Status Code: 400; Error Code: InvalidArgument".
Have you tried contacting AWS support?
I think my problem relates back to the earlier comments you made regarding the 'Extended ID'?
They need the request ID AND the extended request ID, whereas the AWS cloud plugin logs only the request ID.
Has anyone encountered such errors?
Alternatively, is there a way to get the extended request ID from the AWS cloud plugin?
We are not using an actual S3 service but an internal S3 equivalent provided by CleverSafe.
Do you think that maybe the earlier version of the AWS Cloud plugin just does not support this 'Extended ID' mechanism and that this is required when files get bigger than some trivial size?
I'm not really sure who I can ask for this level of detailed analysis so just wondered if anyone else had encountered the issue?
I've also fired off a request to CleverSafe to see if they can shed any light on the issue.
Hey,
The plugin, at least in the version we're using, doesn't display the extended ID, so AWS support were unable to locate the specific request in their logs, but they were able to give me an idea of the possible root cause.
I'm not sure what the possible root causes are for the error you're getting (with status code 400), but perhaps your service provider (CleverSafe) can help.
Sorry I can't help you more...
We're upgrading the cluster to ES 1.7 + AWS Cloud plugin 2.7.1 to see if that improves things for now...
The plan is to be at ES 2.4 soon enough, but we need to back up the cluster to S3 first, hence the 1.5 -> 1.7 step for now.
We have upgraded some of the nodes to 1.7, but the AWS Cloud plugin still fails; maybe once they're all at ES 1.7 it will be better, as all of them will then have AWS 2.7.1.
In terms of signing, we only use the HTTP (not HTTPS) protocol, so that shouldn't be an issue, should it?
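For clarity, by protocol I mean the plugin's protocol setting in elasticsearch.yml, which, if I've read the cloud-aws docs correctly, defaults to https when it isn't set, e.g.:

cloud.aws.protocol: http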
OK, thanks. Our provider is IBM CleverSafe, so I'm sure they're using the latest AWS API spec...
Do you know if all nodes that are involved in a snapshot (I'm assuming that means all nodes which contain shards involved in the snapshot) need to be at the appropriate level?
I assume that the snapshot is initiated by a particular node, but the plugin handles this on each node as required.
I am hoping that once all nodes are on ES 1.7, and therefore all have AWS 2.7.1, the snapshots will work using the Extended Request ID mechanism, which I think is perhaps the issue with the AWS 2.5.1 plugin?