ES version: 5.6.4, using Docker 18.03
We have a large index (1.1TB, compressed) set up to take snapshots to S3 using the repository-s3 plugin, and have already successfully snapshotted a previous version of the index. We then topped off the index with a large new batch of documents, and snapshotting no longer works. We get the following error (bucket and index names obfuscated for security reasons):
$ curl -X PUT "http://localhost:9200/_snapshot/ourbucketname/ourindexname_20180830203153?wait_for_completion=true" -d '{"indices": "ourindexname", "ignore_unavailable": true, "include_global_state": true}'
{"error":{"root_cause":[{"type":"snapshot_creation_exception","reason":"[ourbucketname:ourindexname_20180830203153/12EN27IYRhOCZ6oZLTlW6g] failed to create snapshot"}],"type":"snapshot_creation_exception","reason":"[ourbucketname:ourindexname_20180830203153/12EN27IYRhOCZ6oZLTlW6g] failed to create snapshot","caused_by":{"type":"i_o_exception","reason":"Unable to upload object meta-12EN27IYRhOCZ6oZLTlW6g.dat","caused_by":{"type":"amazon_s3_exception","reason":"amazon_s3_exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 1624E370C9FD6E05)"}}},"status":500}
The repository is indeed registered correctly, as verified by GET'ing the _snapshot endpoint.
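The check was essentially the following (repository name obfuscated as above), and it returns the repository with "type": "s3" and the expected bucket setting:
$ curl -X GET "http://localhost:9200/_snapshot/ourbucketname?pretty"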
As I mentioned, we had this working previously and have not changed any of the IAM roles or permissions. The IAM config looks like this (the recommended settings straight from the horse's mouth, i.e. the repository-s3 plugin docs):
{
  "Statement": [
    {
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::ourbucketname"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::ourbucketname/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
We can list, put, and get objects using the AWS CLI from every one of the nodes in our cluster (confirming that the IAM role is working, and we double-checked that there is no ~/.aws directory confounding things here).
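The spot checks were along these lines (the file and key names here are just illustrative):
$ aws s3 ls s3://ourbucketname/
$ aws s3 cp /tmp/canary.txt s3://ourbucketname/canary.txt
$ aws s3 cp s3://ourbucketname/canary.txt /tmp/canary-check.txt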
Two things have changed:
(1) AWS now requires Signature Version 4 (SigV4) request signing for S3 access. Not sure when AWS started enforcing the newer signing protocol, however.
This required doing the following (as per this link https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingAWSSDK.html):
aws configure set s3.signature_version s3v4
ES_JAVA_OPTS=-Dcom.amazonaws.services.s3.enableV4
[The latter was added to the JVM options in the Docker settings -- after the -Xms/-Xmx values.]
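Concretely, the container environment ends up with something like this (the heap sizes here are illustrative, not our actual values):
ES_JAVA_OPTS=-Xms16g -Xmx16g -Dcom.amazonaws.services.s3.enableV4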
And (2) we up-sized the EC2 instances (doing a rolling restart). The snapshotting failures began, however, before we did this.
We even created a fresh new bucket with identical IAM access permissions (by adding it to the IAM policy above); this made no difference. Nor did toggling the bucket's server-side default encryption on and off (or switching between encryption types).
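The new bucket's repository was registered the usual way, along the lines of (the repository/bucket name here is illustrative):
$ curl -X PUT "http://localhost:9200/_snapshot/ourtestbucket" -d '{"type": "s3", "settings": {"bucket": "ourtestbucket"}}'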
We have not changed the version of the repository-s3 plugin, as it is baked into the Docker image.
We have a dev/QA clone of this cluster snapshotting to a separate bucket under the identical IAM policy (that bucket is in fact listed in the self-same policy above -- I just left it out for brevity). Those snapshot runs have had no issues.
We are at our wits' end. Please advise.