Repository-s3: snapshot_creation_exception, unable to load object


(Dennis N. Mehay) #1

ES version: 5.6.4, using Docker 18.03

We have a large index (1.1TB, compressed) set up to take snapshots to S3 using the repository-s3 plugin, and we had already successfully snapshotted a previous version of the index. We then topped off the index with a new, large batch of documents, and snapshotting no longer works. We get the following error (obfuscated for security reasons):

$ curl -X PUT "http://localhost:9200/_snapshot/ourbucketname/ourindexname_20180830203153?wait_for_completion=true" -d '{"indices": "ourindexname", "ignore_unavailable": true, "include_global_state": true}'


{"error":{"root_cause":[{"type":"snapshot_creation_exception","reason":"[ourbucketname:ourindexname_20180830203153/12EN27IYRhOCZ6oZLTlW6g] failed to create snapshot"}],"type":"snapshot_creation_exception","reason":"[ourbucketname:ourindexname_20180830203153/12EN27IYRhOCZ6oZLTlW6g] failed to create snapshot","caused_by":{"type":"i_o_exception","reason":"Unable to upload object meta-12EN27IYRhOCZ6oZLTlW6g.dat","caused_by":{"type":"amazon_s3_exception","reason":"amazon_s3_exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 1624E370C9FD6E05)"}}},"status":500}

The repository is indeed registered correctly, as verified by GET'ing the _snapshot endpoint.
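For reference, the checks we ran look like this (repository name taken from the error above; the `?pretty` flag is just for readability):

```shell
# Inspect the registered repository's settings (bucket, base_path, region, etc.):
curl -X GET "http://localhost:9200/_snapshot/ourbucketname?pretty"

# List snapshots already stored there -- the earlier successful one shows up:
curl -X GET "http://localhost:9200/_snapshot/ourbucketname/_all?pretty"
```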

As I mentioned, we had this working previously and have not changed any of the IAM roles or permissions. The IAM config looks like this (the recommended settings from the horse's mouth):

{
    "Statement": [
        {
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:ListBucketMultipartUploads",
                "s3:ListBucketVersions"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::ourbucketname"
            ]
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::ourbucketname/*"
            ]
        }
    ],
    "Version": "2012-10-17"
}

We can list, put, and get objects using the AWS CLI from every node in our cluster (confirming that the IAM role is working, and we double-checked that there is no .aws directory confounding things here).
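The per-node sanity checks were along these lines (file names and paths are illustrative; credentials come from the instance profile only, with no ~/.aws/credentials present):

```shell
# Confirm which identity/role the instance is actually using:
aws sts get-caller-identity

# Exercise the same permissions the plugin needs: list, put, get, delete.
aws s3 ls s3://ourbucketname/
echo test > /tmp/iam-probe.txt
aws s3 cp /tmp/iam-probe.txt s3://ourbucketname/iam-probe.txt
aws s3 cp s3://ourbucketname/iam-probe.txt /tmp/iam-probe-back.txt
aws s3 rm s3://ourbucketname/iam-probe.txt
```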

Two things have changed:

(1) AWS now requires Signature Version 4 (SigV4) request signing to be enabled for S3 access. Not sure when AWS started enforcing the upgrade to the newer signing protocol, however.

This required doing the following (as per https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingAWSSDK.html):

aws configure set s3.signature_version s3v4
ES_JAVA_OPTS=-Dcom.amazonaws.services.s3.enableV4

[The latter was added to the JVM options in the Docker settings -- after the -Xms/-Xmx values.]
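Concretely, the Docker side of that change looks roughly like this (image tag matches our ES version; heap sizes are illustrative):

```shell
# Pass the SigV4 system property alongside the heap settings via ES_JAVA_OPTS:
docker run -d \
  -e ES_JAVA_OPTS="-Xms8g -Xmx8g -Dcom.amazonaws.services.s3.enableV4" \
  docker.elastic.co/elasticsearch/elasticsearch:5.6.4
```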

And (2) we up-sized the EC2 instances (doing a rolling restart). The snapshotting failures began, however, before we did this.

We even created a fresh bucket with identical IAM access permissions (by adding it to the IAM policy above). This had no effect. Turning server-side default encryption on and off (and switching between encryption types) had no effect either.

We have not changed the version of the repository-s3 plug-in, as it is baked into the Docker image.

We have a dev/QA clone of this cluster backing up to a distinct bucket, using the identical IAM policy (this bucket is in fact mentioned in the self-same IAM policy -- I just left it out for brevity). There have been no issues with these snapshotting runs.

We are at our wits' end. Please advise.


(Mike Hoskins) #2

Bump +1

I'm seeing the same issue. I even tried granting s3:* in both "Action" clauses, but nothing changed. It was working, then suddenly stopped.

Any way to fix?


(Mike Hoskins) #3

Any word on this?


(Dennis N. Mehay) #4

No, none yet...as you can see!


(Mike Hoskins) #5

Hmm. Anyone?


(Mike Hoskins) #6

It looks like I found a solution.

After trying local-storage snapshots instead of S3, I found that I had to make sure all the nodes in the cluster had permission to write to the shared local storage (via NFS). Until all of the nodes could write to the same place, I couldn't perform even local-storage snapshots.
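For anyone repeating this experiment, the local-storage route means registering a shared-filesystem repository; a minimal sketch (mount path and repository name are illustrative):

```shell
# Every node's elasticsearch.yml must whitelist the same shared mount:
#   path.repo: ["/mnt/es-snapshots"]
# ...and the elasticsearch user on every node must be able to write there.

curl -X PUT "http://localhost:9200/_snapshot/local_backup" -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es-snapshots" }
}'
```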

So, going back to the repository-s3 snapshots, I found that the cluster members did not all have the same permissions. I got it working by making sure every cluster member had identical permissions - face palm! :slight_smile:

It would be nice if the docs for all the snapshot solutions reminded people (in bold letters) that every node in the cluster needs the same permissions to write to the repository. :slight_smile:
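One way to catch this class of problem up front: registering a repository only checks the master, but the verify API makes every data node write a test file to the repository, so a node with missing credentials or permissions shows up immediately in the response:

```shell
# Each node attempts a test write; nodes that fail are reported by name:
curl -X POST "http://localhost:9200/_snapshot/ourbucketname/_verify?pretty"
```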


(system) #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.