Snapshot problems with Amazon S3

baltendo · December 14, 2017, 12:22pm

Hi!

Elasticsearch: 5.4.2
Cloud: AWS
OS: Amazon Linux

Since the 17th of November, we have been getting failures when our daily backup snapshots are created:

"failures": [
        {
          "index": "user-reviews",
          "index_uuid": "user-reviews",
          "shard_id": 1,
          "reason": "IndexShardSnapshotFailedException[com.amazonaws.AmazonClientException: Unable to execute HTTP request: connect timed out]; nested: AmazonClientException[Unable to execute HTTP request: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
          "node_id": "Ao5nMnDNSrmNarISSowGgA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        ...
]

The indices that throw these failures change from run to run, so we don't see any pattern there.

We then tried to create a new snapshot repository on S3 but we got this error:

{
  "error": {
    "root_cause": [
      {
        "type": "repository_verification_exception",
        "reason": "[repo-3] path  is not accessible on master node"
      }
    ],
    "type": "repository_verification_exception",
    "reason": "[repo-3] path  is not accessible on master node",
    "caused_by": {
      "type": "i_o_exception",
      "reason": "Unable to upload object tests-2lCYsX97Q8-qz15qPdeS0Q/master.dat-temp",
      "caused_by": {
        "type": "amazon_s3_exception",
        "reason": "amazon_s3_exception: The request signature we calculated does not match the signature you provided. Check your key and signing method. (Service: Amazon S3; Status Code: 403; Error Code: SignatureDoesNotMatch; Request ID: 4398DBFF11FFA4E7)"
      }
    }
  },
  "status": 500
}

This leads us to the assumption that something has been changed related to Amazon S3.

We saw that the AWS SDK was upgraded in version 5.6.5. Could this be a solution to our problem?

Kind Regards,
Bernhard

mujtabahussain · December 15, 2017, 5:31am

Hey!

Can you PUT a file on the S3 bucket from the CLI from any of the nodes?
Have you double checked what IAM permissions you have given to the nodes to be able to access S3?

Also, this may not be related, but the last time I had this error, it was due to NTP issues.

But confirm the first two things first.

baltendo · December 15, 2017, 8:41am

Hi!

I tested uploading a file with AWS CLI 1.11.83 and 1.10.67 and it is working fine.
I put the file in the same bucket as Elasticsearch should put the snapshots, which means the IAM permissions are fine as well.

So I guess the first two points are confirmed?

Kind Regards,
Bernhard

dadoonet · December 15, 2017, 8:54am

Can you also remove the file you uploaded from the CLI?

baltendo · December 15, 2017, 9:09am

Hi!

Removing the file using the CLI also works.

I just tried again to create a snapshot repository, this time from a single-node test cluster.
In scenario 1 the cluster is on 5.4.2 like our production cluster.
In scenario 2 the cluster is on 5.6.5 with upgraded AWS libraries.
Here are the errors:

// 5.4.2
{
	"error": {
		"root_cause": [{
			"type": "repository_verification_exception",
			"reason": "[cluster] path  is not accessible on master node"
		}],
		"type": "repository_verification_exception",
		"reason": "[cluster] path  is not accessible on master node",
		"caused_by": {
			"type": "i_o_exception",
			"reason": "Unable to upload object tests-WYj-gzdiQTqWXnPwxeNllQ/master.dat-temp",
			"caused_by": {
				"type": "amazon_s3_exception",
				"reason": "The request signature we calculated does not match the signature you provided. Check your key and signing method. (Service: Amazon S3; Status Code: 403; Error Code: SignatureDoesNotMatch; Request ID: 1BDDD42316F78C63)"
			}
		}
	},
	"status": 500
}

// 5.6.5
{
	"error": {
		"root_cause": [{
			"type": "repository_exception",
			"reason": "[cluster] failed to create repository"
		}],
		"type": "repository_exception",
		"reason": "[cluster] failed to create repository",
		"caused_by": {
			"type": "amazon_s3_exception",
			"reason": "Method Not Allowed (Service: Amazon S3; Status Code: 405; Error Code: 405 Method Not Allowed; Request ID: 3FAEFF31D2A9A406; S3 Extended Request ID: GXfAJDL54/WicbXdIJcMJI7RSl17eGc6VexJnJpvavlGSS6ByGLOqi1FnhiD1cMn3oxplCkEkcE=)"
		}
	},
	"status": 500
}

Kind Regards,
Bernhard

mujtabahussain · December 17, 2017, 11:55pm

Could you show us the IAM permissions you have allowed the cluster for S3?

baltendo · December 18, 2017, 7:23am

This is our IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::elasticsearch-5-snapshots",
                "arn:aws:s3:::elasticsearch-5-snapshots/*",
                "arn:aws:s3:::elasticsearch-5-archive",
                "arn:aws:s3:::elasticsearch-5-archive/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "EC2:Describe*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:DeleteAlarms",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:PutMetricAlarm",
            "Resource": "*"
        }
    ]
}

mujtabahussain · December 18, 2017, 11:01pm

I highly recommend raising an AWS support issue via the console as well.

baltendo · December 21, 2017, 1:29pm

Hi!

I noticed that one particular node was mostly causing the snapshot failures.
Today I took the node out of the cluster, reinstalled ES 5.4.2 and rebooted the machine.
I started another snapshot and it was successful!

In case this happens again I will try just a reboot first to see if this already solves the issue.
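
For anyone trying to spot a single misbehaving node in the same way, here is a minimal sketch that tallies shard failures per `node_id`. It assumes the snapshot response has already been parsed into a dict with a top-level `failures` list shaped like the excerpt in the first post. The function name `failures_by_node`, the extra index names, and the second node ID are hypothetical and only there for illustration.

```python
from collections import Counter


def failures_by_node(snapshot: dict) -> Counter:
    """Count shard failures per node_id in a parsed snapshot response."""
    counts = Counter()
    for failure in snapshot.get("failures", []):
        # Fall back to "unknown" if a failure entry carries no node_id.
        counts[failure.get("node_id", "unknown")] += 1
    return counts


# Sample data: the first entry mirrors the failure quoted in this thread,
# the other two entries are made up for illustration.
sample = {
    "failures": [
        {"index": "user-reviews", "shard_id": 1,
         "node_id": "Ao5nMnDNSrmNarISSowGgA", "status": "INTERNAL_SERVER_ERROR"},
        {"index": "hypothetical-index-a", "shard_id": 0,
         "node_id": "Ao5nMnDNSrmNarISSowGgA", "status": "INTERNAL_SERVER_ERROR"},
        {"index": "hypothetical-index-b", "shard_id": 2,
         "node_id": "hypothetical-node-2", "status": "INTERNAL_SERVER_ERROR"},
    ]
}

# List nodes from most to fewest failures.
for node_id, count in failures_by_node(sample).most_common():
    print(node_id, count)
```

If one node ID dominates the tally across several snapshot runs, that node is a reasonable first candidate to inspect.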

Thanks for your help!

Kind Regards,
Bernhard

mujtabahussain · December 21, 2017, 11:28pm

This is my suspicion about what was happening.

AWS SDKs sign every request sent from one resource to another, so that the destination can verify that the request comes from a valid resource. One of the inputs to the signature is the time at which the request was made. This is where this comes in:

taken from here.

So if you follow the setup in this document, hopefully you can at least rule out this issue.

This issue might also explain why it was happening only on one instance.

Again, this is my suspicion. If this issue re-emerges, try this first before rebooting and let us know.
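
To illustrate why the request time matters, here is a rough sketch of how a timestamp feeds into a signature, assuming AWS Signature Version 4. Whether the SDK bundled with Elasticsearch 5.4.2 signs S3 requests exactly this way is an assumption. The secret key, region, and canonical request below are placeholders, and a real canonical request is built from the HTTP method, path, headers, and payload hash.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signature(secret_key: str, amz_date: str, region: str,
                    service: str, canonical_request: str) -> str:
    """Return a hex signature; amz_date looks like '20171221T120000Z'."""
    date_stamp = amz_date[:8]
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    # The full timestamp is part of the string that gets signed.
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    # The signing key is derived from the date, region, and service.
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    k_signing = _hmac_sha256(k_service, "aws4_request")
    return hmac.new(k_signing, string_to_sign.encode("utf-8"),
                    hashlib.sha256).hexdigest()


# Placeholder inputs, not real credentials or a real request.
SECRET = "hypothetical-secret-key"
REQUEST = "PUT\n/placeholder-object\n\nhost:example-bucket\n\nhost\nUNSIGNED-PAYLOAD"

sig_a = sigv4_signature(SECRET, "20171221T120000Z", "us-east-1", "s3", REQUEST)
sig_b = sigv4_signature(SECRET, "20171221T120500Z", "us-east-1", "s3", REQUEST)
print(sig_a != sig_b)  # the same request signed five minutes apart differs
```

This only shows that the timestamp is an input to the signature. It does not by itself prove that clock drift caused the errors in this thread.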

Best of luck.

baltendo · January 10, 2018, 8:03am

The issue has appeared again, and I wanted to follow your advice, but the links in your last answer return a 404.

mujtabahussain · January 11, 2018, 11:52pm

Yeah! Those pages from AWS Support Docs seem to have disappeared.

Google "system clock drift AWS" and you are on your way.
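
As a quick first check before digging into NTP configuration, here is a small sketch that estimates clock drift by comparing the local clock with the `Date` header of an HTTP response. The function names are hypothetical, and the approach is only a rough estimate: the header has one-second resolution and network latency is ignored, so it is only useful for spotting drift of many seconds or minutes.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
import urllib.error
import urllib.request


def skew_seconds(date_header: str, local_now: datetime) -> float:
    """Local clock minus the remote clock read from an HTTP Date header."""
    remote = parsedate_to_datetime(date_header)
    return (local_now - remote).total_seconds()


def fetch_date_header(url: str, timeout: float = 5.0) -> Optional[str]:
    """Return the Date header of a HEAD response, even for error statuses."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get("Date")
    except urllib.error.HTTPError as err:
        # Error responses (403, 405, ...) still carry a Date header.
        return err.headers.get("Date")


# Offline example with fixed values: the local clock is 30 seconds ahead.
local = datetime(2017, 12, 21, 12, 0, 30, tzinfo=timezone.utc)
print(skew_seconds("Thu, 21 Dec 2017 12:00:00 GMT", local))
```

On a node, you would pass the result of `fetch_date_header(...)` for an endpoint you trust together with `datetime.now(timezone.utc)`. Running it on every node and comparing the results should show whether one instance stands out.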

system · February 8, 2018, 11:52pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
