Snapshot problems with Amazon S3


#1

Hi!

Elasticsearch: 5.4.2
Cloud: AWS
OS: Amazon Linux

Since the 17th of November we have been getting failures when our daily backup snapshots are created:

"failures": [
        {
          "index": "user-reviews",
          "index_uuid": "user-reviews",
          "shard_id": 1,
          "reason": "IndexShardSnapshotFailedException[com.amazonaws.AmazonClientException: Unable to execute HTTP request: connect timed out]; nested: AmazonClientException[Unable to execute HTTP request: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
          "node_id": "Ao5nMnDNSrmNarISSowGgA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        ...
]

The indices that throw these failures vary from run to run, so we don't see any pattern there.

We then tried to create a new snapshot repository on S3 but we got this error:

{
  "error": {
    "root_cause": [
      {
        "type": "repository_verification_exception",
        "reason": "[repo-3] path  is not accessible on master node"
      }
    ],
    "type": "repository_verification_exception",
    "reason": "[repo-3] path  is not accessible on master node",
    "caused_by": {
      "type": "i_o_exception",
      "reason": "Unable to upload object tests-2lCYsX97Q8-qz15qPdeS0Q/master.dat-temp",
      "caused_by": {
        "type": "amazon_s3_exception",
        "reason": "amazon_s3_exception: The request signature we calculated does not match the signature you provided. Check your key and signing method. (Service: Amazon S3; Status Code: 403; Error Code: SignatureDoesNotMatch; Request ID: 4398DBFF11FFA4E7)"
      }
    }
  },
  "status": 500
}

This leads us to assume that something related to Amazon S3 has changed.

We saw that the AWS SDK was upgraded in version 5.6.5. Could this be a solution to our problem?

Kind Regards,
Bernhard


(Mujtaba Hussain) #2

Hey!

  • Can you PUT a file on the S3 bucket from the CLI from any of the nodes?
  • Have you double checked what IAM permissions you have given to the nodes to be able to access S3?

Also, this may not be related, but the last time I had this error it was due to NTP issues. :frowning:

But confirm the first two things :slight_smile:


#3

Hi!

I tested uploading a file with AWS CLI 1.11.83 and 1.10.67 and it is working fine.
I put the file in the same bucket as Elasticsearch should put the snapshots, which means the IAM permissions are fine as well.

So I guess the first two points are confirmed?

Kind Regards,
Bernhard


(David Pilato) #4

Can you also remove the file you uploaded from the CLI?


#5

Hi!

Removing the file using the CLI also works.

I just tried again to create a snapshot repository, this time from a test cluster with a single node.
In scenario 1 the cluster is on 5.4.2 like our production cluster.
In scenario 2 the cluster is on 5.6.5 with upgraded AWS libraries.
Here are the errors:

// 5.4.2
{
	"error": {
		"root_cause": [{
			"type": "repository_verification_exception",
			"reason": "[cluster] path  is not accessible on master node"
		}],
		"type": "repository_verification_exception",
		"reason": "[cluster] path  is not accessible on master node",
		"caused_by": {
			"type": "i_o_exception",
			"reason": "Unable to upload object tests-WYj-gzdiQTqWXnPwxeNllQ/master.dat-temp",
			"caused_by": {
				"type": "amazon_s3_exception",
				"reason": "The request signature we calculated does not match the signature you provided. Check your key and signing method. (Service: Amazon S3; Status Code: 403; Error Code: SignatureDoesNotMatch; Request ID: 1BDDD42316F78C63)"
			}
		}
	},
	"status": 500
}
// 5.6.5
{
	"error": {
		"root_cause": [{
			"type": "repository_exception",
			"reason": "[cluster] failed to create repository"
		}],
		"type": "repository_exception",
		"reason": "[cluster] failed to create repository",
		"caused_by": {
			"type": "amazon_s3_exception",
			"reason": "Method Not Allowed (Service: Amazon S3; Status Code: 405; Error Code: 405 Method Not Allowed; Request ID: 3FAEFF31D2A9A406; S3 Extended Request ID: GXfAJDL54/WicbXdIJcMJI7RSl17eGc6VexJnJpvavlGSS6ByGLOqi1FnhiD1cMn3oxplCkEkcE=)"
		}
	},
	"status": 500
}

Kind Regards,
Bernhard


(Mujtaba Hussain) #6

Could you show us the IAM permissions you have allowed the cluster for S3?


#7

This is our IAM policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::elasticsearch-5-snapshots",
                "arn:aws:s3:::elasticsearch-5-snapshots/*",
                "arn:aws:s3:::elasticsearch-5-archive",
                "arn:aws:s3:::elasticsearch-5-archive/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "EC2:Describe*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:DeleteAlarms",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:PutMetricAlarm",
            "Resource": "*"
        }
    ]
}

(Mujtaba Hussain) #8

I highly recommend raising an AWS support issue as well via the console.


#9

Hi!

I noticed that one particular node was mostly causing the snapshot failures.
Today I took the node out of the cluster, reinstalled ES 5.4.2 and rebooted the machine.
I started another snapshot and it was successful!

In case this happens again I will try just a reboot first to see if this already solves the issue.
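
For anyone who wants to check whether failures cluster on one node, here is a minimal sketch of the tally. It assumes you have the snapshot response parsed as a dict with a `failures` list shaped like the one in the first post. The function name and the sample node IDs are made up for illustration.

```python
from collections import Counter


def failures_by_node(snapshot_response):
    """Count shard snapshot failures per node_id.

    Assumes `snapshot_response` is a dict holding a "failures" list whose
    entries carry a "node_id" field, as in the snippet from the first post.
    """
    failures = snapshot_response.get("failures", [])
    return Counter(f.get("node_id", "unknown") for f in failures)


# Hypothetical sample data; the node IDs are placeholders, not real ones.
sample = {
    "failures": [
        {"index": "a", "shard_id": 0, "node_id": "node-1"},
        {"index": "b", "shard_id": 1, "node_id": "node-1"},
        {"index": "c", "shard_id": 2, "node_id": "node-2"},
    ]
}

counts = failures_by_node(sample)
for node_id, count in counts.most_common():
    print(node_id, count)
```

If one node ID dominates the output, that node is a reasonable first suspect.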

Thanks for your help!

Kind Regards,
Bernhard


(Mujtaba Hussain) #10

This is my suspicion about what was happening.

AWS SDKs sign any request that goes from one resource to another, so the destination can verify that the request comes from a valid resource. One of the things they use to sign the request is the time at which the request was made. This is where this comes in:

taken from here.

So if you follow the setup in this document, hopefully you can at least rule out this issue.

This issue might also explain why it was happening only on one instance.

Again, this is my suspicion. If this issue re-emerges, try this first before rebooting and let us know. :slight_smile:
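
A rough way to test the clock suspicion on a node is to compare its local time with the `Date` header of any HTTP response from S3. The sketch below is only an illustration. The helper names are invented, the 15-minute limit is my assumption based on the tolerance AWS documents for request timestamps, and `fetch_date_header` is an untested convenience that needs network access.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import urllib.error
import urllib.request

# Assumed tolerance: AWS documents roughly 15 minutes for request timestamps.
MAX_SKEW_SECONDS = 15 * 60


def clock_skew_seconds(date_header, local_now):
    """Return local time minus server time, in seconds.

    `date_header` is an HTTP Date header value; `local_now` must be a
    timezone-aware datetime.
    """
    server_time = parsedate_to_datetime(date_header)
    if server_time.tzinfo is None:
        # Treat a header without zone information as UTC.
        server_time = server_time.replace(tzinfo=timezone.utc)
    return (local_now - server_time).total_seconds()


def is_skew_acceptable(skew_seconds, limit=MAX_SKEW_SECONDS):
    """True if the absolute skew is within the assumed tolerance."""
    return abs(skew_seconds) <= limit


def fetch_date_header(url="https://s3.amazonaws.com"):
    """Hypothetical helper: read the Date header from an HTTP response.

    Even an error response (for example 403) normally carries a Date header.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.headers["Date"]
    except urllib.error.HTTPError as err:
        return err.headers["Date"]


# Offline example: the local clock is 20 minutes ahead of the server.
skew = clock_skew_seconds(
    "Mon, 01 Jan 2024 12:00:00 GMT",
    datetime(2024, 1, 1, 12, 20, 0, tzinfo=timezone.utc),
)
print(skew, is_skew_acceptable(skew))
```

On a real node you would call `clock_skew_seconds(fetch_date_header(), datetime.now(timezone.utc))`. A large value points at NTP rather than at Elasticsearch.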

Best of luck.


#11

The issue appeared again and I wanted to follow your advice, but the links in your last answer are returning a 404.


(Mujtaba Hussain) #12

Yeah! Those pages from AWS Support Docs seem to have disappeared. :frowning:

Google "system clock drift AWS" and you are on your way :slight_smile:

