The indices that throw those failures keep changing, so we don't see any pattern there.
We then tried to create a new snapshot repository on S3, but we got this error:
{
  "error": {
    "root_cause": [
      {
        "type": "repository_verification_exception",
        "reason": "[repo-3] path is not accessible on master node"
      }
    ],
    "type": "repository_verification_exception",
    "reason": "[repo-3] path is not accessible on master node",
    "caused_by": {
      "type": "i_o_exception",
      "reason": "Unable to upload object tests-2lCYsX97Q8-qz15qPdeS0Q/master.dat-temp",
      "caused_by": {
        "type": "amazon_s3_exception",
        "reason": "amazon_s3_exception: The request signature we calculated does not match the signature you provided. Check your key and signing method. (Service: Amazon S3; Status Code: 403; Error Code: SignatureDoesNotMatch; Request ID: 4398DBFF11FFA4E7)"
      }
    }
  },
  "status": 500
}
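For context, registering an S3 repository that produces this kind of error looks roughly like the following; the bucket and base_path shown here are placeholders, not our actual settings:

curl -XPUT 'localhost:9200/_snapshot/repo-3?pretty' -H 'Content-Type: application/json' -d '
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "base_path": "elasticsearch/snapshots"
  }
}'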
This leads us to the assumption that something related to Amazon S3 has changed.
We saw that the AWS SDK was upgraded in version 5.6.5; could this be a solution to our problem?
I tested uploading a file with AWS CLI 1.11.83 and 1.10.67, and it works fine.
I put the file into the same bucket that Elasticsearch should write the snapshots to, which means the IAM permissions are fine as well.
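The manual check was essentially the following, with a placeholder bucket name:

# Upload a small test file and confirm it landed in the bucket
aws s3 cp ./signature-test.txt s3://my-snapshot-bucket/signature-test.txt
aws s3 ls s3://my-snapshot-bucket/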
I just tried again to create a snapshot repository, this time from a single-node test cluster.
In scenario 1 the cluster runs 5.4.2, like our production cluster.
In scenario 2 the cluster runs 5.6.5 with the upgraded AWS libraries.
Here are the errors:
// 5.4.2
{
  "error": {
    "root_cause": [
      {
        "type": "repository_verification_exception",
        "reason": "[cluster] path is not accessible on master node"
      }
    ],
    "type": "repository_verification_exception",
    "reason": "[cluster] path is not accessible on master node",
    "caused_by": {
      "type": "i_o_exception",
      "reason": "Unable to upload object tests-WYj-gzdiQTqWXnPwxeNllQ/master.dat-temp",
      "caused_by": {
        "type": "amazon_s3_exception",
        "reason": "The request signature we calculated does not match the signature you provided. Check your key and signing method. (Service: Amazon S3; Status Code: 403; Error Code: SignatureDoesNotMatch; Request ID: 1BDDD42316F78C63)"
      }
    }
  },
  "status": 500
}
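The verification step that fails here can also be triggered on its own, which makes it easier to retest without re-registering the repository; assuming the repository is named cluster as in the error above:

curl -XPOST 'localhost:9200/_snapshot/cluster/_verify?pretty'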
I noticed that one particular node was causing most of the snapshot failures.
Today I took the node out of the cluster, reinstalled ES 5.4.2 and rebooted the machine.
I started another snapshot and it was successful!
If this happens again, I will try just a reboot first to see whether that alone solves the issue.
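For reference, the snapshot after the reboot was started and checked roughly like this; repository and snapshot names are placeholders:

# Start the snapshot without blocking, then poll its per-shard status
curl -XPUT 'localhost:9200/_snapshot/repo-3/snapshot_after_reboot?wait_for_completion=false&pretty'
curl -XGET 'localhost:9200/_snapshot/repo-3/snapshot_after_reboot/_status?pretty'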
AWS SDKs sign every request that goes from one resource to another, so the destination can verify that the request comes from a valid source. One of the inputs used to compute the signature is the time at which the request was made. This is where the reboot comes in: it brought the node's clock back in sync, so the timestamps in the signed requests were valid again and S3 stopped rejecting them.
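If clock drift is the culprit, a quick check on the affected node is to compare its UTC time against the Date header S3 returns and to confirm NTP is synchronised; these are generic commands, adjust them to your distribution:

# Node's local clock in UTC
date -u
# Time as seen by S3, taken from the HTTP response headers
curl -sI https://s3.amazonaws.com | grep -i '^date'
# NTP synchronisation status on systemd-based hosts
timedatectl status | grep -i synchronized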