Curator: Failed to verify all nodes have repository access


(Pierre) #1

Hello,

I'm having a few issues when trying to create snapshots using curator.

When I run POST http://localhost:9200/_snapshot/curator-production-dashboards/_verify?pretty, I get a list of nodes that are failing verification, with the following contents:

"RemoteTransportException[[master-production-all-001][inet[/x.x.x.x:9300]][cluster:admin/repository/verify]]; nested: RepositoryVerificationException[[curator-production-dashboards] [_QW8Wh3xRd6eYLJKSVZmJQ, 'RemoteTransportException[[node-in-account-2][inet[/x.x.x.x:9300]][internal:admin/repository/verify]]; nested: RepositoryVerificationException[[curator-production-dashboards] a file written by master to the store [eu-west-1/bucket-account-1] cannot be accessed on the node [[node-in-account-2][_QW8Wh3xRd6eYLJKSVZmJQ][ip-x-x-x-x][inet[/x.x.x.x:9300]]{master=false}]. This might indicate that the store [eu-west-1/bucket-account-1] is not shared between this node and the master node or that permissions on the store don't allow reading files written by the master node]"

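When the verify response lists several nodes, it can help to pull out just the node names and IDs. The following is a minimal sketch that assumes the error text keeps the "cannot be accessed on the node [[name][id]..." wording quoted above; the function name `failing_nodes` is made up for illustration.

```python
import re

# Matches the "[[node-name][node-id]" pair that follows the phrase used in the
# RepositoryVerificationException text. The wording is taken from the quoted
# error; other Elasticsearch versions may phrase it differently.
_NODE_RE = re.compile(r"cannot be accessed on the node \[\[([^\]]+)\]\[([^\]]+)\]")


def failing_nodes(error_text):
    """Return a sorted list of unique (node_name, node_id) pairs named in the error."""
    return sorted(set(_NODE_RE.findall(error_text)))
```

Feeding it the whole response body as a string should be enough, since it only looks for that one phrase.
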
A quick rundown of our setup:

  • We're in the process of migrating to a new AWS account and have Elasticsearch instances running in 2 different AWS accounts
  • Curator is trying to create snapshots in an S3 bucket in our original AWS account
  • I've set up a bucket policy and granted the necessary permissions to the new account. I know this is working because I can access these buckets from the instances in the new account using the same user our curator jobs are running as
    Permissions on the bucket

"Action": [
"s3:ListBucket",
"s3:Get*"
],

Permissions on the objects in the bucket
"Action": [
"s3:AbortMultipartUpload",
"s3:DeleteObject",
"s3:Get*",
"s3:ListMultipartUploadParts",
"s3:PutObject"
],

  • The Elasticsearch instances in the new account are running the same version of curator (3.0.3) but a more recent version of Elasticsearch (1.7.1); the original AWS account is running Elasticsearch 1.6.0
  • I've set verify to false but it doesn't look like indices from the new account could be uploaded to S3
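
For reference, the bucket-level and object-level actions listed above could be combined into one cross-account bucket policy roughly like this. It is only a sketch: the principal ARN and the `Sid` values are placeholders, and the bucket name is taken from the error message. The real policy in use is not shown in this thread.

```python
import json


def build_bucket_policy(bucket, principal_arn):
    """Build a policy dict: bucket-level actions on the bucket ARN,
    object-level actions on every key under it."""
    bucket_arn = "arn:aws:s3:::" + bucket
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "BucketLevel",  # placeholder statement id
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:ListBucket", "s3:Get*"],
                "Resource": bucket_arn,
            },
            {
                "Sid": "ObjectLevel",  # placeholder statement id
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": [
                    "s3:AbortMultipartUpload",
                    "s3:DeleteObject",
                    "s3:Get*",
                    "s3:ListMultipartUploadParts",
                    "s3:PutObject",
                ],
                "Resource": bucket_arn + "/*",
            },
        ],
    }


if __name__ == "__main__":
    # 111111111111 is a made-up account id standing in for the new account.
    policy = build_bucket_policy("bucket-account-1", "arn:aws:iam::111111111111:root")
    print(json.dumps(policy, indent=2))
```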

Questions:

  • Is the different version of Elasticsearch causing verification to fail?
  • Could there be a delay while verifying the file that the master is writing to S3 from the nodes in the new account that is causing the failure?
  • What file is the master writing to the bucket that the nodes are failing to read? I can't see anything in the bucket.

Kind Regards,
Pierre


(Aaron Mildenstein) #2

Hi! Sorry to hear you're having a rough time with Curator. It is possible that slight delays in AWS are causing you some grief. In those cases you can use the --skip-repo-validation flag, which was added expressly for these times.

As noted in the linked documentation, though, you will have to update Curator to the latest version to get that feature (you're on 3.0.3 and Curator is at 3.3.0, currently).


(Pierre) #3

Hello Aaron,

Thanks for your suggestion, I've just upgraded curator to 3.3.0 on the master node I'm running the backup from and used the --skip-repo-validation flag.

The job failed with the following message which I guess indicates that only the nodes in the same account as the S3 bucket managed to upload their shards.

Snapshot PARTIAL completed with state: PARTIAL
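
One way to check that guess is to look at the snapshot details returned by GET /_snapshot/<repository>/<snapshot> and count the failed shards per node. The sketch below assumes the response contains a "snapshots" list whose entries carry a "failures" array with a "node_id" field; the function name is made up, and the field names should be confirmed against the actual response.

```python
from collections import Counter


def failures_by_node(response):
    """Count failed shards per node_id across all snapshots in the response."""
    counts = Counter()
    for snapshot in response.get("snapshots", []):
        for failure in snapshot.get("failures", []):
            # Fall back to "unknown" if a failure entry has no node_id.
            counts[failure.get("node_id", "unknown")] += 1
    return dict(counts)
```

If every failure points at node IDs in the new account, that would support the idea that only those nodes cannot write to the bucket.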

I've tried to make the S3 bucket public but am still unable to get the nodes in the new AWS account to work.

Do I need to upgrade curator on all Elasticsearch nodes?

Kind Regards,
Pierre


(Pierre) #4

Hi again,

What file does the master node write to the bucket when doing the verify? I just want to ensure I can see it from the nodes in the new account.

Kind Regards,
Pierre


(Aaron Mildenstein) #5

Ah. Then it wasn't a case of AWS network timing causing a false positive. Curator was accurately catching the exact scenario that code was put in for.

Repository Verification is described in the official Elasticsearch documentation. Curator does not copy a file, but rather just uses the API provided by Elasticsearch. I do not know what file Elasticsearch sends or creates to test that all nodes have access to the shared filesystem. Clearly your shared filesystem is not fully shared by each member node. You'll have to do some tests of your own.

This could be as simple as using su to become the Elasticsearch user (or whatever user is running Elasticsearch on your nodes) and creating/touching a file in all paths set up in your path.repo. Each node should be able to see the file, and create new files. If not, repository verification will fail. In this instance, Curator's default use of repository verification is your touchstone, guaranteeing all nodes are fully ready before attempting to take a snapshot.
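
For a shared-filesystem repository, the manual test described above can be scripted. This is a rough sketch with made-up function names and a made-up probe file name: run `write_probe` as the Elasticsearch user on one node, then `read_probe` as the same user on every other node. For an S3 repository the equivalent check would be writing and reading an object with each node's credentials, which this sketch does not cover.

```python
import os


def write_probe(repo_path, name="repo-probe.txt", content="written-by-first-node"):
    """Create a small file under repo_path and return its full path."""
    probe = os.path.join(repo_path, name)
    with open(probe, "w") as fh:
        fh.write(content)
    return probe


def read_probe(repo_path, name="repo-probe.txt", expected="written-by-first-node"):
    """Return True if the probe file is visible here and has the expected content."""
    probe = os.path.join(repo_path, name)
    try:
        with open(probe) as fh:
            return fh.read() == expected
    except OSError:
        # Missing file or permission problem: the path is not usable from this node.
        return False
```

A node where `read_probe` returns False is a node on which repository verification would be expected to fail.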

No. Curator only needs to exist in one place. That place must have client access to a node in the cluster. It doesn't have to be the elected master, or even a master node at all. The only exception to this is if you're using a distributed approach and are putting Curator on all nodes, but only having it run on the elected master with the --master-only flag. In such a case, then that's where you'd want the same version installed everywhere.


(Pierre) #6

Hello,

Still not having much luck, I'm sure it's a config error but am unable to work out what's going on.

I'm also sure it's an issue with the nodes in the new account. I created a repository in the same account they're running in and got the same message saying that they were unable to read the bucket.

I noticed that there was a difference in time, which I rectified by pointing all nodes at the same NTP server. I've also verified that all nodes have the correct plugins installed by running curl localhost:9200/_cat/plugins?v
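
Since that command lists the plugin name but also its version, it may be worth comparing versions per node and not just presence. Here is a small sketch that turns the text output into a per-node mapping. It assumes the ?v header includes columns called name, component and version, and that node names contain no spaces; the function name is made up.

```python
def plugins_by_node(cat_output):
    """Parse `_cat/plugins?v` text into {node_name: {plugin_component: version}}."""
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    result = {}
    for line in lines[1:]:
        # Pair each whitespace-separated value with its header column.
        row = dict(zip(header, line.split()))
        result.setdefault(row["name"], {})[row["component"]] = row["version"]
    return result
```

Printing the result for the whole cluster makes it easier to spot a node whose plugin version does not match the Elasticsearch version it runs.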

2 more questions:

When I try to access the buckets I'm using the AWS CLI. Is there a way I can test this access using the cloud-aws plugin?

The error message mentions an admin/repository, "cluster:admin/repository/verify". Is this different to the actual repository I'm trying to create?

Kind Regards,
Pierre


(Aaron Mildenstein) #7

Elasticsearch, internally, does the verify step at repository creation (as mentioned in my previous response). This implies that not all nodes have read/write access to the bucket. I don't know what else to say. The manual testing instructions I previously mentioned (creating a file as the Elasticsearch user on each node) are one of the best tests for Elasticsearch and repository availability I can think of.


(Pierre) #8

Hi,

I've solved the issue.

Different versions of elasticsearch require different versions of the cloud-aws plugin. We'd upgraded the version of elasticsearch in the new account to 1.7 but we were still using the old version of the cloud-aws plugin.
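
A check along these lines might catch the same mistake earlier. It is a heuristic sketch, not an official compatibility test: it flags any plugin version that is installed on nodes running different Elasticsearch minor releases. The function name and the input shape are made up, and the authoritative version pairing is whatever the plugin's own documentation states.

```python
from collections import defaultdict


def shared_plugin_versions(nodes):
    """nodes maps node name -> (es_version, plugin_version).

    Return {plugin_version: [es_minor, ...]} for every plugin version that is
    used with more than one Elasticsearch minor release.
    """
    minors = defaultdict(set)
    for es_version, plugin_version in nodes.values():
        minor = ".".join(es_version.split(".")[:2])  # "1.7.1" -> "1.7"
        minors[plugin_version].add(minor)
    return {p: sorted(m) for p, m in minors.items() if len(m) > 1}
```

With one node on 1.6.0 and another on 1.7.1 both reporting the same plugin version, the function returns that version together with ["1.6", "1.7"], which is the situation described here.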

Kind Regards,
Pierre

