Snapshot restore from an S3 bucket in a different region not working

We have AWS instances set up in different regions (eu-central-1, ap-south-1, etc.). Each instance runs its own single-node ES cluster. For each region there is an S3 bucket to which snapshots from that instance are written daily. We use Curator to manage the ES snapshot and restore process.

Until recently, for analytical purposes, we could restore indices from another region's snapshot onto a spot instance running in a different region, but this stopped working. If the spot instance is set up in the same region as the S3 bucket, the restore completes. For example, the snapshot is in an S3 bucket in eu-central-1, and we are trying to restore it on an ES instance in ap-south-1. I have tried the repository settings both with and without the endpoint setting; neither works now.
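To illustrate, the repository on the restore-side node is registered against the remote region's bucket roughly like this (a sketch only; the repository and bucket names below are placeholders):

# sketch only: placeholder repository and bucket names
curl -XPUT 'localhost:9200/_snapshot/example_remote_repo?pretty' -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "example-bucket-eu-central-1",
    "endpoint": "s3.eu-central-1.amazonaws.com"
  }
}'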

Attached is the trace-enabled log file from an attempt where the restore does not work. Any help with this is appreciated. Do let me know if additional logs or settings are needed.

We are using Elasticsearch version 6.5.1.

[2019-09-17T05:43:23,891][DEBUG][o.a.h.i.c.PoolingHttpClientConnectionManager] [tkHi76j] Connection released: [id: 8][route: {s}->https://sl-de-es5-biz.s3.eu-central-1.amazonaws.com:443][total kept alive: 1; route allocated: 1 of 50; total allocated: 1 of 50]
[2019-09-17T05:43:23,896][DEBUG][o.e.c.s.MasterService    ] [tkHi76j] processing [restore_snapshot[curator-20190131013004]]: execute
[2019-09-17T05:43:23,904][DEBUG][o.e.c.r.a.a.BalancedShardsAllocator] [tkHi76j] skipping rebalance due to in-flight shard/store fetches
[2019-09-17T05:43:23,905][DEBUG][o.e.c.s.MasterService    ] [tkHi76j] cluster state updated, version [8], source [restore_snapshot[curator-20190131013004]]
[2019-09-17T05:43:23,905][DEBUG][o.e.c.s.MasterService    ] [tkHi76j] publishing cluster state version [8]
[2019-09-17T05:43:23,905][DEBUG][o.e.c.s.ClusterApplierService] [tkHi76j] processing [apply cluster state (from master [master {tkHi76j}{tkHi76jLQ7C-M8qyCHNRcg}{qjO2MKEJRLmLtRYFzWEb_Q}{10.0.0.204}{10.0.0.204:9300}{ml.machine_memory=32151224320, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [8] source [restore_snapshot[curator-20190131013004]]])]: execute
[2019-09-17T05:43:23,905][DEBUG][o.e.c.s.ClusterApplierService] [tkHi76j] cluster state updated, version [8], source [apply cluster state (from master [master {tkHi76j}{tkHi76jLQ7C-M8qyCHNRcg}{qjO2MKEJRLmLtRYFzWEb_Q}{10.0.0.204}{10.0.0.204:9300}{ml.machine_memory=32151224320, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [8] source [restore_snapshot[curator-20190131013004]]])]
[2019-09-17T05:43:23,905][DEBUG][o.e.c.s.ClusterApplierService] [tkHi76j] applying cluster state version 8
[2019-09-17T05:43:23,905][DEBUG][o.e.c.s.ClusterApplierService] [tkHi76j] apply cluster state with version 8
[2019-09-17T05:43:23,929][DEBUG][o.e.c.s.ClusterApplierService] [tkHi76j] set locally applied cluster state to version 8

Can you provide the full configuration that you've used? In particular, how did you configure the endpoint?

Thanks for replying. Here is the repository configuration with which I just tested the restore command again; it still did not work. The endpoint setting is given in the repository configuration.

curl http://localhost:9200/_snapshot/sl_es_s3_repo_mx?pretty
{
  "sl_es_s3_repo_mx" : {
    "type" : "s3",
    "settings" : {
      "bucket" : "sl-mx-es5-biz",
      "chunk_size" : "500mb",
      "endpoint" : "s3.us-west-1.amazonaws.com",
      "region" : "us-west-1",
      "buffer_size" : "250mb"
    }
  }
}
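
The restore itself (normally driven by Curator) boils down to a call along these lines; the snapshot name and index pattern below are placeholders:

# sketch only: snapshot name and index pattern are placeholders
curl -XPOST 'localhost:9200/_snapshot/sl_es_s3_repo_mx/<snapshot-name>/_restore?pretty' -H 'Content-Type: application/json' -d'
{
  "indices": "<index-pattern>",
  "include_global_state": false
}'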

I have also tried setting the endpoint in the elasticsearch.yml file and verified that it is being picked up in the logs:

grep "using end" shortlyst-in-dev02-2019-09-20-322.log
[2019-09-20T11:47:28,094][DEBUG][o.e.r.s.S3Service        ] [tkHi76j] using endpoint [s3.us-west-1.amazonaws.com]
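
The corresponding entry in elasticsearch.yml is roughly the following (assuming the repository uses the default S3 client):

s3.client.default.endpoint: s3.us-west-1.amazonaws.com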

But the restore does not go through.
Please let me know if you would like to look at any further logs and I can share them. I am running with logging enabled as below.

curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger._root":"DEBUG",
    "logger.org.elasticsearch.repositories.s3": "trace",
    "logger.com.amazon": "trace"
  }
}'
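
(These transient logger settings can later be reverted by setting them back to null, along these lines:)

curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger._root": null,
    "logger.org.elasticsearch.repositories.s3": null,
    "logger.com.amazon": null
  }
}'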

Have you applied this configuration correctly on all nodes in the cluster?

Can you share the full logs? In case you don't want to share them publicly, you can e-mail them to "yannick AT elastic DOT co". In particular, I'm interested in the log messages that follow the one where the endpoint is successfully set. There should be a log line where it complains about cross-region access to the bucket.

[2019-09-20T11:47:28,094][DEBUG][o.e.r.s.S3Service        ] [tkHi76j] using endpoint [s3.us-west-1.amazonaws.com]

I am running a single-node cluster. I have sent the full logs to your email. You can see two restore attempts that did not go through, and as you mentioned there is a warning below the endpoint setting in both runs.

[2019-09-20T14:43:29,572][DEBUG][o.e.r.s.S3Service        ] [tkHi76j] using endpoint [s3.amazonaws.com]
[2019-09-20T14:48:40,936][DEBUG][o.e.r.s.S3Service        ] [tkHi76j] using endpoint [s3.us-west-1.amazonaws.com]

I've looked at the logs, but they don't contain any information as to why the restore failed. What error were you getting in response to the _restore request? It looks like the shards are not being restored for some reason; can you run the cluster allocation explain API against some of the indices that failed to restore?
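
For reference, something along these lines; the index name and shard number below are placeholders:

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d'
{
  "index": "<restored-index-name>",
  "shard": 0,
  "primary": true
}'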

Interestingly enough, the restore has been working fine since yesterday. I have tried restoring from several different regions, and it all works now. The warning message below the endpoint setting still appears in the log file, but I guess that has no implication. I am currently unable to reproduce a failing restore, so I cannot run the cluster allocation explain API right now. Could this have been caused by an intermittent AWS issue? I will test this for the next couple of days anyway and update here.
