Recovery from S3 gateway - only one shard recovers?

Hi everyone, I've got a bit of a problem here which is doing my head in
(and has been for the last 24hrs or so). Currently we are running
Elasticsearch 0.19.8 on Amazon EC2 with S3 as a gateway (we're planning on
migrating from S3 in the new year). One index, 3 replicas, 20 shards. We
have two clusters, one for production (with 6 nodes) and one for staging
(with 5 nodes), in different regions.
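For context, the gateway side of our elasticsearch.yml is essentially the following (the bucket name, credentials and region are placeholders):

gateway.type: s3
gateway.s3.bucket: my-es-gateway-bucket    # placeholder bucket name
cloud.aws.access_key: <access key>
cloud.aws.secret_key: <secret key>
cloud.aws.region: <region>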

I wanted to copy the index from our production cluster over to our staging
cluster to do some performance testing on a larger data set, using PIOPS
volumes instead of ephemeral. So I've stopped elasticsearch in staging,
copied across the contents of the index and its metadata on S3 from the
production bucket to the staging bucket, removed the local copy of the data
from each staging node, and started up the staging cluster again. After
about 15 minutes the cluster recovers to a green state, but instead of
recovering 28gb of data, it recovers 1.4gb of data. Looking through
Elasticsearch Head, it shows that shards 0-12 have recovered with around
66-100b of data, shard 13 has 1.4gb of data, and 14-19 have between 56 and
99b of data. The full 28gb of data definitely copied across from production

  • I verified that it was valid in S3 before starting the cluster - yet
    after the cluster starts, the size of the index in the bucket shrinks to
    1.4gb. Thinking I must've stuffed something up, I figured I'd just copy it
    all across again and start over - yet it's done exactly the same thing
    twice now.

This is making very little sense to me and I'm wondering if anyone can
point me in the right direction. I'm at the point where I'm thinking I
should just completely rebuild the cluster from scratch and see if that
makes a difference - in case it's hanging onto old settings from when its
index was a lot smaller - but I'm open to suggestions!

Cheers,
James.

--

Further things I've tried since posting this:

  • created a new bucket for staging
  • created a whole new cluster

Still, only shard number 13 recovers from the gateway! Got me kinda
stumped.

All the files in each of the directories of my index get deleted.


--

> Currently we are running Elasticsearch 0.19.8 on Amazon EC2 with S3 as a
> gateway (we're planning on migrating from S3 in the new year).

I suggest you move off the S3 gateway as soon as possible. It's actually
quite easy to migrate from the S3 gateway to an EBS-backed local gateway;
copying the instructions from an earlier answer
(https://groups.google.com/forum/#!msg/elasticsearch/sWp9XDzNmk8/YFRVSFtWFE4J):

  • Create two new IOPS EBS volumes [1], with enough space to hold your data
  • Launch a new EC2 instance with proper security groups
  • Mount the EBS volume on the new instance [2] at a suitable location such as
    /usr/local/var/elasticsearch/data1
  • Install and configure elasticsearch on the machine, using the same
    cluster name as your original cluster and a local gateway pointed at the
    location where you mounted the EBS volume (see the sketch after this list)
  • Launch elasticsearch on these new instances
  • Increase the number_of_replicas for your indices to four (i.e. so that
    every node holds a copy of each shard), or use the
    index.auto_expand_replicas setting. Your data will now be spread across
    all the nodes: the old ones and the new ones.
  • Use the Paramedic, BigDesk or Head elasticsearch plugins to monitor
    cluster health: once the cluster is "green" and all shards are allocated,
    you can shut down the old, S3-based nodes
  • You have now migrated all data to a new cluster. The best practice at this
    point would be to take a snapshot of your EBS volumes so you have a
    recovery strategy. You can delete the S3 buckets after doing that.
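As a rough sketch, the elasticsearch.yml on one of the new nodes would look
something like this (cluster name and path are illustrative; the replica
change itself is done through the index settings API):

cluster.name: my-cluster                         # same name as the original cluster
gateway.type: local                              # local gateway instead of s3
path.data: /usr/local/var/elasticsearch/data1    # mount point of the new EBS volume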

This strategy also lets you scale when the volume of your data grows but the
computing capacity of your cluster is still sufficient: you can create a new
set of EBS volumes, mount them at a location such as
/usr/local/var/elasticsearch/data2 and point the elasticsearch data.path to
both locations (it is possible to use multiple directories as the data.path),
as in the sketch below.
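For the multi-volume case, the sketch would be (paths illustrative):

path.data:
  - /usr/local/var/elasticsearch/data1
  - /usr/local/var/elasticsearch/data2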

Karel

--

Our staging cluster is in a different AWS region from production - which is
why the S3 gateway appealed - so here's what I'm doing now. Bear in mind that
I've removed the index from the staging cluster altogether, and need to
recover it from the gateway.

  • Created a new EBS volume for each node in the cluster
  • Sync'd the S3 gateway data to the new EBS volume
  • Changed the config to point at the local fs gateway
  • Changed the config to recover after 2 nodes join
  • Verified the integrity of the gateway data on each node
  • Started Elasticsearch on each node
  • Realised I'd made a permissions error on each node, so stopped and
    corrected it
  • Started it up again
  • Watched the logs - got "recovered [1] indices into cluster_state"

And it's created a blank index with the right name, but no data. Each shard
is completely empty. And it's nuked the gateway which I copied down from
S3, on the local storage as well. So having moved to the FS gateway, I'm
now worse off than I was... with the same kind of symptoms.
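For reference, the gateway settings I changed were roughly along these lines
(the location path is illustrative):

gateway.type: fs
gateway.fs.location: /mnt/es-gateway    # where the S3 gateway contents were sync'd onto the EBS volume
gateway.recover_after_nodes: 2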

I'm thinking I'll have to leave this until next year at this point, but it is
something that does need to be dealt with.


--

> Our staging cluster is in a different AWS region from production - which is
> why the S3 gateway appealed.

If you're familiar with Chef, the updated version of the tutorial [1]
explains how to use an EBS-based gateway and EBS snapshots as a recovery
mechanism in an automated way.

AWS recently announced the "EBS snapshot copy" feature [2], so this may be a
nice way for you to shuffle data between production and staging
environments.

Karel

1: Elasticsearch on EC2 tutorial on elasticsearch.org (the link now redirects to elastic.co)
2: "Announcing EBS Snapshot Copy, a Step towards Easier Data Center Migration and Disaster Recovery" (AWS blog)

--

Ok so we are back from our holiday break and need to get some resolution on
this. Whilst we appreciate that your suggestion to move away from S3 is the
correct one moving forward, it doesn't actually help us with this problem.

What if we were to adjust our elasticsearch config to use local gateway,
and copy the contents of our S3 gateway into the local EBS volume on each
existing node, then start the cluster from there?


--

Further to this - in the logs, I'm not getting anything like what's described
in the EC2 guide, where it says it's reading the state from the gateway and
recovering:

[gateway.s3] reading state from gateway [snip] ...
[gateway.s3] read state from gateway [snip], took 97ms

It's just creating a new, blank index with the same name.


--

A bit confused about what you are trying to do, actually. I still don't know
whether you followed the steps I posted or "sync'd the S3 gateway data to the
new EBS volume" via some other means.

All in all, if you're stuck with a broken shard, there's a really big chance
you need to get rid of that shard, preferably by reindexing the data - either
from the original source, or from the broken index into a new one (the Perl
and Ruby client libraries make this simple). If you feel like it, you can
always try to repair the shard.

Karel


--

I am attempting to migrate our production cluster's index to our staging
cluster, by copying the contents of the gateway snapshot on S3 from the
production bucket to the staging bucket.

The documentation, under "Simulating a Total Cluster Failure", says that if
you start a new node it will restore the data from the gateway. This isn't
happening - the new cluster starts up and creates the index we want, but with
no data in it - despite there being 27gb of data in the gateway from a
confirmed gateway snapshot. Moments after this occurs, the S3 gateway gets
overwritten with new files containing zero data. At no point do we see in the
logs that it's recovering the index from S3. We have previously used this
method to restore our data from an S3 gateway snapshot taken before a
split-brain scenario, and were successful in doing so - as well as to migrate
our production cluster data to staging in this manner.

I have tried updating to 0.19.12 to see if this rectifies it, but with no
success. This is the log from the master node - in logging.yml I have set the
gateway, index.gateway and index.shard.recovery loggers to TRACE. Happy to
post the elasticsearch.yml config file if that will help.
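The relevant part of logging.yml is roughly:

logger:
  gateway: TRACE
  index.gateway: TRACE
  index.shard.recovery: TRACE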

[2013-01-13 21:58:10,904][INFO ][node ] [es1-stg]
{0.19.12}[7976]: initialized
[2013-01-13 21:58:10,904][INFO ][node ] [es1-stg]
{0.19.12}[7976]: starting ...
[2013-01-13 21:58:11,229][INFO ][transport ] [es1-stg]
bound_address {inet[/10.139.47.41:9300]}, publish_address
{inet[/10.139.47.41:9300]}
[2013-01-13 21:58:41,291][WARN ][discovery ] [es1-stg]
waited for 30s and no initial state was set by the discovery
[2013-01-13 21:58:41,291][INFO ][discovery ] [es1-stg]
staging-01/jnSS4AjdRaKb01T8fzDSYQ
[2013-01-13 21:58:41,297][INFO ][http ] [es1-stg]
bound_address {inet[/10.139.47.41:9200]}, publish_address
{inet[/10.139.47.41:9200]}
[2013-01-13 21:58:41,298][INFO ][node ] [es1-stg]
{0.19.12}[7976]: started
[2013-01-13 21:58:42,743][INFO ][cluster.service ] [es1-stg]
new_master
[es1-stg][jnSS4AjdRaKb01T8fzDSYQ][inet[/10.139.47.41:9300]]{aws_availability_zone=ap-southeast-1a,
master=true}, reason: zen-disco-join (elected_as_master)
[2013-01-13 21:59:44,184][INFO ][gateway ] [es1-stg]
recovered [1] indices into cluster_state

[2013-01-13 22:12:59,325][INFO ][cluster.service ] [es1-stg] added
{[es3-stg][20snfXdkSbSYflUeJbOk0Q][inet[/10.139.55.190:9300]]{aws_availability_zone=ap-southeast-1a,
master=true},}, reason: zen-disco-receive(join from
node[[es3-stg][20snfXdkSbSYflUeJbOk0Q][inet[/10.139.55.190:9300]]{aws_availability_zone=ap-southeast-1a,
master=true}])
[2013-01-13 22:13:25,128][INFO ][cluster.service ] [es1-stg] added
{[es2-stg][Xh_sWfskSIq9Gxsmv6o9OQ][inet[/10.142.174.77:9300]]{aws_availability_zone=ap-southeast-1b,
master=true},}, reason: zen-disco-receive(join from
node[[es2-stg][Xh_sWfskSIq9Gxsmv6o9OQ][inet[/10.142.174.77:9300]]{aws_availability_zone=ap-southeast-1b,
master=true}])

Note that the "recovered [1] indices into cluster_state" line above doesn't
indicate that it's reading data from the gateway, as the documentation I
linked above says it should. Indeed, on previous occasions when I've used
this method to recover (which I think was on version 0.18.7), it logged that
it was reading the state and recovering. Within seconds of that message being
printed, the data in the S3 bucket goes from 27gb (which, incidentally, is 20
shards of 1.6gb each) to 5mb. It's clearly getting some information, as the
index is created with the same number of shards (20) and replicas (3) as
before.

In case there was something wrong with the gateway snapshot, I've tested
multiple gateway snapshots which I copied to another bucket - it happens with
every one of the copies I have taken!

I haven't had a chance to follow the steps you posted to move away from the
S3 gateway to local, as our cluster is currently in a broken state and the
only gateway snapshot we have is on S3. The idea of re-indexing from our
Cassandra database really doesn't appeal.


--

Amongst other things today I moved the data from S3 to a shared FS gateway
and the same thing has occurred. It reads the metadata, creates a blank
index, and nukes the content of the gateway.


--