Error message I can't explain when using elasticsearch Ruby gem

Shub · August 24, 2015, 12:59pm

Hey folks,
I could use a bit of help working out what's wrong with my use of the elasticsearch Ruby gem. I put together a set of simple scripts to automate dumping an index, or a set of indices, from one environment to a snapshot and restoring that snapshot to a different environment. We have to do this quite often, and doing it in Ruby seemed more elegant (and easier for me) than doing it in Bash with curl.

Overall, my scripts work fine, but I always have this weird issue when working with larger indices. Since restoring those large indices was the whole basis for developing those scripts, I'm kind of in a bind. To be clear, the scripts work fine with small indices (<5 GB) but I'm working with a 120-GB, 200-million-document index in this instance.

Here's an overview of what my script does:

1. Create a repo by the specified name if it doesn't exist.
2. Dump a snapshot with the specified name to the specified repo, and append a timestamp to the snapshot's name.
3. When the snapshot is done, attempt to restore it to the other cluster, and append the same timestamp to the names of the indices.
4. Create an alias foo for the index foo_<timestamp> so our app can find it.
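
For readers who can't open the pastebin links, here is a rough sketch of what the naming, restore and alias steps above could look like with the gem's snapshot and indices APIs. It is not the code from snappy.rb: the helper names (timestamp_suffix, restore_with_timestamp) are made up for illustration, the timestamp format is inferred from the snapshot name in the error message, and `client` is assumed to be an Elasticsearch::Client pointed at the destination cluster.

```ruby
# Hypothetical helpers, not the author's snappy.rb.

# Builds a suffix like "20150824_0800" (format inferred from the
# snapshot name shown in the error message).
def timestamp_suffix(time = Time.now)
  time.strftime('%Y%m%d_%H%M')
end

# Restores `index` from the snapshot under the name "<index>_<stamp>",
# then points an alias named after the original index at it.
# `client` is assumed to behave like an Elasticsearch::Client.
def restore_with_timestamp(client, repo, snapshot, index, stamp)
  client.snapshot.restore(
    repository: repo,
    snapshot: snapshot,
    body: {
      indices: index,
      rename_pattern: '(.+)',            # capture the original index name
      rename_replacement: "$1_#{stamp}"  # e.g. foo -> foo_20150824_0800
    }
  )
  # The alias can only be created if no index called `index` already
  # exists on the destination cluster.
  client.indices.put_alias(index: "#{index}_#{stamp}", name: index)
end
```

With this sketch, restoring index foo from snapshot mysnapshot_20150824_0800 would produce an index foo_20150824_0800 with an alias foo on the destination cluster.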

First, here's my Ruby library. I'm a newbie so if you think my code is horrible, that's expected.
http://pastebin.com/ZX9Tfxpp

And here's the script that's actually doing things:
http://pastebin.com/XSzDkha9

And lastly, here's the error I get:

/Library/Ruby/Gems/2.0.0/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/transport/base.rb:135:in `__raise_transport_error': [503] {"error":"ConcurrentSnapshotExecutionException[[foo:mysnapshot_20150824_0800] a snapshot is already running]","status":503} (Elasticsearch::Transport::Transport::Errors::ServiceUnavailable)
    from /Library/Ruby/Gems/2.0.0/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/transport/base.rb:227:in `perform_request'
    from /Library/Ruby/Gems/2.0.0/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
    from /Library/Ruby/Gems/2.0.0/gems/elasticsearch-transport-1.0.12/lib/elasticsearch/transport/client.rb:119:in `perform_request'
    from /Library/Ruby/Gems/2.0.0/gems/elasticsearch-api-1.0.12/lib/elasticsearch/api/namespace/common.rb:21:in `perform_request'
    from /Library/Ruby/Gems/2.0.0/gems/elasticsearch-api-1.0.12/lib/elasticsearch/api/actions/snapshot/create.rb:43:in `create'
    from /Users/fdelpierre/work/snappy/snappy.rb:172:in `es_create_snapshot'
    from /Users/fdelpierre/work/snappy/snappy.rb:302:in `snappy'
    from /Users/fdelpierre/work/snappy/test.rb:38:in `<main>'
    Start time: 2015-08-24 08:00:32 -0400
    Specified repo 'foo' exists, skipping creation...
[Finished in 61.8s with exit code 1]
[shell_cmd: ruby "/Users/me/work/snappy/test.rb"]
[dir: /Users/me/work/snappy]
[path: /usr/bin:/bin:/usr/sbin:/sbin]

If it matters, that was me testing the script on OS X 10.10 with the native Ruby 2.0, but the same thing happens on our Jenkins server with rbenv and Ruby 2.2.

Despite the above error, the snapshot does get created fine: the error drops me back to the shell, but the create_snapshot operation keeps running on the cluster, and once the source cluster is done dumping the data I can restore the snapshot manually to the destination cluster. During the script's execution, though, it looks like something tries to... do something that makes the ES cluster say "hey dumb-dumb, you can't do that, I'm already dumping a snapshot with that name", and Ruby raises an exception. I could add some error handling to catch it, but that seems like a bad idea since I don't know why the exception is being raised in the first place.

I'm "fine" dumping and restoring snapshots via curl or the Kopf plugin, but we (DevOps) are trying to make this a push-button operation for the QA team. I figure something is wrong with my code rather than the gem or Elasticsearch itself, but I can't figure it out. I would greatly appreciate any help.

Shub · August 27, 2015, 2:54pm

Well, in case anybody cares, I worked around it by setting wait_for_completion to false in my code and tweaking and adding code here and there to check on the status of the snapshot while it's being dumped, so the script knows when it's done baking and can proceed with restoring it to the destination cluster.
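
In case it helps anyone who lands here later, a minimal sketch of that kind of workaround might look like the following. This is not the actual code from the scripts: the helper name snapshot_and_wait and its poll_interval / max_polls parameters are invented for the example, and `client` is assumed to be an Elasticsearch::Client. It starts the snapshot with wait_for_completion set to false, then polls the snapshot's state until it reports SUCCESS.

```ruby
# Hypothetical helper, not the code from the original scripts.
# Starts a snapshot without blocking on the HTTP request, then polls
# its state until it finishes. `client` is assumed to behave like an
# Elasticsearch::Client.
def snapshot_and_wait(client, repo, snapshot, indices,
                      poll_interval: 30, max_polls: 240)
  client.snapshot.create(
    repository: repo,
    snapshot: snapshot,
    wait_for_completion: false, # return right away instead of blocking
    body: { indices: indices }
  )

  max_polls.times do
    info  = client.snapshot.get(repository: repo, snapshot: snapshot)
    state = info['snapshots'].first['state']
    case state
    when 'SUCCESS'     then return state
    when 'IN_PROGRESS' then sleep(poll_interval)
    else raise "Snapshot #{snapshot} ended in state #{state}"
    end
  end

  raise "Gave up waiting for snapshot #{snapshot}"
end
```

Once this returns, the script can move on to the restore step. The default values for poll_interval and max_polls are arbitrary and would need tuning for an index of the size mentioned above.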