Update elasticsearch cluster hardware with no downtime

Hi All,

I have an Elasticsearch 0.90 cluster running a single node on each of two Amazon instances.
The cluster is configured for:
5 shards / 2 replica

From my understanding of how Elasticsearch works, I should be able to upgrade the Amazon instances to ones with more memory without creating any downtime for my clients (the clients are aware of both nodes). This should also be simple, as I don't plan to upgrade the Elasticsearch version yet (a challenge for another day). But because downtime can be pretty catastrophic for some clients, I wanted to run the following approach by you to make sure it makes sense:

  1. bring down elasticsearch running on one instance: curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'
  2. wait some time?
  3. upgrade the instance to have more memory.
  4. bring the new instance back up.
  5. wait some time for cluster to be green again.
  6. repeat with other instance.
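
For step 5, rather than waiting a fixed amount of time, the cluster health API can be asked to block until the cluster is green. Below is a minimal sketch, assuming the node answers on localhost:9200 and that the health response carries the usual `status`, `timed_out` and `number_of_nodes` fields. The function names are made up for illustration.

```python
import json
import urllib.request


def health_url(host, status="green", timeout="60s"):
    """Build a cluster-health URL that blocks until `status` is reached
    or the server-side `timeout` elapses."""
    return "http://%s/_cluster/health?wait_for_status=%s&timeout=%s" % (
        host, status, timeout)


def is_recovered(health, expected_nodes):
    """True only if the wait did not time out, the cluster is green,
    and every expected node has rejoined."""
    return (not health.get("timed_out", True)
            and health.get("status") == "green"
            and health.get("number_of_nodes") == expected_nodes)


def wait_for_green(host, expected_nodes, timeout="60s"):
    """Call the health API once and report whether the cluster recovered.
    (Hypothetical helper; call it in a loop if one timeout is not enough.)"""
    with urllib.request.urlopen(health_url(host, timeout=timeout)) as resp:
        health = json.loads(resp.read().decode("utf-8"))
    return is_recovered(health, expected_nodes)
```

Something like `wait_for_green("localhost:9200", expected_nodes=2)` would then gate step 6, so the second node is only taken down once the first has rejoined and all shards are assigned.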

After I bring down the first node, is there any need to wait? I assume once the node is down, I might as well just bring down the instance altogether.
If there are any gotchas to consider, or if there is a better approach, please let me know.

thanks!
Ed

It all depends on how much you want to avoid an outage. There is an allocation API that lets you push the shards off of a node. You can bring up a new instance, have it join the cluster, use the API to move the shards off an old node, wait until no shards remain on that node, and then shut the old node down for good. Repeat for the next node. That way there is never a moment without two-way replication, but it takes more time.
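
One way to do the "push the shards off" part is shard allocation filtering through the cluster update settings API. Here is a rough sketch, with assumptions: that the `cluster.routing.allocation.exclude._ip` setting is available on your version, that the cluster state lists shards per node under `routing_nodes.nodes`, and that the IP and node id are placeholders. Check all of this against a test cluster first. The helper names are hypothetical.

```python
import json
import urllib.request


def exclude_ip_body(ip):
    """Settings body asking the cluster to move shards away from `ip`.
    'transient' means the setting is dropped on a full cluster restart."""
    return {"transient": {"cluster.routing.allocation.exclude._ip": ip}}


def exclude_node(host, ip):
    """PUT the exclusion to the cluster settings endpoint on `host`."""
    data = json.dumps(exclude_ip_body(ip)).encode("utf-8")
    req = urllib.request.Request(
        "http://%s/_cluster/settings" % host, data=data, method="PUT")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def shards_on_node(cluster_state, node_id):
    """Count shards assigned to `node_id` in a cluster-state document.
    Assumes the layout {"routing_nodes": {"nodes": {node_id: [shard, ...]}}}."""
    nodes = cluster_state.get("routing_nodes", {}).get("nodes", {})
    return len(nodes.get(node_id, []))
```

The idea is to call `exclude_node(...)` with the old node's IP, poll `GET /_cluster/state` until `shards_on_node(...)` returns 0 for that node, and only then shut it down.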

If you are OK with dropping to a single copy of your data during the node upgrade, you can do what you've proposed. Don't wait any time in step 2. This process is optimized somewhat in 1.6 and 1.7 to be faster, but you have a long upgrade path ahead before you get there.

Good luck. I've done this dozens of times and it works. I used the first approach a year ago when we were replacing hard drives in our nodes.

That allocation API looks very useful.

If I use your method, I'd also have to make sure the clients were aware of the new (third) instance. Not very hard to do, but just another "todo" that I would have to keep in mind. I have to weigh the extra steps of your way against how comfortable I am living on one copy of the data for the upgrade window. Decisions, decisions...

Thanks for your help!