Rolling Upgrade?


(jjasinek) #1

I know I have seen this question asked before but I couldn't find a solid
answer for it in the group, so I am probably repeating the question. I
wanted to know what the recommended solution, if there is one, is in regards
to upgrading ElasticSearch versions. I feel that a rolling upgrade can be
performed against a production environment without any data loss but I
can't say that I am 100% confident in it; mostly due to a lack of
experience with doing it. Let me describe our setup and our proposed
solution to see if you all agree that this is the best route to go.

The Scenario

In our situation we have two 3 node clusters running in multiple data
centers for redundancy purposes, with the same data being fed into both
systems. Each cluster contains one index (aliased of course) with 8 shards
and 2 replicas for a total of 24 replicas/shards/chunks. Cluster settings
are minimum_master_nodes = 2, local storage, recover_after_master_nodes =
2, recover_after_time = 5m, and expected_nodes = 3. All nodes can be a
master node.
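
The numbers in this setup can be sanity-checked with a few lines of Python. This is only a sketch of the arithmetic; the function names are made up for illustration, and the quorum formula (a strict majority of master-eligible nodes) is the usual rule of thumb rather than something stated in the post.

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Each primary shard has `replicas` additional copies."""
    return primaries * (1 + replicas)


def majority_quorum(master_eligible_nodes: int) -> int:
    """Smallest node count that is a strict majority."""
    return master_eligible_nodes // 2 + 1


def copies_per_node(primaries: int, replicas: int, nodes: int) -> float:
    """Average number of shard copies per node once everything is allocated."""
    return total_shard_copies(primaries, replicas) / nodes


if __name__ == "__main__":
    print(total_shard_copies(8, 2))   # 24
    print(majority_quorum(3))         # 2
    print(copies_per_node(8, 2, 3))   # 8.0
```

With 3 copies of every shard spread over 3 nodes, and assuming no node is given two copies of the same shard, each node ends up holding one copy of all 8 shards while the cluster is fully allocated. That bears on the data coverage question raised in the rolling upgrade steps.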

Upgrade Process #1 - Rolling Upgrade

If we were to ever perform an upgrade I think I would prefer a rolling
upgrade as opposed to using the Shutdown API so that our system is live all
of the time. I'm just afraid that there will be a point when we do this
that the system will be down based on our settings to avoid split-brain.
This will probably happen because nodes on different ElasticSearch versions
can't form a cluster together and minimum_master_nodes is 2, so there will
be a point when the nodes are split between versions and the system is down.
The downside to this is that there would be 'downtime' for the amount of
time it takes to upgrade one server. I assume behavior would be something like:

  1. Upgrade ElasticSearch on Server One (stop service, install new
    binaries, start service)
    1. Replicas will balance between Server Two and Server Three.
    2. Server One will not join the current cluster due to version
      differences.
    3. Server One will wait in 'recovery' until at least 2 master nodes
      are available.
    4. Server Two and Three remain in a healthy cluster.
  2. Upgrade ElasticSearch on Server Two (stop service, install new
    binaries, start service)
    1. Replicas will now all be on Server Three.
    2. Server Two will form a cluster with Server One and they will
      recover indexes based on the data they have in their data directories.
      1. Here is where I am not 100% sure if we have 100% of the data
        coverage. If we are evenly balanced then there should be at least one
        copy of each shard in each data directory.
    3. Server Three isn't healthy anymore since there aren't enough
      masters available.
  3. Upgrade ElasticSearch on Server Three (stop service, install new
    binaries, start service)
    1. Server Three will join the cluster with Server Two and Server One.
    2. Replicas will get pushed to Server Three. Perhaps not fresh
      copies if it had a portion of an existing replica in its data directory?
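
The steps above can be walked through with a small simulation. This is a toy model, not ElasticSearch behaviour: it assumes, as the post does, that nodes on different versions never join the same cluster, and that a group of nodes is usable only when it has at least minimum_master_nodes members. The node names and the "old"/"new" version labels are placeholders.

```python
from collections import defaultdict

MINIMUM_MASTER_NODES = 2  # from the cluster settings described in the post


def available_clusters(nodes):
    """Group running nodes by version and keep groups that can elect a master.

    `nodes` maps node name -> version label, or None if the node is stopped.
    ASSUMPTION: nodes on different versions never join the same cluster.
    """
    by_version = defaultdict(list)
    for name, version in nodes.items():
        if version is not None:
            by_version[version].append(name)
    return {version: sorted(names)
            for version, names in by_version.items()
            if len(names) >= MINIMUM_MASTER_NODES}


def rolling_upgrade(node_names, old, new):
    """Yield (step description, usable clusters) for each stop/start step."""
    nodes = {name: old for name in node_names}
    for name in node_names:
        nodes[name] = None   # stop service, install new binaries
        yield f"{name} stopped", available_clusters(nodes)
        nodes[name] = new    # start service on the new version
        yield f"{name} started on {new}", available_clusters(nodes)


if __name__ == "__main__":
    for step, clusters in rolling_upgrade(["one", "two", "three"], "old", "new"):
        print(f"{step:24} -> {clusters or 'NO CLUSTER (downtime)'}")
```

Under these assumptions the only step with no usable cluster is the one where the second server is stopped: the first server is alone on the new version and the third is alone on the old one. If the old and new versions were able to join the same cluster, every step would leave at least two running nodes together and the gap would disappear.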

Upgrade Process #2 - Full Cluster Shutdown

The downside to this is that we might be 'down' the amount of time it takes
to upgrade two servers as opposed to one.

  1. Execute the Shutdown API against all servers so that no replica
    re-balancing takes place.
  2. Upgrade each server to the new version of ElasticSearch
  3. Bring them up one at a time

From the scenarios above it sounds like both options leave us with a
slight amount of downtime where we couldn't read or write from the
cluster. For us, working in the finance industry, we can't miss any data
and we are constantly writing to ElasticSearch. Thankfully, since we have
to operate two data-centers that are independent of one another we have a
utility that compares the data that was recently inserted/updated (5-10
minute range searches) in ElasticSearch and attempts to merge data from one
cluster to another. With that utility I'm guaranteed not to have data
loss, but it still means that one of my data-centers would have to be taken
offline while we do this upgrade.
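
The post does not say how the comparison utility works, so the following is only a hypothetical sketch of the idea: look at documents updated in a recent window on each cluster and work out which ones need copying in which direction. The document ids, the `updated_at` timestamp, and the "newest copy wins" rule are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def recent(docs, now, window):
    """Keep only documents updated within `window` before `now`.

    `docs` maps document id -> (updated_at, body).
    """
    cutoff = now - window
    return {doc_id: rec for doc_id, rec in docs.items() if rec[0] >= cutoff}


def plan_merge(cluster_a, cluster_b):
    """Return (copy_to_a, copy_to_b): ids whose newest copy is on the other side."""
    copy_to_a, copy_to_b = [], []
    for doc_id in sorted(set(cluster_a) | set(cluster_b)):
        a, b = cluster_a.get(doc_id), cluster_b.get(doc_id)
        if a is None or (b is not None and b[0] > a[0]):
            copy_to_a.append(doc_id)   # missing or stale on cluster A
        elif b is None or a[0] > b[0]:
            copy_to_b.append(doc_id)   # missing or stale on cluster B
    return copy_to_a, copy_to_b


if __name__ == "__main__":
    now = datetime(2011, 11, 29, 12, 0)
    window = timedelta(minutes=10)
    a = {"t1": (now - timedelta(minutes=3), {"amount": 10}),
         "t2": (now - timedelta(minutes=2), {"amount": 5})}
    b = {"t1": (now - timedelta(minutes=3), {"amount": 10}),
         "t3": (now - timedelta(minutes=1), {"amount": 7})}
    print(plan_merge(recent(a, now, window), recent(b, now, window)))
    # (['t3'], ['t2'])
```

A window-based comparison like this only catches differences that fall inside the window, so the window would need to be at least as long as the time a cluster was unavailable.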

Any suggestions? What have you guys done?


(Shay Banon) #2

I did not understand why a rolling upgrade would cause downtime? You have 3
nodes in the cluster, with minimum master nodes set to 2, so, if you
restart one node at a time you will be good to go. Note though, that a full
cluster shutdown is required when upgrading to a major new version (0.17 ->
0.18 for example).
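
A one-node-at-a-time restart along these lines could be scripted roughly as below. This is a sketch, not an official procedure: waiting for green cluster health between nodes is an added precaution not mentioned in the reply, and `restart` and `fetch` are hypothetical callables supplied by the caller (one stops, upgrades and starts a node, the other performs an HTTP GET and returns the response body). The only ElasticSearch-specific parts are the cluster health endpoint with its `wait_for_status` and `timeout` parameters and the `status` and `timed_out` fields of its response.

```python
import json
import urllib.parse


def health_url(host, wait_for_status="green", timeout="60s"):
    """Build a cluster health URL that blocks until the status is reached."""
    query = urllib.parse.urlencode(
        {"wait_for_status": wait_for_status, "timeout": timeout})
    return f"http://{host}/_cluster/health?{query}"


def wait_for_green(host, fetch):
    """Return True if the cluster reports green before the request times out."""
    health = json.loads(fetch(health_url(host)))
    return health.get("status") == "green" and not health.get("timed_out", False)


def rolling_restart(hosts, restart, fetch):
    """Restart one node at a time, checking cluster health in between."""
    done = []
    for host in hosts:
        restart(host)   # hypothetical: stop service, install binaries, start
        # A real script would also retry until the node's HTTP port is back.
        if not wait_for_green(host, fetch):
            raise RuntimeError(f"cluster not green after restarting {host}")
        done.append(host)
    return done


if __name__ == "__main__":
    restarted = []
    fake_fetch = lambda url: '{"status": "green", "timed_out": false}'
    print(rolling_restart(["es1:9200", "es2:9200", "es3:9200"],
                          restarted.append, fake_fetch))
```

With 3 nodes and minimum master nodes set to 2, stopping one node at a time always leaves two running, which is why this only works when the old and new versions can share a cluster.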

On Tue, Nov 29, 2011 at 10:55 PM, jjasinek jjasinek@gmail.com wrote:



(jjasinek) #3

In the first scenario there would be downtime when Server One was on
0.18, Server Two was shut down and being upgraded, and Server Three was
on 0.17. Because Server One and Server Three were on different
versions they couldn't form a cluster. And because the minimum master
nodes is set to two they probably were in a red/yellow state. I
assume when they are in that state that no read/write operations can
happen (maybe reads but I can't imagine writes are allowed). As such
there will be downtime until at least two masters can see each other.

Regardless, I think for major upgrades we are going to need to
shut down and then do a data sync against our other data center until
ElasticSearch can handle different versions joining clusters
together.

Thanks Shay.

On Nov 30, 8:59 am, Shay Banon kim...@gmail.com wrote:


