Cluster red state after 1.4 to 1.7 update?


We are preparing to upgrade our Elasticsearch servers from 1.4.4 to 1.7.3. From talking to various people, we understood that this should be relatively straightforward and low risk if done via a rolling restart of the cluster. We did the upgrade in our (much smaller) testing environment, and the cluster state went red for several minutes, which gives us pause as we prepare to do the same in production.

Here are the relevant log lines from the first node that was restarted in testing, after which the cluster went into a red state for around 15 minutes.

The worrying bit to me is the ElasticsearchIllegalStateException. Our production cluster has 3923 total shards running on 20 nodes, with 85 TB of data. We had been planning to accomplish the update by a rolling restart of the cluster. But I wanted to make sure we weren't setting ourselves up for a long downtime while this process happens. Any insight is appreciated.

Hi jeffevans,

How many nodes did you have in the test cluster? I hope you have read the upgrade notes on upgrading from 1.4 to 1.7.

A rolling upgrade updates one node at a time and thus avoids downtime.



There are 3 nodes. I checked with our system admin, and the exact rolling restart procedures you linked to were followed, except that the disabling of shard reallocation was not done (and also, the cluster did not attempt to reallocate any shards during the restart). So here was the exact sequence.

  1. Node 1 restart
  2. Cluster state goes to red for a while
  3. Node 2 restart
  4. Cluster state turned to green
  5. Node 3 restart
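For anyone following the same sequence: one way to avoid restarting the next node while the cluster is still red is to poll `GET /_cluster/health` between restarts. Below is a minimal Python sketch of that idea, not the procedure actually used here. The helper names (`fetch_health`, `wait_for_status`), the base URL, and the timeout values are all assumptions made for illustration.

```python
import json
import time
import urllib.request


def fetch_health(base_url):
    """GET /_cluster/health from one node and return the parsed JSON body."""
    with urllib.request.urlopen(base_url + "/_cluster/health") as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_status(fetch, wanted="green", timeout_s=900, poll_s=5, sleep=time.sleep):
    """Poll `fetch()` until the cluster reports `wanted`.

    Returns True once the status matches, or False if `timeout_s`
    seconds of waiting pass first. `fetch` and `sleep` are injected
    so the loop can be exercised without a live cluster.
    """
    waited = 0
    while True:
        try:
            status = fetch().get("status")
        except OSError:
            # The node may still be restarting or unreachable; keep polling.
            status = None
        if status == wanted:
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_s)
        waited += poll_s
```

Between node restarts one could call, for example, `wait_for_status(lambda: fetch_health("http://localhost:9200"))` (the host and port are assumed) and only continue when it returns True.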

Here is the full log output from all 3 nodes during the rolling restart. The time of concern is from 11:46 until 11:54, when Node 1 ("stage-es1") failed to join the cluster, which I believe is why the cluster was in a red state.

The process you followed seems good enough for the upgrade; the only difference is that I had shard allocation disabled in my case. Also, from past experience, if you are on AWS and use the cloud-aws plugin, nodes join the cluster much faster than with zen discovery. The failure to make contact could also be a network issue. I think you can go ahead with the production upgrade.
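For reference, the step that was skipped in the test run (disabling shard allocation before a node restart and re-enabling it afterwards) goes through the cluster settings API, using the `cluster.routing.allocation.enable` setting. Here is a rough Python sketch that only builds the request. The function name `allocation_request` and the base URL are made up for illustration, and the sketch deliberately accepts only the two values `none` and `all`.

```python
import json
import urllib.request


def allocation_request(base_url, mode):
    """Build a PUT /_cluster/settings request that sets shard allocation to `mode`.

    Only "none" (before stopping a node) and "all" (after it has rejoined)
    are accepted in this sketch.
    """
    if mode not in ("all", "none"):
        raise ValueError("mode must be 'all' or 'none'")
    # A transient setting does not survive a full cluster restart.
    body = {"transient": {"cluster.routing.allocation.enable": mode}}
    return urllib.request.Request(
        base_url + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Sending it would look like `urllib.request.urlopen(allocation_request("http://localhost:9200", "none"))` before stopping a node, and the same call with `"all"` once the node has rejoined (host and port assumed).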