ES 5.4.0 -> 5.4.2 + X-Pack Upgrade: Perpetually Initializing Shards

todd-bowles-console · June 26, 2017, 7:09am

I'm in the process of upgrading our ELK stack to have X-Pack.

The following things about our ELK stack might be relevant:

- It's currently 5.4.0, and I planned to upgrade to 5.4.2 at the same time as I install X-Pack
- It runs in AWS on EC2 instances created from custom AMIs (an Amazon Linux baseline plus ES and its plugins, installed via yum). Plugins are: discovery-ec2, mapper-size, repository-s3
- We have 3 environments: CI, Staging and Production (which follow a normal development/deployment pipeline)
- We use Cerebro to give some visibility into how the stack is operating

My plan was as follows:

- Create a new AMI with ES 5.4.2 and X-Pack on it
- Roll out an update to AWS via CloudFormation, which will create new EC2 instances and decommission old ones as each new instance comes online and joins the cluster

I'm currently struggling with one particular issue in testing the update to the CI environment (which has 1 Master, 2 Data, 0 Ingest, 0 Client and a relatively small number of hourly logstash indexes with a few MB of data each).

In running the update the Master Node is successfully replaced with a new node of the appropriate version, and configuration is deployed to disable the X-Pack features I don't have a licence for (i.e. everything except monitoring).
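The deployed configuration itself is not shown in this thread. As a rough sketch only, assuming the standard X-Pack 5.x feature toggles, the `elasticsearch.yml` lines for "everything except monitoring" could be generated like this (the `XPACK_FEATURE_FLAGS` dict and `render_yaml_lines` helper are hypothetical names for illustration):

```python
# Assumption: these are the X-Pack 5.x per-feature toggles; the poster's
# actual configuration is not shown, so treat this as an illustration only.
XPACK_FEATURE_FLAGS = {
    "xpack.monitoring.enabled": True,   # the only feature left on
    "xpack.security.enabled": False,
    "xpack.watcher.enabled": False,
    "xpack.graph.enabled": False,
    "xpack.ml.enabled": False,
}


def render_yaml_lines(flags):
    """Render the flags as `key: true|false` lines for elasticsearch.yml."""
    return [
        "{}: {}".format(key, "true" if enabled else "false")
        for key, enabled in sorted(flags.items())
    ]


if __name__ == "__main__":
    print("\n".join(render_yaml_lines(XPACK_FEATURE_FLAGS)))
```

The same flags would need to be present on every node before it starts, since a node that boots with security enabled will not talk to nodes that have it disabled.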

The replacement of the data nodes is not working however.

A new node is brought online, it joins the cluster and shards are allocated to it. However, those shards never exit the INITIALIZING state. At the same time, the new indexes (i.e. the es monitoring indexes) are created successfully, and the shards live successfully on the new node, but not the old one (because 5.4.2 -> 5.4.0 shard allocation is invalid, which I understand).
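One way to see exactly which shard copies are stuck, and on which node, is to filter the output of `GET /_cat/shards`. The following is a minimal sketch, assuming the cluster is reachable over plain HTTP without authentication; the function names are hypothetical.

```python
import json
import urllib.request
from collections import defaultdict


def fetch_cat_shards(base_url):
    """Fetch GET /_cat/shards as a list of dicts, one per shard copy."""
    url = (base_url.rstrip("/")
           + "/_cat/shards?format=json&h=index,shard,prirep,state,node")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def initializing_by_node(rows):
    """Group shard copies still in INITIALIZING by the node they target."""
    stuck = defaultdict(list)
    for row in rows:
        if row.get("state") == "INITIALIZING":
            stuck[row.get("node")].append(
                (row["index"], row["shard"], row["prirep"])
            )
    return dict(stuck)
```

Calling `initializing_by_node(fetch_cat_shards("http://localhost:9200"))` a few times, some seconds apart, would show whether the same copies stay stuck or whether recoveries are restarting. The URL is a placeholder.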

There are no log entries that say anything about why the shards are stuck in INITIALIZING, at least as far as I can see.

Additionally, I'm getting a very weird problem with Cerebro (0.6.5), where it complains about a unique identifier not existing in its attempts to compile overview information.

From the small amount of investigation I did, it looked like the /_nodes endpoint was constantly returning 3 nodes (master, old data, new data) and then 1 second later, 2 nodes (master, old data), in a perpetual cycle.

Again, there are no logs on the new data node that state that it's leaving or entering the cluster.
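To pin down the flapping with timestamps rather than by eye, the `/_nodes` endpoint can be polled and each change in membership recorded. This is a minimal sketch, assuming unauthenticated HTTP access; `fetch_node_names`, `membership_changes` and `watch` are hypothetical helper names.

```python
import json
import time
import urllib.request


def fetch_node_names(base_url):
    """Return the set of node names currently reported by GET /_nodes."""
    url = base_url.rstrip("/") + "/_nodes?filter_path=nodes.*.name"
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return {info["name"] for info in body.get("nodes", {}).values()}


def membership_changes(previous, current):
    """Return (joined, left) as sorted lists of node names."""
    return sorted(current - previous), sorted(previous - current)


def watch(base_url, interval_seconds=1.0, samples=30):
    """Poll the cluster and print a timestamped line on every change."""
    previous = fetch_node_names(base_url)
    for _ in range(samples):
        time.sleep(interval_seconds)
        current = fetch_node_names(base_url)
        joined, left = membership_changes(previous, current)
        if joined or left:
            print(time.strftime("%H:%M:%S"), "joined:", joined, "left:", left)
        previous = current
```

Running `watch` against the master and against each data node separately would show whether all nodes agree on the membership, which helps distinguish a node that is really dropping out from one that only some nodes can see.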

I have tried:

- Restarting the new data node

At this point, I'm at a bit of a loss, so any help or ideas would be appreciated. The environment is currently in this broken state, and I think I can recreate it completely and reproduce the problem at will (but I don't know for certain yet, as I'm still waiting for this update to time out).

timost · June 26, 2017, 7:56am

Not sure if this can help but here are a few things you can try.

Double check your logs:
- Do you have a custom logging config? It's odd that you cannot see any trace of the node leaving/joining the cluster in the logs.
- Did you check the logs on the other nodes of the cluster?

On your new data node:
- Check it has enough heap space allocated.
- Check it has enough disk space.
- Did you try running the Elasticsearch server manually from the command line? It might output some useful information.

I'm not sure if this works on initializing shards, but you could try allocating those shards manually using the reroute API in dry-run mode and check the explanations it returns. It might tell you what's preventing the shards from being allocated.
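A dry-run reroute can be sketched as follows. This assumes unauthenticated HTTP access and uses the `allocate_replica` command with the `dry_run` and `explain` query parameters; the helper names, and the index, shard and node values you pass in, are placeholders. As noted above, it is not certain the command is accepted for a shard that is already initializing.

```python
import json
import urllib.request


def build_dry_run_reroute(index, shard, node):
    """Build the path and body for a reroute that changes nothing."""
    path = "/_cluster/reroute?dry_run=true&explain=true"
    body = {
        "commands": [
            {"allocate_replica": {"index": index, "shard": shard, "node": node}}
        ]
    }
    return path, body


def post_json(base_url, path, body):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def explanations(response):
    """Return the list under the "explanations" key (empty if absent)."""
    return response.get("explanations", [])
```

Because `dry_run=true` is set, the cluster state is not modified; only the explanations are of interest. The cluster allocation explain API (`/_cluster/allocation/explain`) is another place to look for the same kind of information.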

warkolm · June 26, 2017, 8:43am

If you are installing X-Pack for the first time on an existing cluster, you must perform a full cluster restart. Installing X-Pack enables security and security must be enabled on ALL nodes in a cluster for the cluster to operate correctly. When upgrading you can usually perform a rolling upgrade.

There's currently no other option, sorry!

todd-bowles-console · June 27, 2017, 12:16am

I get that, but everything except monitoring is disabled before the replacement nodes start, so security shouldn't be causing an issue.

The documentation for a full cluster restart sounds pretty...manual. I've been trying to automate as much as possible using AWS CloudFormation and Octopus Deploy, so I don't have to repeat the same thing three times for our 3 environments (and so it's completely reproducible for the eventual production upgrade).

Part of this involves spinning up new EC2 instances with new Elasticsearch on them, getting them to join the cluster, waiting for green and then killing the old ones, but it sounds like the recommended approach is to upgrade existing instances (after wrapping the whole thing in a bunch of setup, including shutting down the entire cluster).
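The "wait for green, then kill the old node" gate described above could look roughly like this. It is a minimal sketch, assuming unauthenticated HTTP access and the cluster health API's `wait_for_status` and `timeout` parameters; the function names and the `expected_nodes` argument are hypothetical.

```python
import json
import urllib.error
import urllib.request


def fetch_health(base_url, wait_seconds=60):
    """Ask the cluster to block until it is green or the wait expires."""
    url = "{}/_cluster/health?wait_for_status=green&timeout={}s".format(
        base_url.rstrip("/"), wait_seconds
    )
    try:
        with urllib.request.urlopen(url, timeout=wait_seconds + 10) as resp:
            raw = resp.read()
    except urllib.error.HTTPError as err:
        # Assumption: an expired wait may be answered with HTTP 408 while
        # still carrying the usual JSON body, so read it instead of failing.
        if err.code != 408:
            raise
        raw = err.read()
    return json.loads(raw.decode("utf-8"))


def safe_to_terminate_old_node(health, expected_nodes):
    """True only when the cluster is green, complete and quiet."""
    return (
        health.get("status") == "green"
        and not health.get("timed_out", False)
        and health.get("number_of_nodes", 0) >= expected_nodes
        and health.get("initializing_shards", 0) == 0
        and health.get("relocating_shards", 0) == 0
    )
```

In the situation described in this thread the gate would never open, because the shards on the new data node stay in INITIALIZING, which is the behaviour a deployment script should treat as a failed rollout rather than wait on indefinitely.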

todd-bowles-console · June 27, 2017, 1:21am

Looking into the logs on all of the nodes in the cluster:

- New master: Has errors relating to the es-monitoring indexes created by X-Pack, complaining about no active primary shard. Could be related to problems with allocating shards across 5.4.0 and 5.4.2 instances.
- Old data: Has a bunch of errors in its logs about failures connecting to the old nodes in the cluster (the old master and the other data node that were terminated). Not sure why it would be trying to connect to those nodes, as they don't exist anymore.
- New data: Unfortunately, I deleted the stack before I checked this one, as I'm trying out some things.

I'm about to try a slightly different tack.

I'm going to upgrade to 5.4.2 first THEN run an upgrade to introduce X-Pack, to see if it helps.
