I'm in the process of upgrading our ELK stack to have X-Pack.
The following things about our ELK stack might be relevant:
- It's currently 5.4.0, and I plan to upgrade to 5.4.2 at the same time as installing X-Pack
- It runs in AWS on EC2 instances created from custom AMIs (Amazon Linux baseline, with ES and its plugins installed via yum). The plugins are: discovery-ec2, mapper-size, repository-s3
- We have 3 environments, CI, Staging and Production (which follow a normal development/deployment pipeline)
- We use Cerebro to give some visibility into how the stack is operating
My plan was as follows:
- Create a new AMI with ES 5.4.2 and X-Pack on it (the build is sketched below)
- Roll out an update to AWS via CloudFormation, which will create new EC2 instances and decommission old ones as each new instance comes online and joins the cluster
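For context, the AMI build essentially boils down to the following (a simplified sketch, not the exact build script; it assumes the Elastic yum repository is already configured on the Amazon Linux baseline):

```
# Install the target ES version from the Elastic yum repo
sudo yum install -y elasticsearch-5.4.2

# Same plugin set as the existing AMIs, plus X-Pack
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install --batch discovery-ec2
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install --batch mapper-size
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install --batch repository-s3
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install --batch x-pack
```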
I'm currently struggling with one particular issue when testing the update in the CI environment (which has 1 master node, 2 data nodes, no dedicated ingest or client nodes, and a relatively small number of hourly logstash indexes with a few MB of data each).
When running the update, the master node is successfully replaced with a new node of the appropriate version, and configuration is deployed to disable the X-Pack features I don't have a licence for (i.e. everything except monitoring).
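The relevant settings are along these lines (paraphrased rather than the exact file, which is templated by the deployment; I believe these are the right flags for 5.4):

```
# Effective X-Pack settings pushed into /etc/elasticsearch/elasticsearch.yml
# (illustrative only; the real deployment templates the whole file)
cat <<'EOF' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
xpack.security.enabled: false
xpack.watcher.enabled: false
xpack.graph.enabled: false
xpack.ml.enabled: false
xpack.monitoring.enabled: true
EOF
```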
The replacement of the data nodes is not working, however.
A new node is brought online, it joins the cluster, and shards are allocated to it. However, those shards never leave the INITIALIZING state. At the same time, the new indexes (i.e. the X-Pack monitoring indexes) are created successfully and their shards are allocated to the new node, but not to the old one (because 5.4.2 -> 5.4.0 shard allocation is invalid, which I understand).
There are no log entries that explain why the shards are stuck in INITIALIZING, at least as far as I can see.
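In case it helps anyone point me in the right direction, these are the sorts of calls I can run against the cluster to dig further (the index name below is just a placeholder for one of the stuck hourly indexes; I believe both endpoints exist in 5.4):

```
# List shards and filter to the ones stuck in INITIALIZING
curl -s 'http://localhost:9200/_cat/shards?v' | grep INITIALIZING

# Show recovery progress for active (incomplete) recoveries
curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true'

# Ask the cluster to explain the allocation of one of the stuck shards
# (replace the index name with a real stuck index)
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"index": "logstash-2017.06.26-00", "shard": 0, "primary": false}'
```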
Additionally, I'm getting a very weird problem with Cerebro (0.6.5), where it complains about a unique identifier not existing when it tries to compile the overview information.
From the small amount of investigation I did, it looked like the /_nodes endpoint was constantly returning 3 nodes (master, old data, new data) and then 1 second later, 2 nodes (master, old data), in a perpetual cycle.
Again, there are no logs on the new data node stating that it's leaving or rejoining the cluster.
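If it matters, the flapping can be seen by just polling the cat API with something like:

```
# Poll the node list once a second; the new data node shows up in one
# sample and is missing from the next
watch -n 1 "curl -s 'http://localhost:9200/_cat/nodes?v&h=name,version,node.role,master'"
```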
I have tried:
- Restarting the new data node
At this point I'm at a bit of a loss, so any help or ideas would be appreciated. The environment is currently in this broken state, and I think I can recreate it completely and reproduce the problem at will (but I don't know that for certain yet; I'm still waiting for this update to time out).