Elasticsearch stuck in ApplyingChanges and reconciliation ended with failed predicates

ECK 1.0.1
k8s 1.16.9

The ECK operator's been stellar. But I ran into trouble deploying node resource and count changes at the same time.

In one cluster, it seems to have worked as expected, but in another the elasticsearch object is stuck applying changes:

  NAME         HEALTH   NODES   VERSION   PHASE             AGE
  es-cluster   green    17      7.9.3     ApplyingChanges   321d

The change requested increased the node count from 17 to 23 total and changed resources on the existing 17 nodes.
New nodes were successfully added, but each reconciliation attempt reported the same entry under failed_predicates:

  do_not_restart_healthy_node_if_MaxUnavailable_reached

It listed all pre-existing data and master nodes as the causes of the failure.

I was using the default Update Strategy and change budget, so it should have been able to add all new nodes immediately and terminate 1 node at a time. But it didn't attempt to terminate any existing nodes.
And after 29 reconciliation attempts over ~60 seconds, it stopped trying.
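
For reference, this is roughly what I understand the default change budget to be when written out explicitly. It is only a sketch based on my reading of the ECK docs, not a copy of my actual manifest: the nodeSets are left out, and the defaults in the comments are my assumption of what the operator applies when the updateStrategy block is omitted.

  apiVersion: elasticsearch.k8s.elastic.co/v1
  kind: Elasticsearch
  metadata:
    name: es-cluster
  spec:
    version: 7.9.3
    updateStrategy:
      changeBudget:
        # maxSurge is left unset here; as far as I can tell that means
        # unbounded, so all new Pods can be created immediately.
        # maxUnavailable defaults to 1, so at most one existing Pod
        # should be taken down at a time.
        maxUnavailable: 1
    # nodeSets omitted for brevity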

Is there a bug or known limitation in making both changes at the same time with that update strategy?
Is there a way to kick start the watcher again?

I've tried manually restarting nodes and it has no effect on the Elasticsearch object.

Thanks in advance!

We found that the operator had crashed because it ran out of resources. Fixing that solved everything.

Thanks for the update and sorry for not helping earlier.

Could you share more information about your use case? How many Elastic Stack components (Elasticsearch, Kibana, APM Server, Enterprise Search, and Beats) and how many nodes per component are managed by the ECK operator?

In this specific k8s cluster, there was only an Elasticsearch component, initially with 3 master and 14 data nodes. No other components were deployed, but the operator crashed with an OOM when we added 6 more data nodes.

Initial resources:

  limits:
    cpu: 500m
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 250Mi

Now:

  limits:
    cpu: 500m
    memory: 750Mi
  requests:
    cpu: 100m
    memory: 1500Mi