ECK cluster dies after 3-4 months and tries to automatically create a new cluster

We've been using ECK since late last year. We started with the beta operator, moved to 1.0, and are now on 1.1.2, doing a complete uninstall/reinstall of the operator each time, and have upgraded the cluster through various versions in between; it is currently on 7.8. About 3-4 times over the past year, usually 3-4 months into the cluster working fine, it randomly dies in the middle of the night and acts as though it's spinning up a new cluster: it releases its PVCs, recreates all of its secrets, and tries to initialize a new cluster, which always fails. It then stays in a failed state until I delete it and recreate it. Has anyone else experienced this behavior? I would really like to understand why this is happening, as it makes for some unpleasant late-night calls.

Here are the logs from the operator when this begins to happen:

Have you by any chance copied any of the ECK-created secrets to another namespace, including their owner references? See https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-common-problems.html#k8s-common-problems-owner-refs
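If you do need a copy of a secret in another namespace, one way to avoid carrying the owner references along is to strip them (and the other server-managed metadata) from the manifest before applying it. As I understand the linked page, a copy that still points at an owner in a different namespace is what can lead the Kubernetes garbage collector to delete resources unexpectedly. Here is a minimal sketch in Python that works on an already-parsed manifest (for example the JSON from `kubectl get secret <name> -o json`). The function name `strip_for_copy`, the field list, and the sample secret and namespace names are my own illustration, not something taken from the ECK docs.

```python
import copy

# Metadata fields set by the API server or by the owning controller.
# Assumption: dropping these is enough for a clean copy; adjust as needed.
SERVER_MANAGED_FIELDS = (
    "ownerReferences",
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "selfLink",
    "managedFields",
)


def strip_for_copy(secret: dict, target_namespace: str) -> dict:
    """Return a copy of a Secret manifest that is safe to apply elsewhere.

    The input dict is left untouched; the copy has no owner references
    and points at target_namespace.
    """
    clean = copy.deepcopy(secret)
    metadata = clean.setdefault("metadata", {})
    for field in SERVER_MANAGED_FIELDS:
        metadata.pop(field, None)
    metadata["namespace"] = target_namespace
    return clean


# Hypothetical example manifest, shaped like an ECK-managed secret.
original = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {
        "name": "quickstart-es-elastic-user",
        "namespace": "elastic",
        "uid": "00000000-0000-0000-0000-000000000000",
        "resourceVersion": "12345",
        "ownerReferences": [
            {
                "apiVersion": "elasticsearch.k8s.elastic.co/v1",
                "kind": "Elasticsearch",
                "name": "quickstart",
                "uid": "11111111-1111-1111-1111-111111111111",
            }
        ],
    },
    "data": {"elastic": "cGFzc3dvcmQ="},
}

copied = strip_for_copy(original, "monitoring")
```

You could then dump `copied` back to JSON and apply it in the target namespace. Note that a copy made this way is no longer kept in sync by the operator, so it has to be refreshed by hand if the original secret changes.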


Wow, that's a fun bug. Yep, that's probably exactly what's been happening to us. So glad to finally have an answer to this crazy behavior. Thanks!