Elasticsearch pod stuck in CrashLoopBackOff

Hi all,

To outline the problem: there is a StatefulSet that specifies ES 7.4 as the container image for a set of 3 pods. Two of the three pods are running perfectly fine on version 7.4, but the last pod has somehow tried to upgrade itself to version 7.6. This is the error message every time the pod tries to start up:

{"type": "server", "timestamp": "2020-10-02T02:02:37,799Z", "level": "WARN", "component": "o.e.b.ElasticsearchUncaughtExceptionHandler", "cluster.name": "search", "node.name": "search-es-1", "message": "uncaught exception in thread [main]",
"stacktrace": ["org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: cannot downgrade a node from version [7.6.0] to version [7.4.0]",
"at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:125) ~[elasticsearch-cli-7.4.0.jar:7.4.0]",
"at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.4.0.jar:7.4.0]",
"at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.4.0.jar:7.4.0]",
"Caused by: java.lang.IllegalStateException: cannot downgrade a node from version [7.6.0] to version [7.4.0]",
"at org.elasticsearch.env.NodeMetaData.upgradeToCurrentVersion(NodeMetaData.java:94) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.env.NodeEnvironment.loadOrCreateNodeMetaData(NodeEnvironment.java:426) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:304) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.node.Node.<init>(Node.java:275) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.node.Node.<init>(Node.java:255) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.4.0.jar:7.4.0]",
"at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.4.0.jar:7.4.0]",
"... 6 more"] }

Now, I'm no K8s expert, but what is the best way to get this pod back up and running on version 7.4? I don't care about the data on the node, as the cluster is currently not being used. Can I simply delete the PVC and the pod, and will the StatefulSet then spin up a fresh PVC running 7.4 again, or is this something I would need to do manually?
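For reference, the sequence I had in mind is roughly the following, assuming the broken pod is search-es-1 and its claim is the one named elasticsearch-data-search-es-1 (please correct me if this is the wrong approach):

# mark the claim for deletion; it will sit in Terminating while the pod still mounts it
kubectl delete pvc elasticsearch-data-search-es-1
# delete the pod; once it is gone the PVC deletion completes and the StatefulSet
# controller recreates the pod along with a fresh PVC from its volumeClaimTemplates
kubectl delete pod search-es-1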

I'm also wondering how this happened in the first place. Has the ECK operator tried to automatically upgrade the 7.4 cluster to 7.6? For some extra context: the cluster is managed by ECK operator 1.0.1, and there is a separate ES cluster for aggregated logging which IS on version 7.6 of ES.

Any help/advice would be much appreciated.

How are your PVs being created? We've seen similar issues in the past where PVs were being reused without being properly scrubbed of data first. In this case it looks like the 7.4 pod is getting a PV that already has 7.6 data on it.
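If you can share the storage details for the claim used by the broken pod, that would help narrow it down, e.g. (substituting your actual claim and volume names):

kubectl describe pvc <claim-used-by-the-broken-pod>
kubectl describe pv <volume-the-claim-is-bound-to>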

Thanks for the response Anya. I think you're probably right.

The broken pod in question has a volume elasticsearch-data which refers to the PVC elasticsearch-data-search-es-1. I believe we use dynamically provisioned Azure disks. Here's the output of kubectl describe pvc for the claim used by the pod:

Name:          elasticsearch-data-search-es-1
Namespace:     default
StorageClass:  managed-premium
Status:        Bound
Volume:        pvc-123d80c1-1f91-11ea-84d3-7e3467d53186
Labels:        common.k8s.elastic.co/type=elasticsearch
               elasticsearch.k8s.elastic.co/cluster-name=search
               elasticsearch.k8s.elastic.co/statefulset-name=search-es
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/azure-disk
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      30Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    search-es-1

And here is a kubectl describe pv for the PV itself:

Name:            pvc-123d80c1-1f91-11ea-84d3-7e3467d53186
Labels:          <none>
Annotations:     pv.kubernetes.io/bound-by-controller: yes
                 pv.kubernetes.io/provisioned-by: kubernetes.io/azure-disk
                 volumehelper.VolumeDynamicallyCreatedByKey: azure-disk-dynamic-provisioner
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    managed-premium
Status:          Bound
Claim:           default/elasticsearch-data-search-es-1
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        30Gi
Node Affinity:   <none>
Message:
Source:
    Type:         AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
    DiskName:     kubernetes-dynamic-pvc-123d80c1-1f91-11ea-84d3-7e3467d53186
    DiskURI:      <redacted>
    Kind:         Managed
    FSType:
    CachingMode:  ReadOnly
    ReadOnly:     false

Hope that helps explain how the PVs are created. Is there a kubectl command I can run to purge the PVC, or to reset it back to a 7.4 state?

Dave

That's very odd -- the dynamic provisioner should handle that for you automatically. Can you see anything unusual in the creation times? In normal operation the PV should be created shortly after the PVC (which in turn is created shortly after the Elasticsearch resource and StatefulSet). I'd be curious what you find.
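Something like this should print the creation timestamps side by side so you can compare them (adjust the resource names and namespace to yours; I'm assuming everything is in default):

kubectl get elasticsearch,statefulset,pvc,pv -n default \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp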

If you do not care about the data in the whole cluster, the easiest fix would be to delete the Elasticsearch resource and re-create it, which should delete the existing PVs (since the reclaim policy of the storage class is delete) and then create new ones.
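A rough sketch, assuming the resource is named search in the default namespace and elasticsearch.yaml stands in for whatever manifest you originally applied:

# ECK tears down the StatefulSet, pods and PVCs; the Delete reclaim policy
# then removes the underlying Azure disks
kubectl delete elasticsearch search
# once everything is gone, re-create the cluster from the original manifest
kubectl apply -f elasticsearch.yaml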

As an aside, we have shipped quite a few fixes and upgrades in ECK since 1.0.1 and would recommend upgrading -- the upgrade path is straightforward for most use cases. I don't remember any known issues that would relate to this though.
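For the 1.x series the upgrade should just be a matter of re-applying the all-in-one manifest for the newer version, roughly like the below (double-check the exact URL and any version-specific steps in the ECK upgrade docs before running it):

kubectl apply -f https://download.elastic.co/downloads/eck/<version>/all-in-one.yaml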

Interesting... I checked the creation times and they all say 292d, so I assume they were all created at around the same time and in the correct order.

How would I go about "deleting the Elasticsearch resource"? Is it as simple as kubectl delete elasticsearch <cluster-name>, or would I be better off deleting everything manually? Is there some documentation you could point me to?

And I will also look into upgrading ECK at some stage once I've got this first issue resolved.

Thanks for your help and advice so far Anya.