Does ECK support local persistent disks and is it a good idea?

I am using GKE and we'd like to build several Elasticsearch Clusters using ECK. Some of them will be rather big (10+ TB) and have a decent throughput (1+ TB per day), others are smaller.

We are wondering whether it's a good idea to use local persistent disks (instead of regular persistent volumes), as they offer much better performance at a lower price ($0.08 instead of $0.17 per SSD GB-month). I am fully aware that local persistent disks have a few downsides:

  • It is bound to a single Kubernetes node
  • It cannot be resized
  • Data loss is possible, as it is not replicated, unlike persistent volumes
  • You must use a multiple of 375 GB partitions, and performance scales with the size/number of the disks/partitions as well

However, even after considering these downsides, I think it could be a good idea to use them (maybe only for our hot nodes?). The resizing problem can be worked around by adding or removing Elasticsearch nodes, and the risk of data loss can be reduced by using Elasticsearch replicas, which also improve read performance since replicas are queried as well.
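For reference, the replica count of an index can be raised with a settings update through the Elasticsearch API (`my-index` is a placeholder index name here):

```
PUT my-index/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}
```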

My only concern is whether ECK supports that. Whenever a node is removed, the data must be transferred to a different node or it will be lost. When a node dies, the disk is gone as well, right? Are there any other concerns/issues I should be aware of?

I am surprised the documentation barely discusses this topic given the potential performance impact.

ECK is agnostic to the type of storage used -- so if you can operate it with StatefulSets, it should work with ECK. Performance-wise, the storage recommendations are no different from Elasticsearch in general; they do not change on Kubernetes.
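As a sketch of what that looks like, an ECK `Elasticsearch` spec can point its volume claim template at a StorageClass backed by local volumes. The `local-ssd` StorageClass name here is an assumption; it has to match whatever class your local volume provisioner exposes:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.6.0
  nodeSets:
  - name: hot
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data   # ECK's expected data volume name
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 375Gi         # one local SSD partition
        storageClassName: local-ssd  # assumed name of a local-volume StorageClass
```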

Operating local volumes on Kubernetes is still a bit of a challenge, but one that is not specific to ECK and applies to all stateful workloads. Off the top of my head, this issue is still present and must be worked around (normally with a double delete).

When a node dies, the disk is gone as well, right?

Yes. The corresponding Pod will stay Pending, as it cannot be scheduled on any other node with the same local PersistentVolume. At that point you can either:

  • attempt to recover the host and its data
  • manually delete the PersistentVolumeClaim and Pod, so a new Pod gets scheduled with an empty volume. This leads to data loss if you do not have replicas of the data on other Elasticsearch nodes.
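The manual recovery in the second option boils down to a double delete; the resource names below are placeholders for whatever your cluster actually calls them:

```
# Delete the claim bound to the lost local volume, then the Pending Pod,
# so the StatefulSet controller recreates the Pod with a fresh claim.
kubectl delete pvc elasticsearch-data-quickstart-es-hot-0
kubectl delete pod quickstart-es-hot-0
```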

Thanks for the quick response.
So a common operation such as a rolling node upgrade (say, because of a Kubernetes upgrade) would already cause a problem that ECK cannot handle.

I'd say it could be handled more gracefully by transferring the data from the leaving node to the new node, but I am not sure whether this is something ECK could even be responsible for.

As far as I can tell, automatic data migration is currently not possible in Kubernetes with ECK, because the Pod won't be scheduled (due to the missing PV). Thus I don't see how one could currently use local storage in Kubernetes, or am I missing something? That would be a pity, because local disk IOPS seem to be 10x higher than a Google persistent volume of the same size.

@weeco you are correct.

Currently I think there are two ways to deal with local volumes when upgrading Kubernetes hosts:

  1. get the same host back into the k8s cluster after the upgrade, in which case the Pod can start and use the volume again
  2. remove the host completely and replace it with a new one. In that case, as you point out, the Pod will stay Pending forever. The only way out is to manually delete it, along with its corresponding PVC. You could automate this, I guess, to auto-delete Pods whose nodes have left the cluster. As long as you have replicas of the Elasticsearch data elsewhere, the new node will replicate the missing shards. This could potentially be automated in ECK, but is quite tricky in practice: how would we know whether a node will come back, and how long should we wait for it to come back before removing the Pod and creating a new one?
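The decision logic behind such an automation could look something like the sketch below. It is not the real Kubernetes API -- the tuple shapes, names, and the grace period are all assumptions -- it only illustrates the "wait, then give up on the node" policy from option 2:

```python
# Sketch: pick which Pending Pods (and their PVCs) are safe to delete
# because their local-volume node has left the cluster for too long.
# Data shapes and the grace period are assumptions, not real API objects.

GRACE_SECONDS = 600  # how long to wait for a node to come back

def pods_to_recreate(pending_pods, live_nodes, pending_seconds):
    """pending_pods: list of (pod_name, pvc_name, node_name) tuples for Pods
    stuck Pending on a local PersistentVolume.
    live_nodes: set of node names currently in the cluster.
    pending_seconds: dict pod_name -> seconds the Pod has been Pending.
    Returns (pod, pvc) pairs to delete so new Pods get fresh volumes."""
    doomed = []
    for pod, pvc, node in pending_pods:
        # Only give up on Pods whose node is gone AND past the grace period.
        if node not in live_nodes and pending_seconds.get(pod, 0) > GRACE_SECONDS:
            doomed.append((pod, pvc))
    return doomed
```

The hard part, as noted above, is choosing the grace period: too short and you throw away data from a node that was about to rejoin; too long and the cluster runs degraded.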

I created an issue in our GitHub repo: I think this is a reasonable feature to think about. Please feel free to comment there if you have ideas on what that should look like.