We are running Elasticsearch on a Kubernetes cluster in AWS. We have created volume snapshots of the Elasticsearch PVCs using Amazon's EBS volume snapshotting method:
There are 7 Elasticsearch nodes in the cluster, each running in Kubernetes and backed by a persistent volume claim. We have created a snapshot of each PVC.
Then, when we want to use the snapshots (following a fresh install of our application), we create a new PVC from each snapshot, and Elasticsearch picks up the new PVCs.
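For context, this is the standard CSI snapshot/restore pattern. A rough sketch of the two objects involved (all names, the snapshot class, the storage class, and the size are placeholders; the CSI external-snapshotter controller must be installed in the cluster):

```yaml
# Snapshot an existing data PVC.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: es-data-0-snap                         # hypothetical name
spec:
  volumeSnapshotClassName: ebs-csi-snapclass   # assumed snapshot class
  source:
    persistentVolumeClaimName: es-data-0       # hypothetical PVC name
---
# Restore: a new PVC whose dataSource points at the snapshot.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: es-data-0                              # recreated after fresh install
spec:
  storageClassName: ebs-sc                     # assumed EBS CSI storage class
  dataSource:
    name: es-data-0-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi                           # must be >= the snapshot size
```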
However, there seems to be a lot of shard activity that has to complete after installation; it takes 3-4 hours to finish, and I am wondering why. Note that the nodes here are dynamically created, so you won't always get the same node names. Does Elasticsearch have to recreate all the shards and replicas because the node names have changed, even though there is already a copy of each shard on the PVC?
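For anyone seeing similar delays, the shard activity can be watched while it runs using the standard cat and cluster APIs (shown here as Kibana Dev Tools requests):

```
GET _cat/recovery?v&active_only=true   # shards currently being copied or rebuilt
GET _cat/shards?v                      # overall shard allocation across nodes
GET _cluster/allocation/explain        # why a particular shard is (or is not) allocated
```

The recovery output shows whether shards are being replayed from local data or copied over the network from another node, which helps distinguish "re-using the data on the PVC" from "rebuilding everything from scratch".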
Using the snapshot and restore API is, as specified in the documentation, the only supported way to back up data in Elasticsearch clusters. EBS snapshots are not a supported backup method and there is no guarantee that they will work at all. I believe I have seen it work with older versions of Elasticsearch, but a lot of resiliency improvements and consistency checks went into Elasticsearch 7.0 onwards which, as far as I know, make it a lot less likely to work.
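For reference, the supported workflow is a handful of API calls (the repository and snapshot names, and the bucket, are placeholders; on AWS the `repository-s3` type is typical, and indices being restored must not already exist open in the cluster):

```
PUT _snapshot/my_backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-backups"
  }
}

PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true

POST _snapshot/my_backup/snapshot_1/_restore
{
  "indices": "*"
}
```

Unlike a filesystem-level copy, this captures a consistent cluster-wide view of the data, which is exactly the property the documentation quoted below says a copy of the data directories cannot provide.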
We are working with very large volumes of data, 8 billion documents. EBS snapshots were the quickest way we could find.
We were using the Elasticsearch snapshot/restore API to back up to an EFS mount. So the data had to be copied from the PVCs to EFS, and then copied back to the PVCs when we wanted to restore. EBS volume snapshots are a quicker solution: creating a PVC from an EBS snapshot is pretty much instantaneous.
Elasticsearch does seem to work OK; it just has to do all the shard copying I mentioned.
I do not know exactly under which conditions restoring from EBS snapshots may or may not work, but have seen issues with it so would not recommend relying on it.
If you don't care whether it works correctly or not, then I'm pretty sure there are even quicker solutions than this. The docs Christian linked are correct: there's no supported method to restore from this kind of snapshot:
Taking a snapshot is the only reliable and supported way to back up a cluster. You cannot back up an Elasticsearch cluster by making copies of the data directories of its nodes. There are no supported methods to restore any data from a filesystem-level backup. If you try to restore a cluster from such a backup, it may fail with reports of corruption or missing files or other data inconsistencies, or it may appear to have succeeded having silently lost some of your data.
A copy of the data directories of a cluster’s nodes does not work as a backup because it is not a consistent representation of their contents at a single point in time. You cannot fix this by shutting down nodes while making the copies, nor by taking atomic filesystem-level snapshots, because Elasticsearch has consistency requirements that span the whole cluster. You must use the built-in snapshot functionality for cluster backups.
It might appear to work most of the time, but I doubt "works most of the time" is good enough.