Hi, I'm figuring out why taking a snapshot of my cluster is necessary and how to do it:
My understanding is that each node within the cluster needs access to the repository location - and to do this I will need to mount a shared file system.
Where does the backup directory reside? On a server completely separate from the cluster, or can the backup exist on a node?
How large is the backup? Surely if an entire cluster's data exists on one server it would take up a large amount of space - also the whole point of ES is to distribute data between nodes?
Correct, although a minor clarification: All data nodes need access to the repository location. If this is a shared file system they will all need it mounted locally. If the location is something like Amazon S3, they'll all need internet access and credentials to upload, etc.
Depends on where the repository location is. You could have the repository location on one of the data nodes, but that wouldn't be a great idea. Snapshots are for disaster recovery, so it should ideally be somewhere else (unrelated storage server, offsite, cloud backup, etc).
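For what it's worth, registering a shared file system repository is a single `PUT /_snapshot/<name>` call with `"type": "fs"`. Here is a minimal sketch that only builds the request so you can inspect it. The host `http://localhost:9200`, the repository name `my_backup`, and the mount point `/mnt/es_backups` are placeholders I made up, and the mount point would also have to be listed under `path.repo` in `elasticsearch.yml` on the nodes.

```python
import json
import urllib.request


def build_register_repo_request(base_url, repo_name, location):
    """Build (but do not send) the request that registers an fs repository."""
    # "fs" is the shared file system repository type; "location" is the
    # path where the shared file system is mounted on the nodes.
    body = {"type": "fs", "settings": {"location": location}}
    return urllib.request.Request(
        url=f"{base_url}/_snapshot/{repo_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# All three values below are hypothetical placeholders.
req = build_register_repo_request(
    "http://localhost:9200", "my_backup", "/mnt/es_backups"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it against a running cluster: urllib.request.urlopen(req)
```

Pointing `location` at storage that lives outside the cluster keeps the snapshots usable even if the nodes themselves are lost.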
The first backup of your index will be approximately the same size as the set of primary shards making up that index. E.g. if the index has 5 primary shards totaling about 50 GB, and a replica (which adds another 50 GB on disk since it's a copy), the first snapshot will still be about 50 GB total, because replicas are not copied into the repository.
Note: subsequent snapshots will be a lot smaller. Snapshots are incremental: each one only copies the data that is not already in the repository from earlier snapshots. So the second snapshot only stores the changes that have occurred since the first one and will be considerably smaller because of that.
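To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The function names are made up for illustration, and the repository growth model (first snapshot plus the sum of later deltas) is a simplifying assumption: in practice segment merges can make a delta larger than the raw amount of new data.

```python
def cluster_disk_usage_gb(primary_gb, replicas):
    """Disk used in the cluster: the primaries plus one full copy per replica."""
    return primary_gb * (1 + replicas)


def first_snapshot_size_gb(primary_gb):
    """Only primary shards are snapshotted, so replicas add nothing."""
    return primary_gb


def repository_size_gb(primary_gb, deltas_gb):
    """Assumed model: first full snapshot plus the changes stored by later ones."""
    return first_snapshot_size_gb(primary_gb) + sum(deltas_gb)


# The example from above: 50 GB of primaries with one replica.
print(cluster_disk_usage_gb(50, replicas=1))   # 100
print(first_snapshot_size_gb(50))              # 50
# Three later snapshots that each add a hypothetical 2 GB of changes.
print(repository_size_gb(50, [2, 2, 2]))       # 56
```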
Distributing data between nodes is indeed the point! But the sharding scheme of Elasticsearch is for runtime high-availability. If you lose a node, the cluster is fine because there are replicas and data can be shuffled around.
That's not what snapshots are for. Snapshots are for disaster recovery. If you irreversibly lose a bunch of nodes (datacenter catches on fire, hurricane, bad virus wipes the nodes, hacker encrypts all the data and demands bitcoins, whatever), snapshots are so you can spin up a new cluster and restore it.
Hi, thanks so much for your reply. Just something else that came to mind: the data within the cluster is time based and gets purged once it's over 2 weeks old. How can I make sure the same occurs in the snapshot? Will I have to take a new snapshot every day to reflect the new state of the cluster (and delete the old snapshot to save space)?
Also the master nodes, mainly so that the elected master can coordinate the effort to make sure that every data node has access...
It depends what you mean by "purged". If you mean that you are using time-based indices, and you delete entire indices after they have expired, then yes it works to take snapshots frequently and delete old ones in order to remove the expired indices from the repository. If you are not expiring whole indices then the story gets a lot more complicated.
Note that it takes essentially no extra space to snapshot an index that hasn't changed since the last snapshot, because snapshots are taken incrementally. It's not unusual to take snapshots much more frequently than daily because of this - every 30 minutes is a common frequency.
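As a sketch of the "snapshot often, delete old ones" approach with a 2 week retention: the helper below picks the snapshots that have aged out and builds the matching `DELETE /_snapshot/<repo>/<snapshot>` paths. The snapshot names, the repository name `my_backup`, and the helper itself are made up for illustration, and it assumes you already have each snapshot's start time (e.g. from listing the repository). Deleting a snapshot only removes the files that no remaining snapshot still references, so an expired index disappears from the repository once every snapshot containing it is gone.

```python
from datetime import datetime, timedelta


def expired_snapshots(snapshots, now, max_age_days=14):
    """Return the names of snapshots that started more than max_age_days ago.

    `snapshots` is a list of (name, start_time) pairs.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [name for name, started in snapshots if started < cutoff]


def delete_paths(repo_name, names):
    """Build the REST paths for deleting the given snapshots (not sent here)."""
    return [f"/_snapshot/{repo_name}/{name}" for name in names]


# Hypothetical snapshot listing and a fixed "now" for a repeatable example.
now = datetime(2024, 1, 20)
snapshots = [
    ("snap-2024.01.01", datetime(2024, 1, 1)),
    ("snap-2024.01.10", datetime(2024, 1, 10)),
    ("snap-2024.01.19", datetime(2024, 1, 19, 12, 0)),
]

old = expired_snapshots(snapshots, now)
for path in delete_paths("my_backup", old):
    print("DELETE", path)
```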