Correct, although a minor clarification: All data nodes need access to the repository location. If this is a shared file system they will all need it mounted locally. If the location is something like Amazon S3, they'll all need internet access and credentials to upload, etc.
Depends on where the repository location is. You could have the repository location on one of the data nodes, but that wouldn't be a great idea. Snapshots are for disaster recovery, so it should ideally be somewhere else (unrelated storage server, offsite, cloud backup, etc).
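To make that concrete, here's a rough sketch of the two common setups. This is not a real client call, just the request bodies you'd send to `PUT _snapshot/<repo-name>`; the repository name, mount path, and bucket name are made-up examples.

```python
# Sketch of registering a snapshot repository (bodies for PUT _snapshot/my_backup).
# The path and bucket below are hypothetical placeholders.

import json

# Shared filesystem repository: every node needs this exact path mounted
# locally, and it must be whitelisted via path.repo in elasticsearch.yml.
fs_repo = {
    "type": "fs",
    "settings": {"location": "/mnt/es_backups"},  # hypothetical shared mount
}

# S3 repository (needs the repository-s3 plugin plus credentials on every
# node): each node uploads to S3 directly, so no shared mount is required.
s3_repo = {
    "type": "s3",
    "settings": {"bucket": "my-es-snapshots"},  # hypothetical bucket name
}

print(json.dumps(fs_repo))
```

The key point either way: every node that holds data must be able to reach the repository on its own, because shards are copied into the repository by the nodes holding them, not funneled through one machine.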
The first snapshot of an index will be approximately the same size as that index's primary shards, because snapshots only copy primaries, never replicas. E.g. if the index has 5 primary shards totaling about 50 GB, plus a replica (another 50 GB on disk, since it's a copy), the first snapshot will still be about 50 GB total.
Note: subsequent snapshots will be a lot smaller. Snapshots are incremental: each one stores only the delta between what's already in the repository and the current state of the cluster. So the second snapshot only stores the changes that have occurred since the first, and will be considerably smaller because of that.
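A toy model of why this works (illustration only, not Elasticsearch internals): Lucene segment files are immutable, so a snapshot only has to copy the files the repository doesn't already hold.

```python
# Toy model of incremental snapshots. Segment files are immutable, so a
# snapshot only uploads files not already present in the repository.

def snapshot(repo_files, index_files):
    """Upload only the files the repository doesn't already have.
    Returns the set of newly uploaded files."""
    new_files = index_files - repo_files
    repo_files |= new_files
    return new_files

repo = set()

# First snapshot: the whole index (say, 5 segment files) gets uploaded.
first = snapshot(repo, {"seg_1", "seg_2", "seg_3", "seg_4", "seg_5"})

# The index changes a little: one new segment has been written since.
second = snapshot(repo, {"seg_1", "seg_2", "seg_3", "seg_4", "seg_5", "seg_6"})

print(len(first), len(second))  # first uploads 5 files, second uploads only 1
```

Same idea at scale: a 50 GB index that barely changes between snapshots costs ~50 GB once, then a trickle.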
It is! But the sharding scheme of Elasticsearch is for runtime high-availability. If you lose a node, the cluster is fine because there are replicas and data can be shuffled around.
That's not what snapshots are for. Snapshots are for disaster recovery. If you irreversibly lose a bunch of nodes (datacenter catches on fire, hurricane, bad virus wipes the nodes, hacker encrypts all the data and demands bitcoins, whatever), snapshots are so you can spin up a new cluster and restore it.
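For completeness, the restore side is a single API call against the new cluster. This is just a sketch of the body you'd send to `POST _snapshot/<repo>/<snapshot>/_restore`; the repository, snapshot, and index names are made up.

```python
# Sketch of a restore request body (for POST _snapshot/my_backup/snapshot_1/_restore).
# Index pattern and names below are hypothetical.

restore_body = {
    # Restore only the indices you need; "*" would restore everything.
    "indices": "logs-*",
    # Optionally rename on the way in, to avoid clashing with live indices.
    "rename_pattern": "logs-(.+)",
    "rename_replacement": "restored-logs-$1",
}

print(restore_body["indices"])
```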
Hi, thanks so much for your reply. Something else that came to mind: the data within the cluster is time-based and gets purged once it's over two weeks old. How can I make sure the same happens in the snapshot? Will I have to take a new snapshot every day to reflect the new state of the cluster (and delete the old snapshots to save space)?
Also the master nodes: mainly so that the elected master can coordinate the effort and make sure that every data node has access...
It depends what you mean by "purged". If you mean that you are using time-based indices and deleting entire indices once they expire, then yes: taking snapshots frequently and deleting old ones will remove the expired indices from the repository. If you are not expiring whole indices then the story gets a lot more complicated.
Note that it takes essentially no extra space to snapshot an index that hasn't changed since the last snapshot, because snapshots are incremental. It's not unusual to take snapshots much more frequently than daily because of this: every 30 minutes is a common frequency.
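Deletion works the same way in reverse; here's a toy model (again, an illustration, not real Elasticsearch code). When you delete a snapshot, the repository discards only the files no remaining snapshot still references, which is why deleting the snapshots that cover an expired index actually reclaims its space.

```python
# Toy model of snapshot deletion: files are reference-counted across
# snapshots, and only files that become unreferenced are reclaimed.

# snapshot name -> set of segment files it references (hypothetical names)
snapshots = {
    "snap_monday":  {"logs-06-01/seg_1", "logs-06-02/seg_1"},
    "snap_tuesday": {"logs-06-02/seg_1", "logs-06-03/seg_1"},
}

def delete_snapshot(snapshots, name):
    """Remove a snapshot and return the files that became unreferenced."""
    removed = snapshots.pop(name)
    still_referenced = set().union(*snapshots.values()) if snapshots else set()
    return removed - still_referenced

freed = delete_snapshot(snapshots, "snap_monday")
print(sorted(freed))  # only logs-06-01's file is freed; logs-06-02 is still shared
```

If you're on a recent version (7.4+), snapshot lifecycle management (SLM) can automate this take-on-a-schedule-and-delete-after-retention cycle for you, so you don't need to script it yourself.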