Elasticsearch snapshot to filesystem (EC2): full dump or sync?

(Pratyush Agarwal) #1

Hi Community,

I am trying to take a snapshot from Elasticsearch to an EC2 instance and then upload it to S3.

curl -XPUT 'http://localhost:9200/_snapshot/s3-backup/snapshot_1?wait_for_completion=true'

Will the above command make a full dump to the EC2 instance (in the specified folder), or will it sync and update only the delta?

Also, I am using the following command to sync with S3. Will it upload only the delta, or will it dump all the data again?
aws s3 --region us-east-1 sync /var/www s3://osSourceCode.s3-website

If either of them does not sync, can you please suggest alternatives (if any) for syncing?

(Zachary Tong) #2

If this is the first time you run that snapshot command, it will initiate a full backup of the entire index.

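If a snapshot already exists in the repository, the next one will sync the delta only. Only the changes between the last snapshot and now will be sent to the repository. Note that snapshot names must be unique within a repository, so a second run needs a new name (e.g. snapshot_2 instead of snapshot_1).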
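To make the "first run versus later runs" point concrete, here is a minimal sketch of issuing the same request as the curl command above, but with a fresh snapshot name on every run. The helper names (`snapshot_url`, `timestamped_name`, `take_snapshot`) and the timestamp naming scheme are assumptions made for illustration; only the URL shape comes from the command in the question.

```python
import urllib.request
from datetime import datetime, timezone


def snapshot_url(base_url, repo, name):
    """Build the create-snapshot URL, matching the curl command in the question."""
    return f"{base_url}/_snapshot/{repo}/{name}?wait_for_completion=true"


def timestamped_name(now, prefix="snapshot"):
    """Hypothetical naming scheme: a lowercase name that differs on every run."""
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}"


def take_snapshot(base_url, repo, name):
    """Send the PUT request. Needs a reachable cluster and a registered repository."""
    request = urllib.request.Request(snapshot_url(base_url, repo, name), method="PUT")
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call take_snapshot(...) to actually create the snapshot.
    name = timestamped_name(datetime.now(timezone.utc))
    print(snapshot_url("http://localhost:9200", "s3-backup", name))
```

Each call with a new name creates a separate snapshot in the same repository, and that is the case in which only the delta is copied.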

No idea about the aws s3 command though :)

(Pratyush Agarwal) #3

Thanks for the reply, @polyfractal. By "sync", do you mean that it will also delete data from the filesystem that has been removed from Elasticsearch (in addition to sending the new entries)?

(Zachary Tong) #4

Ah, sorry, I wasn't very clear.

Elasticsearch data at a low level is stored in "segments". One shard will be composed of many segments. Segments are immutable (once written, we never change them until they are deleted), and some of these segments can be very long-lived.

So when we snapshot an index, we make a note of all active segments and transfer the segment contents to the repository.

When a second snapshot is made, we make a note of all active segments and see which segments are new compared to what we've stored in the repository. Only the new segments are sent. If some of the segments have been deleted between the first and second snapshot, nothing happens. We know they aren't used in the second snapshot, but the first snapshot is still using them. If we were to delete the now-unused segments we'd corrupt the old snapshot.

However, if you delete the first snapshot, Elasticsearch will compare the two snapshots and find all segments that are only used by the first, and delete just those. So it basically does a delta-delete between the snapshot and all other snapshots, just removing the unneeded ones.

Hope that helps clear things up. Basically, you won't see any space reclaimed in the snapshot repository unless you also delete old snapshots. And even then, the amount reclaimed may be variable. E.g. if you snapshot an index twice very quickly, the segments won't have changed, so the snapshots will be identical. Deleting the first snapshot won't change anything.
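The behaviour described above can be sketched as a toy model. This is not Elasticsearch's actual implementation; the class `SnapshotRepository` and the segment names are hypothetical, and the model only tracks which segments each snapshot references.

```python
class SnapshotRepository:
    """Toy model: a repository of immutable segments shared between snapshots."""

    def __init__(self):
        self.stored_segments = set()  # segments physically present in the repository
        self.snapshots = {}           # snapshot name -> frozenset of referenced segments

    def create(self, name, active_segments):
        """Record a snapshot and return the segments that had to be copied."""
        if name in self.snapshots:
            raise ValueError(f"snapshot {name!r} already exists")
        active = frozenset(active_segments)
        new_segments = active - self.stored_segments  # only the delta is sent
        self.stored_segments |= new_segments
        self.snapshots[name] = active
        return new_segments

    def delete(self, name):
        """Remove a snapshot and return the segments that were freed."""
        removed = self.snapshots.pop(name)
        still_needed = set().union(*self.snapshots.values())
        freed = removed - still_needed  # segments used by no other snapshot
        self.stored_segments -= freed
        return freed


if __name__ == "__main__":
    repo = SnapshotRepository()
    print(sorted(repo.create("snapshot_1", {"seg_a", "seg_b"})))  # ['seg_a', 'seg_b']
    # seg_a has since been removed from the index, seg_c is new.
    print(sorted(repo.create("snapshot_2", {"seg_b", "seg_c"})))  # ['seg_c']
    print(sorted(repo.stored_segments))     # ['seg_a', 'seg_b', 'seg_c']
    print(sorted(repo.delete("snapshot_1")))  # ['seg_a']
```

In this model, space is reclaimed only in `delete`, and only for segments that no remaining snapshot references. Two snapshots of an unchanged index reference the same segments, so deleting one of them frees nothing.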
