Hi,
We use the Elasticsearch snapshot API to take a backup of our ES data daily, and we also regularly copy the entire content of the snapshot repository to another remote system.
Now that we have a huge amount of data in ES (in the order of TBs), while the daily increment to the actual data is very small (in the order of a few GBs), we do not think it is feasible to transfer TBs of backed-up data to the remote system every day.
As ES snapshots are incremental (each snapshot of an index only stores data that is not part of an earlier snapshot):
1. Is it possible to clearly identify only the incremental changes to the snapshot repo as part of creating a snapshot?
This would help us transfer only the incremental snapshot content to the remote system and not the entire thing.
If yes, can you please help us understand how that can be achieved?
2. If point 1 can be achieved, how do you think restore would work? Can the incremental snapshots be combined and placed into the snapshot repo for a restore to ES?
Note: Version used - elasticsearch-oss:7.0.1
Any pointers/suggestions would be appreciated.
Today the best you can do is make sure that no snapshots are currently running and then transfer any new or changed files to the remote system. A tool like rsync would do this, but be warned that if it fails part-way through a transfer then it may leave the remote repository in an inconsistent state. It's probably simpler and more reliable to snapshot directly to the remote system using a shared filesystem (or, e.g., something like MinIO). Or just use a public cloud for your snapshots: they're very reliable and secure, and not that expensive given the hassle you're currently facing. It only costs about $25 per month to store 1TB of data on S3.
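For illustration, a minimal sketch of snapshotting directly into a repository on a shared mount (the endpoint, repository name and mount path are assumptions; whatever location you use must be whitelisted via `path.repo` in elasticsearch.yml):

```python
import datetime

import requests

ES = "http://localhost:9200"  # assumption: local ES HTTP endpoint

# Register an "fs" repository whose location is a shared mount that the
# remote system also sees (repo name and mount path are assumptions).
requests.put(
    f"{ES}/_snapshot/remote_backup",
    json={"type": "fs", "settings": {"location": "/mnt/remote_es_backup"}},
).raise_for_status()

# Take the daily snapshot straight into that repository. Each snapshot only
# adds segment files that earlier snapshots don't already contain, so the
# data written to the shared mount is incremental by construction.
today = datetime.date.today().isoformat()
requests.put(
    f"{ES}/_snapshot/remote_backup/snapshot-{today}",
    params={"wait_for_completion": "true"},
).raise_for_status()
```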
To restore, you would need to make the entire repository available to Elasticsearch, either by exposing it on a shared filesystem (or, e.g., something like MinIO) or else by copying all the files to somewhere that Elasticsearch can access.
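The restore itself is then just a call against that same repository, roughly like this (names are again assumptions; the repository must contain all of its files, not only the most recent ones):

```python
import requests

ES = "http://localhost:9200"  # assumption: local ES HTTP endpoint

# Restore selected indices from a snapshot in the "remote_backup" repository
# (repo and snapshot names are assumptions). Indices with the same names must
# be closed or deleted in the cluster before they can be restored.
requests.post(
    f"{ES}/_snapshot/remote_backup/snapshot-2019-07-01/_restore",
    params={"wait_for_completion": "true"},
    json={"indices": "log-*", "ignore_unavailable": True},
).raise_for_status()
```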
Thanks for the reply.
Sorry if I was not clear with my question.
We already have a process that backs up the ES data. The backup sequence below runs when a backup is triggered (currently a daily cron); a rough sketch follows the list.
1. The snapshot API runs and stores the snapshot on a local volume, say es_backup (glusterfs).
2. Another process then tars the content of es_backup and sends it to the remote repo.
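Roughly, the backup step looks like this sketch (the endpoint, repository name, paths and the scp destination are assumptions, not our exact setup):

```python
import datetime
import subprocess
import tarfile

import requests

ES = "http://localhost:9200"          # assumption: local ES HTTP endpoint
REPO_PATH = "/var/backups/es_backup"  # assumption: the glusterfs mount
today = datetime.date.today().isoformat()

# 1. Take the daily snapshot into the local repository (repository assumed to
#    be registered as "es_backup" with this location whitelisted in path.repo).
requests.put(
    f"{ES}/_snapshot/es_backup/snapshot-{today}",
    params={"wait_for_completion": "true"},
).raise_for_status()

# 2. Tar the whole repository and ship it off-site; this is the step that
#    currently re-transfers all of the backed-up TBs every day.
archive = f"/tmp/es_backup-{today}.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(REPO_PATH, arcname="es_backup")
subprocess.run(["scp", archive, "backup-host:/backups/"], check=True)
```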
The sequence followed during restore (sketched below):
1. Clear the es_backup volume.
2. Copy the content from the remote repo and untar it into es_backup.
3. Run the restore API to restore from es_backup.
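A rough sketch of that restore step (again, all names and paths are assumptions):

```python
import shutil
import tarfile
from pathlib import Path

import requests

ES = "http://localhost:9200"                # assumption: local ES HTTP endpoint
REPO_PATH = Path("/var/backups/es_backup")  # assumption: the glusterfs mount

# 1. Clear the es_backup volume.
if REPO_PATH.exists():
    shutil.rmtree(REPO_PATH)

# 2. Untar the archive fetched from the remote repo (path is an assumption);
#    the archive was created with "es_backup" as its top-level directory.
with tarfile.open("/tmp/es_backup-latest.tar.gz") as tar:
    tar.extractall(REPO_PATH.parent)

# 3. Restore from the repository (snapshot name is an assumption). Indices
#    with the same names must be closed or deleted before restoring.
requests.post(
    f"{ES}/_snapshot/es_backup/snapshot-2019-07-01/_restore",
    params={"wait_for_completion": "true"},
).raise_for_status()
```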
Issue:
ES is used to store a huge amount of data, around 5 TB.
The amount of data pushed every day is only a few GBs.
Even though the snapshot API does incremental backups, the backup that is moved to the remote repo is the complete content of es_backup (not incremental).
Expectation:
Just like the ES snapshot API takes incremental backups, is it possible to move ONLY this incremental backup to the remote repo? How do we identify only the incremental changes?
If this is possible, then in case of a disaster where we want to restore the last 7 days of data, will copying the previous 7 days of incrementally backed-up data from the remote repo back to es_backup and running the restore API bring the data back into ES?
I hope I was able to explain my question better now. Please let me know if more details are required.
I think your original question was quite clear, and my answer to your clarification is the same.
Compare the files (e.g. by their last-modified date) using a tool like rsync, but be warned that this might be unreliable. It would be better to snapshot directly to the remote system.
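If you do go the file-comparison route, the idea is roughly this sketch (paths are assumptions; note it does not remove files that have been deleted from the repository, which matters once old snapshots are deleted and which rsync's --delete option would handle):

```python
import os
import shutil

SRC = "/var/backups/es_backup"  # assumption: local repository path
DST = "/mnt/remote/es_backup"   # assumption: remote repository mounted locally

# Copy only files that are new, or whose size/mtime changed since the last
# run, which is roughly what rsync does. Only run this while no snapshot is
# in progress, otherwise the remote copy can end up inconsistent.
for root, _dirs, files in os.walk(SRC):
    rel = os.path.relpath(root, SRC)
    os.makedirs(os.path.join(DST, rel), exist_ok=True)
    for name in files:
        src = os.path.join(root, name)
        dst = os.path.join(DST, rel, name)
        stat = os.stat(src)
        if (not os.path.exists(dst)
                or os.path.getsize(dst) != stat.st_size
                or os.path.getmtime(dst) < stat.st_mtime):
            shutil.copy2(src, dst)
```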
No, Elasticsearch needs access to the whole repository, not just the last 7 days of changes.
Every day a new index gets created (in the format log-<yyyy.mm.dd>) with new data ingested into it.
We have configured the retention of indices as 20 days (using elasticsearch-curator, which runs every day).
So every day, all indices older than 20 days get deleted.
Then the Elasticsearch snapshot API is used to back up the current data in ES incrementally.
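For context, that daily snapshot step is roughly the following sketch (the endpoint and repository name are assumptions):

```python
import datetime

import requests

ES = "http://localhost:9200"  # assumption: local ES HTTP endpoint
today = datetime.date.today().strftime("%Y.%m.%d")

# Daily cron step: snapshot the log-* indices into the existing repository
# (repository name "es_backup" is an assumption). Each snapshot stores only
# the segment files that earlier snapshots don't already contain.
requests.put(
    f"{ES}/_snapshot/es_backup/snapshot-{today}",
    params={"wait_for_completion": "true"},
    json={"indices": "log-*", "ignore_unavailable": True},
).raise_for_status()
```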
Now, can only the last 20 days' incrementally backed-up data be restored?