Back up large indices to another location on a regular basis and clean up the index

To get a good answer it generally helps to describe what you are trying to achieve in more detail and provide context. A single sentence in the title and body does not give us much to go on.

Have you looked at the snapshot API, which is the only supported way to back up data in Elasticsearch?
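For reference, the basic workflow looks something like this (the repository name, type and path below are placeholders you would adapt to your environment):

```
# Register a shared filesystem snapshot repository. The location must be
# listed under path.repo in elasticsearch.yml on every node.
PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/my_backup"
  }
}

# Take a snapshot of one or more indices into that repository.
PUT _snapshot/my_backup/snapshot_2023_01?wait_for_completion=true
{
  "indices": "my-index",
  "include_global_state": false
}
```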

I have a 3 TB index. I want to back up the oldest data on a monthly basis to save space, then clean up the backed-up data from that index.

Are you using time based indices or are you looking to backup and delete specific data within indices? What is your use case?

Can the snapshot API be used to back up part of an index based on time?

backup and delete specific data within indices
Only the most recent six months of data are kept in the index.
For example, at the beginning of August I back up January's data, store the backup file in another location, and delete the backed-up January data from the index.

All built-in retention management in Elasticsearch assumes you are using time-based indices of some kind and works at the index level. There is no support for selectively managing partial data within indices, so this is something you will need to build yourself through custom scripts/applications.
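As a rough sketch of what such a custom workflow could look like: copy the old documents into a temporary index, snapshot that index, then delete the documents from the live index. The index names, the timestamp field and the date range below are all assumptions:

```
# 1. Copy January's documents into a temporary index.
POST _reindex
{
  "source": {
    "index": "my-index",
    "query": {
      "range": { "timestamp": { "gte": "2023-01-01", "lt": "2023-02-01" } }
    }
  },
  "dest": { "index": "my-index-2023.01-archive" }
}

# 2. Snapshot only the temporary index (repository assumed already registered).
PUT _snapshot/my_backup/archive_2023_01?wait_for_completion=true
{
  "indices": "my-index-2023.01-archive",
  "include_global_state": false
}

# 3. Remove the archived documents from the live index. Delete-by-query is
#    expensive on a 3TB index, so run it during off-peak hours.
POST my-index/_delete_by_query
{
  "query": {
    "range": { "timestamp": { "lt": "2023-02-01" } }
  }
}

# 4. Drop the temporary index once the snapshot has succeeded.
DELETE my-index-2023.01-archive
```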

Thank you very much for your reply. Are there any example scripts or similar cases I could look at? I am just getting started with Elasticsearch and need some help.

No, not that I am aware of.

Best practice is generally to use time-based indices if you can, as that makes retention management much more efficient (it is a lot faster and cheaper to drop an entire index than to delete a large portion of documents from it). As you have not provided any details around the use case or why you are indexing the way you are, it is however impossible to tell whether this applies to you or not.
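To illustrate the difference (index names and the timestamp field are hypothetical):

```
# Dropping an entire time-based index is a near-instant metadata operation:
DELETE logs-2023.01

# ...whereas removing the same data from one big index marks every matching
# document as deleted and relies on segment merges to reclaim the space:
POST logs/_delete_by_query
{
  "query": {
    "range": { "@timestamp": { "lt": "2023-02-01" } }
  }
}
```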

I agree with Christian here,

You should look into time-based indices or data streams with an appropriate ILM policy and management of the backing indices.

You could then easily use the snapshot feature to store indices on another/remote server.
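A minimal sketch of that setup, assuming hypothetical template, data stream and policy names:

```
# Index template that turns matching names into data streams and attaches
# an ILM policy to every backing index it creates.
PUT _index_template/my-logs-template
{
  "index_patterns": ["my-logs*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "my-logs-policy"
    }
  }
}

# Documents written to a data stream must carry a @timestamp field.
# The first write auto-creates the data stream.
POST my-logs/_doc
{
  "@timestamp": "2023-04-02T12:33:31Z",
  "message": "example event"
}
```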

What type of data are you storing in Elasticsearch? What is the use case? Do you perform updates on documents or are the documents immutable once written?

The documents are immutable once written.
Data already stored in the database will not be modified, and new data will be added over time.
I don't know how to determine the timestamp field of this index. Is it this one?
"timestamp": 1680441211865,

If your data is immutable I would recommend you switch to using time-based indices, e.g. data streams. With this approach all new data is written to the newest backing index and when they reach a certain size a new backing index is created behind the scenes and writing of new data switches to this. This means that only a small portion of indices are actively written to, which means they can be optimised, e.g. through forcemerge. This approach also allows you to manage retention by deleting complete indices, which is a lot more efficient that deleting data through delete-by-query. You can do this through index lifecycle management.
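As a sketch, an ILM policy along these lines would implement that lifecycle (the policy name and all thresholds are illustrative, not recommendations):

```
# Roll over the write index by size/age, forcemerge indices that are no
# longer written to, and delete them roughly six months after rollover.
PUT _ilm/policy/my-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "warm": {
        "min_age": "1d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "180d",
        "actions": { "delete": {} }
      }
    }
  }
}
```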

You can then back up old indices through the snapshot API and use the restore API to load them at a later date if needed.
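For example (repository, snapshot and backing index names are placeholders; note that you cannot restore over an existing open index, so delete or close it first):

```
# Back up a specific backing index once it is no longer being written to.
PUT _snapshot/my_backup/my-logs-000001?wait_for_completion=true
{
  "indices": ".ds-my-logs-2023.01.01-000001",
  "include_global_state": false
}

# Restore it later if the data is needed again.
POST _snapshot/my_backup/my-logs-000001/_restore
{
  "indices": ".ds-my-logs-2023.01.01-000001"
}
```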
