If you are using daily indices I assume you are only writing to the current one. If this is the case older indices will effectively become read-only (you may want to enforce this through ILM policy) and the segments no longer change after the first couple of days. Indices are then deleted, which does not affect the segments.
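If you do want ILM to enforce the read-only behaviour, one option is a policy with a `readonly` action in the warm phase. A sketch, with made-up policy name, ages, and retention (adjust to your own rollover/retention needs):

```
PUT _ilm/policy/daily-logs-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "2d",
        "actions": {
          "readonly": {},
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

The `forcemerge` is optional, but since the segments will never change again it reduces segment count once and for all, which also helps incremental snapshots.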
Why not? I can test this on a single node on a laptop by using smaller indices and reduced data volumes.
You're asking questions in slightly roundabout ways, drip-feeding new information and random restrictions.
2 pieces of advice:
Any testing you do, to be truly informative, should be on a system whose usage patterns are as similar as possible to your production use case, the same as any sort of integration testing.
You seem concerned with what you call bloat. Consider carefully the scenarios your snapshots are intended to protect against or mitigate. Your production dataset is 600 GB, which by most standards is pretty small. I don't know how many restorable snapshots you are aiming to keep, or for how long, but the implication is that you are trying to minimise this? Bad things can't be guaranteed to happen at convenient times, nor are all bad things picked up right away.
I created a backup of an Elasticsearch cluster that was 716 MB in size. After taking the snapshot, the folder size on the operating system was as follows. Subsequently, I added an index that was 142.8 MB, and after taking another snapshot, the size of the "indices" folder increased to 864 MB, which was expected. However, after dropping the index and taking another snapshot, I noticed that the size did not decrease. Shouldn't the related segments be removed once we drop the index?
[root@prestovm1 clusterbkp]# du -sh indices/
718M indices/
No. Snapshots are incremental: segments stay in the repository for as long as any snapshot still references them, so dropping an index does not shrink the folder by itself. When adding or removing indices, take a fresh snapshot of the cluster and delete any old snapshots that are no longer needed; only then are the unreferenced segments removed and space reclaimed at the OS level.
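Space should be reclaimed by deleting snapshots through the snapshot API, never by removing files from the repository directory by hand; Elasticsearch then cleans up any segments no longer referenced by a remaining snapshot. A sketch, where the repository name matches the `clusterbkp` path above and the snapshot names are examples:

```
PUT _snapshot/clusterbkp/snapshot-2024-06-02?wait_for_completion=true

DELETE _snapshot/clusterbkp/snapshot-2024-06-01
```

After the `DELETE` completes, re-running `du -sh indices/` should show the drop, assuming no other snapshot still references the dropped index's segments.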
Do you mean you only ever want exactly one snapshot, taken X hours/days/whatever ago, in your disk repository? On an ongoing basis?
Anything that was in an Elasticsearch index before that single last snapshot was taken, but was subsequently deleted, is of no interest to you?
(I would not personally call that an effective backup strategy)
There seems to be a misunderstanding here.
If you made a snapshot when indexA was part of the cluster, and have not actively removed that snapshot, then you should still have a restorable snapshot that could be used to restore indexA to its state at the time that snapshot was made.
If you subsequently delete indexA, create and populate indexB, and then make another snapshot, you should then have two restorable snapshots (two restore points), albeit in the same repo.
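For example, even after indexA has been deleted from the cluster, it can be restored from the first snapshot. A sketch using the restore API, with made-up repository and snapshot names:

```
POST _snapshot/my_repo/snapshot-1/_restore
{
  "indices": "indexA"
}
```

This works because deleting an index from the cluster does not touch the segments already copied into the repository; they remain until every snapshot referencing them is deleted.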
On the slightly wider topic …
In my experience most people work with a number of restore points in their repos. This is similar to old-school tape backups, where retention policies would eventually allow tapes to be re-used. But in these things, and in its uglier great uncle Disaster Recovery, there are usually other complications. There's nothing wrong with trying to keep things simple, but please keep in mind the reasons you are making snapshots. Maybe a slight shift of focus away from trying to minimise bytes used. IMHO.
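Rather than managing retention by hand, this kind of rolling restore-point scheme can be automated with snapshot lifecycle management (SLM). A sketch, with illustrative schedule and retention values (the repository name again follows the example above):

```
PUT _slm/policy/nightly-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "clusterbkp",
  "config": { "indices": "*" },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
```

SLM takes the snapshots on schedule and deletes expired ones for you, so the repository converges on your chosen number of restore points instead of requiring manual cleanup.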