I do not think removing old snapshots is the solution: if an old snapshot has segments that newer snapshots do not have, we will not be able to recover that data after removing it...
Does Elasticsearch have a way to identify which snapshots are safe to remove because the latest snapshots 'cover' them? Or does Elasticsearch have a way to clean up old backup segments that are covered by the latest segments?
The other solution is periodically generating a new snapshot from scratch... but I am not sure if this is the best solution.
Segments are reference counted, so will not be removed as long as there is one snapshot that uses them even if they were copied as part of a snapshot that is being deleted. You can therefore safely remove older snapshots without compromising the integrity of newer ones.
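Since deleting older snapshots is safe, a retention job could be scripted along these lines. This is only a rough sketch, assuming a cluster reachable over plain HTTP without authentication; the helper names `snapshots_to_delete` and `prune_repository` and the `keep_latest` parameter are made up for illustration. It relies on the standard snapshot REST endpoints (`GET /_snapshot/<repository>/_all` and `DELETE /_snapshot/<repository>/<snapshot>`).

```python
import json
import urllib.request


def snapshots_to_delete(snapshot_infos, keep_latest):
    """Return the names of all but the newest `keep_latest` snapshots.

    `snapshot_infos` is the "snapshots" list from GET /_snapshot/<repo>/_all.
    """
    if keep_latest < 1:
        raise ValueError("keep at least one snapshot")
    # Oldest first, so everything before the last `keep_latest` entries is expendable.
    ordered = sorted(snapshot_infos, key=lambda s: s["start_time_in_millis"])
    return [s["snapshot"] for s in ordered[:-keep_latest]]


def prune_repository(base_url, repository, keep_latest):
    """Hypothetical helper: list snapshots, then delete the older ones one by one."""
    list_url = f"{base_url}/_snapshot/{repository}/_all"
    with urllib.request.urlopen(list_url) as response:
        infos = json.load(response)["snapshots"]
    for name in snapshots_to_delete(infos, keep_latest):
        delete_url = f"{base_url}/_snapshot/{repository}/{name}"
        request = urllib.request.Request(delete_url, method="DELETE")
        # Segments still referenced by remaining snapshots are kept by Elasticsearch.
        urllib.request.urlopen(request).close()
```

For example, `prune_repository("http://localhost:9200", "my_backup", keep_latest=7)` would keep the seven most recent snapshots, where the URL and repository name are placeholders for your own setup.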
It is handled automatically by the snapshot process and is not visible as far as I know. This blog post is a bit old but describes how it works quite well.
Does that mean the snapshot process automatically 'recycles' useless segments?
I feel this is not really possible if it does not know which snapshots users no longer want to keep.
Also, this may not be what I asked; my question may have been confusing. Let me give an example to explain it.
I have a full snapshot S0. After that I made daily snapshots S1, S2, ... Sn
I only plan to restore from the latest snapshot Sn.
As n grows, the total size of all Si keeps getting larger.
So it could be that a very old Sj refers to segA and segB, and a newer Si (i>j) refers to the new segA' and segB'. If I only restore from Sn (n>=i), I only need segA' and segB', not segA and segB.
So in my case, it is safe to remove segA and segB from the repository, which would reduce its size.
However, I do not think Elasticsearch can do this automatically, because segA and segB are still needed if anyone wants to restore from Sj.
Would another way to reduce the size be to create a new full snapshot from scratch every week?
If snapshot S0 copies a segment that then does not change, snapshot S1 will not copy it again but will reference it instead. If you then delete snapshot S0, this segment stays in the repository as it is still used by snapshot S1.
Snapshot S1 will therefore contain all segments that were present in the cluster at the time it was taken no matter when they were copied.
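To make the reference counting concrete, here is a toy model in Python. It is not Elasticsearch code, just a sketch of the behaviour described above; the class `ToyRepository` and the segment names (`segA2` standing in for segA') are invented for the example.

```python
class ToyRepository:
    """Toy model of an incremental snapshot repository (not Elasticsearch code)."""

    def __init__(self):
        self.snapshots = {}            # snapshot name -> set of segments it references
        self.stored_segments = set()   # segment files physically held in the repository

    def create_snapshot(self, name, live_segments):
        """Record a snapshot; copy only segments the repository does not hold yet."""
        live = set(live_segments)
        copied = live - self.stored_segments
        self.stored_segments |= copied
        self.snapshots[name] = live
        return copied

    def delete_snapshot(self, name):
        """Drop a snapshot; remove only segments no remaining snapshot references."""
        removed = self.snapshots.pop(name)
        still_referenced = set().union(*self.snapshots.values())
        garbage = removed - still_referenced
        self.stored_segments -= garbage
        return garbage


repo = ToyRepository()
copied_s0 = repo.create_snapshot("S0", {"segA", "segB", "segC"})
# By the time S1 is taken, segA and segB were merged into new segments; segC is unchanged.
copied_s1 = repo.create_snapshot("S1", {"segA2", "segB2", "segC"})
freed = repo.delete_snapshot("S0")

print(sorted(copied_s1))             # only the new segments were copied for S1
print(sorted(freed))                 # segments only S0 used are removed
print(sorted(repo.stored_segments))  # everything S1 needs is still there
```

In this model, deleting S0 frees segA and segB (the old segments from the example) while segC survives because S1 still references it, so no periodic "from scratch" snapshot is needed to reclaim space.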