First off, how many are we talking about? Second, why would you want to do that? A snapshot captures the state of the selected indices in the cluster at the exact moment the snapshot is initiated. Restoring all snapshots might result in overwriting a lot of data, not to mention that you would have to repeatedly close the indices that each newer snapshot would overwrite.
Typically speaking, you will only ever restore indices from a single snapshot, taken at the exact point in time you want to recreate. If you have a different reason for restoring all indices from all snapshots, what is it?
You are right, a snapshot can contain several indices. I was confused because I read somewhere that snapshot creation is incremental. Can you elaborate on this? Does each snapshot hold the full size of all indices?
One more question: do you know how to keep the repository folders and files in an S3 bucket more structured? Currently I have some random names there; it would be nice to keep snapshots in folders with a YYYY-MM-DD entry per daily snapshot.
That's an excellent question! From a sysadmin perspective, incremental means capturing changes to files. In Elasticsearch, it means something different.
Think of an index as an odd matryoshka doll, with the outermost layer as the index itself. The next layer inside is the shards of an index. There remains another layer deeper than the shards, however: segments. Segments are the building blocks of shards. With this understanding, hopefully the following will be a sufficient explanation.
As new data comes into Elasticsearch, it is flushed in blocks into segments. A segment is completely immutable. If a document is changed via an update request, the copy in the old segment is marked for deletion, and the newer document (which exists in a different segment) is considered the valid one. Documents marked for deletion are tracked, and are expunged when segments merge. Because the count of segments keeps increasing as new data is added, segment merges happen continuously and automatically behind the scenes; too many segments would otherwise eat up the free memory in the JVM. A merged segment is the combination of the data from one or more smaller segments. The new segment is just as immutable as the older ones, but it is new. This is critical to understanding how snapshots work.
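To make the segment lifecycle concrete, here is a toy Python model of it. This is only an illustration, not how Lucene is implemented: the `Shard` class, the `seg_N` names, and the "merge everything at once" policy are all assumptions made up for this sketch.

```python
from itertools import count


class Shard:
    """Toy shard: a set of immutable segments plus deletion marks."""

    def __init__(self):
        self._ids = count(1)
        self.segments = {}     # segment name -> tuple of (doc_id, value) pairs
        self.deleted = set()   # (segment name, doc_id) pairs marked for deletion

    def flush(self, docs):
        """Write a batch of documents into a brand-new segment."""
        # An update never rewrites an old segment; the old copy is only marked.
        for name, pairs in self.segments.items():
            for doc_id, _ in pairs:
                if doc_id in docs:
                    self.deleted.add((name, doc_id))
        name = f"seg_{next(self._ids)}"
        self.segments[name] = tuple(sorted(docs.items()))
        return name

    def merge(self):
        """Combine every segment into one new segment, expunging marked docs."""
        live = {}
        for name, pairs in self.segments.items():
            for doc_id, value in pairs:
                if (name, doc_id) not in self.deleted:
                    live[doc_id] = value
        merged = f"seg_{next(self._ids)}"
        self.segments = {merged: tuple(sorted(live.items()))}
        self.deleted.clear()
        return merged


shard = Shard()
shard.flush({"a": 1, "b": 2})   # creates seg_1
shard.flush({"a": 10})          # creates seg_2 and marks the old "a" in seg_1
marked = set(shard.deleted)
merged_name = shard.merge()     # creates seg_3; seg_1 and seg_2 are replaced
```

Note that the merge produces a segment with a new name even though document "b" never changed. That detail is what matters for snapshots.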
When a snapshot is initiated by an API call, Elasticsearch checks to see which indices are chosen, and then looks at the repository to see if any data is on the remote side. What does Elasticsearch look for? Segments. When a snapshot is initialized all existing segments in the selected indices are blocked from being deleted. This state persists until the snapshot is completed. Elasticsearch compares all segments already existing in the repository (if any) with the segments from the selected indices. Only segments not found in the repository are copied over. The repository and snapshot metadata track which segments are required to rebuild the selected indices with "pointers." A subsequent incremental snapshot of the exact same indices as the previous snapshot will only copy new segments, and will also contain pointers to the necessary segments which had been copied in the previous snapshot. This is because it is necessary to have those segments as well, in order to completely recreate the index as it was at the time of the snapshot.
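The copy-only-what-is-missing behavior described above can be sketched in a few lines of Python. This is a simplified model under my own assumptions: the `Repository` class and its dictionaries are hypothetical stand-ins for the repository and snapshot metadata, not the real on-disk format.

```python
class Repository:
    """Toy snapshot repository that stores each segment at most once."""

    def __init__(self):
        self.segments = {}    # segment name -> segment data, copied once
        self.snapshots = {}   # snapshot name -> segment names it needs ("pointers")

    def snapshot(self, name, index_segments):
        """Copy only segments not already in the repository.

        index_segments maps segment name -> data for the selected indices.
        Returns the names of the segments that were actually copied.
        """
        copied = sorted(s for s in index_segments if s not in self.segments)
        for seg in copied:
            self.segments[seg] = index_segments[seg]
        # The snapshot points at every segment it needs, old or new.
        self.snapshots[name] = sorted(index_segments)
        return copied

    def restore(self, name):
        """Rebuild the index exactly as it was when the snapshot was taken."""
        return {seg: self.segments[seg] for seg in self.snapshots[name]}


repo = Repository()
first = repo.snapshot("snap-1", {"seg_1": "docs 1-100", "seg_2": "docs 101-200"})
second = repo.snapshot(
    "snap-2",
    {"seg_1": "docs 1-100", "seg_2": "docs 101-200", "seg_3": "docs 201-300"},
)
restored = repo.restore("snap-2")
```

Here `second` only contains `seg_3`, yet restoring `snap-2` still returns all three segments, because the snapshot points at the two that were copied earlier.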
This is how incremental snapshots work: at the segment level. It is important to note the difference between this and assuming that a snapshot works at the data level, like a filesystem backup. As segments merge, the data may not have changed a bit, but the segments are new. This means that there may very likely be data duplication within a repository, even though there will never be segment duplication.
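A small Python sketch of why a merge leads to duplicated data in the repository. The segment names and document ids are made up, and the plain dictionary is only a stand-in for a repository that is compared by segment name.

```python
def copy_new_segments(repo, index_segments):
    """Copy segments missing from the repo; return the names that were copied."""
    new = {name: docs for name, docs in index_segments.items() if name not in repo}
    repo.update(new)
    return sorted(new)


repo = {}

# First snapshot: the index consists of two small segments.
before_merge = {"seg_1": ["doc_a", "doc_b"], "seg_2": ["doc_c"]}
copy_new_segments(repo, before_merge)

# A merge happens. No document changed, but the index is now one new segment.
after_merge = {"seg_3": ["doc_a", "doc_b", "doc_c"]}
copied = copy_new_segments(repo, after_merge)

stored_docs = [doc for docs in repo.values() for doc in docs]
unique_docs = set(stored_docs)
```

After the second snapshot the repository holds three distinct segments (no segment duplication) but six stored documents for only three unique ones (data duplication).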
Ways to limit this level of duplication include using different snapshot naming conventions for different retention schedules, so that you can keep 3 days of hourly snapshots of even your live data, but take only daily snapshots of older, unchanging indices (a logging and metrics use case is presumed for this example). With this, you can continue to have hourly data snapshots, but not keep months' worth of duplicated data due to frequent segment merges. In many cases, Curator has been used with complex action files that force merge old, unchanging indices to a single segment per shard and then snapshot this data for longer-term storage. After the force merge, the number of segments which need comparison at snapshot initialization time will also be considerably smaller.
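As a rough sketch of that force-merge-then-snapshot sequence, the Python below only builds the two REST requests in order and does not send anything. The helper name, the repository name `my_repo`, the index name, and the snapshot name are placeholders I made up; the endpoints are the force merge API (`POST /<index>/_forcemerge` with `max_num_segments`) and the create snapshot API (`PUT /_snapshot/<repository>/<snapshot>`).

```python
def archive_requests(repository, index, snapshot_name):
    """Return (method, path, query params, body) tuples, in execution order."""
    return [
        # 1. Reduce the old, unchanging index to one segment per shard.
        ("POST", f"/{index}/_forcemerge", {"max_num_segments": 1}, None),
        # 2. Snapshot only that index into the long-term snapshot series.
        (
            "PUT",
            f"/_snapshot/{repository}/{snapshot_name}",
            {"wait_for_completion": "true"},
            {"indices": index},
        ),
    ]


# Placeholder names, for illustration only.
requests_to_send = archive_requests(
    "my_repo", "logs-2024.01.15", "archive-logs-2024.01.15"
)
```

Each tuple can then be handed to whatever HTTP client or tool you already use. Curator wraps the same two steps in its action files.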
You should never directly interact with the files in your repository. The only proper way to interface with a snapshot repository is via the API. Those random names are the ones generated by and maintained in the cluster state. Changing them would break things catastrophically.
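If the goal is simply to tell daily snapshots apart by date, one option that stays within the API is to put the date in the snapshot name rather than in the bucket layout. Below is a minimal Python sketch that builds such a create-snapshot request without sending it. The function name, the `daily-` prefix, the repository name `my_s3_repo`, and the `logs-*` pattern are assumptions for illustration.

```python
from datetime import date


def daily_snapshot_request(repository, indices, day):
    """Build a create-snapshot request whose name carries a YYYY-MM-DD date.

    Returns (method, path, body). Snapshot names must be lowercase, which the
    "daily-" prefix and an ISO date satisfy.
    """
    snapshot_name = f"daily-{day.isoformat()}"
    path = f"/_snapshot/{repository}/{snapshot_name}"
    body = {"indices": indices}
    return "PUT", path, body


# Placeholder repository and index pattern, for illustration only.
method, path, body = daily_snapshot_request("my_s3_repo", "logs-*", date(2024, 1, 15))
```

The files in the bucket will still have generated names, but listing the snapshots through the API will then show one dated entry per day.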