I'm a little confused about the behavior of the snapshotting process and how to configure it to support our use case.
In our use case, we are collecting time-based log events from a variety of different servers that we host, and we need to retain this log data permanently in order to report (and possibly reproduce) usage statistics on our website. The log data for each server is indexed into a separate index, and each index is rolled over when it reaches a specific size or age limit. Thus, we have a growing list of indices that looks as follows:
load-balancer-A-000001, -000002, etc.
load-balancer-B-000001, -000002, etc.
apache-A-000001, -000002, etc.
...and many, many more...
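For reference, the rollover setup described above could be expressed as an ILM policy along these lines (the policy name and the exact thresholds here are illustrative, not our actual values):

```
PUT _ilm/policy/logs-rollover-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "30gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}
```

Each index rolls over to the next `-00000N` suffix as soon as either condition is met.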
From my understanding of the documentation (and other discussions here), each incremental snapshot only includes references to the indices that changed since the last snapshot. Further, we must specify the "Number of snapshots to retain," for which the upper limit appears to be 200.
Given that all of our indices have already rolled over at least once, if we were to begin taking daily snapshots starting tomorrow, does this mean that after 201 days we will lose all of the data for indices prior to their first rollover (e.g., "load-balancer-A-000001"), since they didn't change between the first and second snapshots? If so, how do we get around this limitation?
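For concreteness, the daily snapshot setup I'm describing would be something like the following SLM policy (the policy name, repository name, and schedule are made up; `max_count` is where the 200-snapshot limit I mentioned comes in):

```
PUT _slm/policy/daily-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my_repository",
  "config": {
    "indices": "*",
    "include_global_state": false
  },
  "retention": {
    "expire_after": "200d",
    "min_count": 5,
    "max_count": 200
  }
}
```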
Before getting to the actual question, I would like to say that this sounds like a potentially problematic way of indexing data. Be aware that having lots of small indices and shards is very inefficient and is likely to cause performance and possibly stability problems down the line. I would recommend you read this blog post about sharding and reconsider splitting your indices per host. Aim for an average shard size of at least a few GB.
Snapshots are not strictly incremental: each snapshot contains all the data in the cluster at the time the snapshot is taken. For efficiency, however, not all segments are copied for each snapshot; segments that have not changed since the last snapshot, e.g. for indices that are no longer being indexed into, are reused and marked as in use by multiple snapshots. When you delete an old snapshot, only segments not in use by any other snapshot are deleted, so you will never end up with partial indices. This blog post is very old but still describes what goes on behind the scenes quite well. Even though parts of it may be out of date, I believe the core principles are still accurate.
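To illustrate: if you take two snapshots back to back and then inspect the second one, it will still list every index in the cluster, even though the segments for unchanged indices were only uploaded once and are shared between the two snapshots (repository and snapshot names here are just examples):

```
PUT _snapshot/my_repository/snapshot-1?wait_for_completion=true

PUT _snapshot/my_repository/snapshot-2?wait_for_completion=true

GET _snapshot/my_repository/snapshot-2
```

The response to the final request includes the full list of indices covered by `snapshot-2`, regardless of how many segments were physically copied for it.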
Thanks @Christian_Dahlqvist for the helpful response and links to useful articles.
First I want to explain that we aren't actually indexing each unique host into its own index, but rather each one of our microservices gets its own index, into which all of the hosts for that microservice send their documents. It just naturally works out that each microservice can be described by the load balancer on top of it, hence our index naming scheme. Also, we are allowing each index to grow to 30GB/shard before rolling over.
Your explanation and that article definitely help me understand the process better. Basically, even though old indices might not change from snapshot to snapshot, some reference to those indices will remain intact in at least one snapshot, no matter how many snapshotting cycles go by. Is that correct?
The question I now have is, if we were to delete an old index and continue the snapshotting process, eventually that old index would disappear from our snapshots, right? Given this, what's the recommended approach for retaining a copy of that old index, in case we need to restore it later?
The case I have in mind for wanting to do this is that our indices do grow larger and larger every year (e.g., more than 1.5 billion documents per year), but we rarely need to search them after a year or two. So, we'd like to have a backup copy of them that we could restore if needed, but don't necessarily need the index to stay sitting on our cluster 10 years from now. How would we achieve this in light of the fact that the snapshot process would eventually lose reference to that index once deleted?
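To make the question concrete, the kind of one-off archival backup I'm imagining would be something like a manually taken snapshot of just the old index, in a repository of its own, before deleting the index from the cluster (repository and snapshot names here are hypothetical, and I'm assuming a manual snapshot like this isn't touched by SLM retention):

```
PUT _snapshot/archive_repository/load-balancer-A-000001-archive?wait_for_completion=true
{
  "indices": "load-balancer-A-000001",
  "include_global_state": false
}
```

The idea being that, if we ever needed the data again, we could restore that single snapshot back into the cluster with `POST _snapshot/archive_repository/load-balancer-A-000001-archive/_restore`. Is that the right approach, or is there a better one?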