I am trying to find a solution that lets me keep roughly 90 days of live data on my cluster and archive anything older than 90 days for up to a year. This is a compliance requirement.
I am currently looking into Curator as an automated solution, but the Snapshot function only does metadata, and I am looking for something that will also compress my data. Is there a function of Elasticsearch that I am missing?
Snapshots are full copies of all of the segments within the indices/shards you tell the API you want to have in the snapshot. Compression of data occurs at the segment-level, so technically, your data is already compressed, but in the exact same format it exists in on your nodes.
If you want to compress the data further, you would have to use other means, such as a compressed filesystem (ZFS, perhaps, though it won't shrink the LZ4 or DEFLATE segments much more), or exporting and compressing the raw JSON data.
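To illustrate the "export and compress the raw JSON" route, here is a minimal sketch that writes documents as gzip-compressed newline-delimited JSON. The function name `write_gzip_ndjson` and the output path are hypothetical; the documents are assumed to come from whatever export you already have (for example a scroll over the index), which is not shown here.

```python
import gzip
import json
from typing import Iterable, Mapping


def write_gzip_ndjson(docs: Iterable[Mapping], path: str) -> int:
    """Write each document as one JSON line into a gzip file; return the count."""
    count = 0
    # "wt" opens the gzip stream in text mode so str can be written directly.
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for doc in docs:
            # Compact separators avoid storing needless whitespace.
            fh.write(json.dumps(doc, separators=(",", ":")))
            fh.write("\n")
            count += 1
    return count


if __name__ == "__main__":
    sample = [{"@timestamp": "2018-01-01T00:00:00", "message": "hello"}]
    print(write_gzip_ndjson(sample, "index_123.ndjson.gz"))
```

Keep in mind that a file like this is not a snapshot: to look at the data again you would have to re-index it, mappings included.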
Sorry, I realize looking back that this was confusing. According to the Elasticsearch documentation on Snapshot, the repository is created with compression enabled, but that compression applies only to metadata files (index mappings and settings).
So, the compressing of data further is what I am looking for. I essentially have an index, for example, that is 10GB of data. After we hit our 90 day period, we will likely not need to look at that data again, ever, but due to compliance reasons, we need to keep it in case someone does want to look at it.
Would it be possible for me to just use something like gzip on those files?
Another question that just occurred to me: when creating the snapshots, is there a way to have the index folder that is created be renamed to whatever its alias is?
For example, if I have a folder with the name leCqaJuESyCYg_VL7Xl3og but the English version of the index is index_123, is there a way, during the snapshot process, to give that folder the easier-to-read name?
As stated, a directory like leCqaJuESyCYg_VL7Xl3og corresponds to an index, and the files in that directory are index data/segments. This data is already compressed using at least LZ4, or DEFLATE if you enabled best_compression, so compressing it further may not yield the benefits you are hoping for, while incurring considerable extra cost and effort. You cannot rename or alter these names in any way, or you will invalidate your snapshot. The cluster metadata requires that these remain completely consistent. As such, you cannot just gzip the directory.
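To see why a second compression pass tends to buy very little, here is a rough sketch using only Python's standard library. It is an illustration under assumptions, not a measurement of Elasticsearch itself: `zlib` (DEFLATE) stands in for the segment-level compression, and the fake log lines are made up for the example.

```python
import gzip
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
WORDS = ["login", "logout", "timeout", "request", "served", "failed",
         "retry", "cache", "miss", "hit", "user", "session"]
LEVELS = ["DEBUG", "INFO", "WARN", "ERROR"]


def fake_log_line() -> bytes:
    """Build one made-up JSON log line (stand-in for a stored document)."""
    line = '{"ts":%d,"level":"%s","user":%d,"msg":"%s"}\n' % (
        rng.randrange(10**9),
        rng.choice(LEVELS),
        rng.randrange(10000),
        " ".join(rng.choice(WORDS) for _ in range(8)),
    )
    return line.encode("utf-8")


raw = b"".join(fake_log_line() for _ in range(5000))

# First pass: roughly what already happened before the data reached disk.
first_pass = zlib.compress(raw, 9)
# Second pass: what gzip-ing the already-compressed bytes would add.
second_pass = gzip.compress(first_pass, compresslevel=9)

first_ratio = len(first_pass) / len(raw)
second_ratio = len(second_pass) / len(first_pass)
print(f"first pass:  {first_ratio:.2f} of original size")
print(f"second pass: {second_ratio:.2f} of first-pass size")
```

The first ratio should come out well below 1, while the second stays close to 1, which is the point being made above about gzip on segment files.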
Some people have taken to putting a full snapshot into a dedicated S3 repository/bucket path, and then putting that full path into Glacier for long-term storage, which is less expensive and can persist for a long time. That may be an option for you, as you can have multiple snapshot repositories, and simply choose which one to snapshot into at snapshot time.
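As a sketch of what "choose which repository at snapshot time" looks like against the snapshot REST API, the helpers below only build the requests; `send` is one possible way to issue them. The repository name, bucket, base path, snapshot name, and cluster URL are all hypothetical, and the `s3` repository type assumes S3 repository support is installed on your cluster.

```python
import json
import urllib.request


def register_repo_request(repo: str, repo_type: str, settings: dict):
    """Build the request that registers a snapshot repository."""
    body = {"type": repo_type, "settings": settings}
    return "PUT", f"/_snapshot/{repo}", body


def create_snapshot_request(repo: str, snapshot: str, indices: str):
    """Build the request that snapshots the given indices into one repository."""
    body = {"indices": indices, "include_global_state": False}
    path = f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true"
    return "PUT", path, body


def send(base_url: str, method: str, path: str, body: dict) -> dict:
    """Send one request to the cluster and return the decoded JSON reply."""
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    base = "http://localhost:9200"  # assumed cluster address
    # Hypothetical dedicated archive repository, separate from the regular one.
    send(base, *register_repo_request(
        "archive_repo", "s3",
        {"bucket": "my-archive-bucket", "base_path": "archive/index_123"},
    ))
    # The repository is simply part of the snapshot URL.
    send(base, *create_snapshot_request(
        "archive_repo", "snap_index_123", "index_123"))
```

Because the repository name is part of the request path, the regular and archive snapshots can go to different places without any other change.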
The same is true for a local/NFS snapshot. You could create a clean, new snapshot repository in a dedicated filesystem path, and then tar/gzip that path. You won't get much better compression, but it would be something. You would then not want to reuse that path for any other snapshots, and you would probably want to remove the repository from the cluster altogether.
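The tar/gzip step could be sketched as follows. This assumes the snapshot has already completed and nothing is writing to the repository any more; the function name and the paths are hypothetical.

```python
import os
import tarfile


def archive_repository(repo_path: str, archive_path: str) -> str:
    """Pack an entire snapshot repository directory into one .tar.gz file.

    The directory tree is stored unchanged, so the generated folder names
    inside the repository stay exactly as Elasticsearch wrote them.
    """
    repo_path = os.path.abspath(repo_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the repository folder name inside the archive.
        tar.add(repo_path, arcname=os.path.basename(repo_path))
    return archive_path


if __name__ == "__main__":
    # Write the archive outside the repository path itself.
    archive_repository("/mnt/snapshots/archive_2018q1",
                       "/mnt/archive/archive_2018q1.tar.gz")
```

To restore later, you would unpack the archive to a path the cluster can read, register that path as a repository again, and restore from it as usual.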