I can't find any compression algorithm arguments for HDFS when I use snapshots to back up ES data.
I hope to use gzip on HDFS to reduce the size of my backup data, because the "compress" argument of the ES snapshot seems to be unhelpful: with or without it, the snapshot size equals the raw ES index size.
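For reference, this is roughly how I register the repository and take the snapshot (the URI, path, and names below are placeholders); I can't find any setting here that would apply gzip or another HDFS compression codec:

```
PUT _snapshot/my_hdfs_repo
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch/snapshots",
    "compress": true
  }
}

PUT _snapshot/my_hdfs_repo/snapshot_1?wait_for_completion=true
```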
gzip won't help you much, as the data within the segments is already compressed. The compress flag is also enabled by default and only applies to the metadata.
Have you read the Tune for disk usage documentation, which covers tricks to reduce your overall dataset size?
I tried dumping data from ES to a text file and comparing their sizes, and found that the text file size equals the ES index size.
Then I used Logstash to export data from ES to HDFS (with the compression => "gzip" setting); the gzip file size on HDFS is 12% of the ES index size.
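The Logstash pipeline looked roughly like this (hosts, index, user, and path are placeholders):

```
input {
  elasticsearch {
    hosts => ["localhost:9200"]        # placeholder ES host
    index => "my_index"                # placeholder index name
  }
}
output {
  webhdfs {
    host => "namenode"                 # placeholder HDFS namenode
    port => 50070
    user => "hdfs"                     # placeholder HDFS user
    path => "/backup/my_index.gz"      # single .gz output file
    compression => "gzip"              # gzip the data as it is written to HDFS
  }
}
```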
Then I reindexed the data from an ES index with the default codec into another ES index with the "best_compression" codec; the size of the "best_compression" index is 88% of the default-codec index.
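The reindex test was roughly the following (index names are placeholders; the codec has to be set when the destination index is created):

```
PUT my_index_best_compression
{
  "settings": {
    "index.codec": "best_compression"
  }
}

POST _reindex
{
  "source": { "index": "my_index" },
  "dest":   { "index": "my_index_best_compression" }
}
```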
So I think data in an ES index is organized almost the same as data in a text file, and segment merging and data/metadata compression don't help much. Only gzip or another compression method can really reduce the size.
I prefer to use snapshots; the only issue is the backup data size.
I don't want to compare it with a JSON dump, but with a JSON dump I can find a better way to reduce the size.
I don't understand why snapshots don't support compression like gzip; I think size is very important when doing data backups.
I know how to compress files on shared FS, but now I need to save data on HDFS and I don't know how to compress files on HDFS.
I just know data can be compressed while writing to HDFS by setting compression arguments, like compression => "gzip" in the Logstash webhdfs output plugin. The ES repository-hdfs plugin does not support arguments like that.
I just wanted you to test the size difference between a snapshot that is not compressed by gzip and one that is, to see whether there is a significant reduction, in which case that could be a reasonable feature request.
If the win is minimal, then it's probably not worth implementing such a thing.
The ES index size equals the size of its snapshot if the snapshot is not compressed.
If I use gzip to compress the snapshot, the size is reduced to 15% of the original.
I'm not sure it's worth implementing, but you can open a feature request.
Note that I think you can't compress all files together, but rather one by one (segment by segment, maybe), to see what the exact compression ratio would be.
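Something along these lines on a local copy of the repository could give a rough per-file ratio (paths are placeholders):

```
# copy the snapshot files out of HDFS and measure before/after sizes
hadoop fs -copyToLocal /elasticsearch/snapshots ./snapshots
du -sb ./snapshots                               # total size before compression
find ./snapshots -type f -exec gzip -k {} \;     # compress each file individually
find ./snapshots -name '*.gz' | xargs du -cb | tail -n 1   # total compressed size
```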
BTW did you run your test on indices rebuilt with best_compression codec?
I used Logstash to export data from an ES index (without best_compression) to HDFS.
I used compression => "gzip" in the Logstash webhdfs output plugin, writing the output to one .gz file.
This is the data size in ES:
This is the data size on HDFS (the first with gzip compression, the second without compression):
My question is how to compress snapshot backup data on HDFS.
I know I can use Logstash to export data from ES to HDFS with the gzip argument, or take a snapshot to HDFS and then compress the files on HDFS with the Hadoop API.
But neither of them is graceful.
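The second workaround would look roughly like this, using the hadoop fs CLI rather than the Java API (paths are placeholders), streaming each snapshot blob through gzip and writing the compressed copy back to HDFS:

```
hadoop fs -ls -R /es_snapshots | awk '$1 !~ /^d/ {print $8}' | while read -r f; do
  rel="${f#/es_snapshots}"                                   # path relative to the repository root
  hadoop fs -mkdir -p "/es_snapshots_gz$(dirname "$rel")"    # mirror the directory layout
  hadoop fs -cat "$f" | gzip | hadoop fs -put - "/es_snapshots_gz${rel}.gz"
done
```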
Today there is no graceful way to do what you ask, and I think this is because we don't expect significant savings from gzip compression of snapshots: the data that makes up a snapshot is already compressed.
However I do not know of recent validation of this expectation. If you have evidence that it's worth compressing snapshots further then it may be worth us investigating this idea and maybe adding this feature.
However the results you've shown so far don't help. Your 85% compression ratio is for compressing the JSON source, it seems, and that tells us nothing about the effect of compression on snapshots.