Can I use gzip or some other compression algorithm when I create snapshots to HDFS?

This is the documentation for repository-hdfs:
https://www.elastic.co/guide/en/elasticsearch/plugins/master/repository-hdfs-config.html

I can't find any compression-algorithm settings for HDFS when I use snapshots to back up ES data.
I would like to use gzip on HDFS to reduce the size of my backup data, because the "compress" argument of the ES snapshot seems to be unhelpful. With or without it, the snapshot size equals the raw ES index size.
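For reference, this is roughly how the repository is registered (a sketch; the repository name, namenode URI and path are placeholders), and compress is the only compression-related setting I can find:

# Sketch only: registering an HDFS snapshot repository with the "compress" flag.
curl -X PUT "localhost:9200/_snapshot/my_hdfs_repo" -H 'Content-Type: application/json' -d'
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch/snapshots",
    "compress": true
  }
}'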

gzip won't help you much, as the data within the segments is already compressed. The compress flag is also enabled by default and only applies to the metadata.

Have you read the Tune for disk usage documentation, which covers tricks to reduce your overall dataset size?
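For example (just a sketch; "my_index" is a placeholder), one of the tricks from that page is force-merging a read-only index down to a single segment:

# Force merge a read-only index down to a single segment:
curl -X POST "localhost:9200/my_index/_forcemerge?max_num_segments=1"

# Check the resulting segment count and sizes:
curl "localhost:9200/_cat/segments/my_index?v"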

--Alex

I tried dumping data from ES to a text file and comparing the sizes, and found that the text file size equals the ES index size.
Then I tried using Logstash to import data from ES to HDFS (with the compression => "gzip" setting); the gzip file size in HDFS is 12% of the ES index size.
Then I tried reindexing the data from one ES index with the default codec setting into another ES index with the "best_compression" codec setting; the "best_compression" index is 88% of the size of the default-setting index.
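Roughly what the codec test looked like (a sketch; index names are placeholders, and index.codec can only be set when the target index is created):

# Create the target index with the best_compression codec:
curl -X PUT "localhost:9200/target_index" -H 'Content-Type: application/json' -d'
{
  "settings": { "index.codec": "best_compression" }
}'

# Copy the data over with the reindex API:
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "source_index" },
  "dest":   { "index": "target_index" }
}'

# Compare the store sizes of the two indices:
curl "localhost:9200/_cat/indices/source_index,target_index?v&h=index,store.size"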

So I think data in an ES index is organized almost the same way as data in a text file, and segment merging and data/metadata compression don't help much. Only gzip or another compression method can really reduce the size.

A backup (snapshot) and a JSON dump are definitely not the same thing.
You can't compare the two.

With a snapshot, you can restore it and it will be ready to use, whereas a JSON dump requires reindexing everything.

Not the same data structures.

I prefer to use snapshots; the only issue is the backup data size.
I don't want to compare them with a JSON dump, but with a JSON dump I can find a better way to reduce the size.
I don't understand why snapshots don't support compression like gzip; I think size is very important when doing data backups.

Did you try to gzip a snapshot?

I don't know how to do that when I use repository-hdfs to snapshot to HDFS.

Snapshot to a shared FS instead, then gzip the directory.
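Something along these lines, as a sketch (the repository name, mount path and snapshot name are placeholders, and the location has to be listed under path.repo in elasticsearch.yml):

# Register a shared-filesystem repository and take a snapshot into it:
curl -X PUT "localhost:9200/_snapshot/my_fs_repo" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}'
curl -X PUT "localhost:9200/_snapshot/my_fs_repo/snapshot_1?wait_for_completion=true"

# Then gzip the whole repository directory and compare sizes:
du -s /mnt/es_backups
tar czf /tmp/snapshot_1.tar.gz -C /mnt es_backups
du -s /tmp/snapshot_1.tar.gz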

I know how to compress files on a shared FS, but I need to save data on HDFS and I don't know how to compress files there.
I just know data can be compressed while writing to HDFS by setting compression arguments, like compression => "gzip" in the Logstash webhdfs output plugin. The ES repository-hdfs plugin does not support arguments like that.
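For reference, the Logstash pipeline I mean looks roughly like this (a sketch; the ES host, index name, HDFS host, path and user are placeholders):

# Sketch of the ES -> HDFS pipeline; the "compression" option is the relevant part.
cat > es_to_hdfs.conf <<'EOF'
input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my_index"
  }
}
output {
  webhdfs {
    host        => "namenode.example.com"
    port        => 50070
    path        => "/backups/my_index.json.gz"
    user        => "hdfs"
    compression => "gzip"
  }
}
EOF
bin/logstash -f es_to_hdfs.conf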

I just wanted you to test the size difference between a snapshot not compressed with gzip and a snapshot compressed with gzip, to see whether there is a significant reduction, in which case this could be an acceptable feature request.
If the win is minimal, then it's probably not worth implementing such a thing.

The ES index size equals the size of its snapshot if the snapshot is not compressed.
If I use gzip to compress the snapshot, the size is reduced to 15% of the original.

I'm not sure it's worth implementing, but you can open a feature request.

Note that I think you should compress the files not all together but one by one (segment by segment maybe) to see what the exact compression ratio would be.

BTW, did you run your test on indices rebuilt with the best_compression codec?

The best_compression codec setting reduces the index size to 88%, which is still very large.

So you are saying that you did the following:

  • Created an index with best_compression
  • Indexed the documents
  • Merged the segments down to a single one
  • Ran a snapshot to a shared FS
  • Compressed each individual file one by one with gzip

And you went from xxx GB to xxx * 0.15 GB. Is that exactly what you did?
Could you share the output of du -s in both cases?
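For the last two steps and the size comparison, something like this (a sketch; the repository path and snapshot name are placeholders):

# Take the snapshot into the shared-FS repository registered earlier:
curl -X PUT "localhost:9200/_snapshot/my_fs_repo/snapshot_1?wait_for_completion=true"

# Size of the uncompressed snapshot:
du -s /mnt/es_backups

# Copy the repository aside and gzip every file individually:
cp -r /mnt/es_backups /tmp/es_backups_gz
find /tmp/es_backups_gz -type f -exec gzip {} \;

# Size after per-file gzip:
du -s /tmp/es_backups_gz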

I used Logstash to import data from an ES index (without best_compression) to HDFS,
using compression => "gzip" in the Logstash webhdfs output plugin, writing the output to one .gz file.

This is the data size in ES:
[screenshot]

This is the data size on HDFS (the first with gzip compression, the second without compression):

But the question is how Elasticsearch shards will be compressed, not how the source will be compressed.

My question is how to compress snapshot backup data on HDFS.
I know I can use Logstash to import data from ES to HDFS with the gzip argument, or take a snapshot to HDFS and then compress the files on HDFS with the Hadoop API.
But neither of them is graceful.
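The second, snapshot-then-compress option is roughly this (a sketch using the HDFS CLI rather than the Java API; all paths and file names are placeholders):

# Pull a snapshot file out of HDFS, gzip it locally, and push it to an archive path:
hdfs dfs -get /es_snapshots/my_repo/some-snapshot-file /tmp/some-snapshot-file
gzip /tmp/some-snapshot-file
hdfs dfs -put /tmp/some-snapshot-file.gz /es_snapshots_archive/

And the resulting .gz files presumably can't be restored by Elasticsearch without decompressing them first, so it only works as a cold archive.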

Today there is no graceful way to do what you ask, and I think this is because we don't expect significant savings from gzip compression of snapshots: the data that makes up a snapshot is already compressed.

However I do not know of recent validation of this expectation. If you have evidence that it's worth compressing snapshots further then it may be worth us investigating this idea and maybe adding this feature.

However the results you've shown so far don't help. Your 85% compression ratio is for compressing the JSON source, it seems, and that tells us nothing about the effect of compression on snapshots.


But my test result is: index size = snapshot size = size of the JSON text file dumped by elasticdump.