Can I use gzip or some other compression algorithm when I create snapshots to HDFS?

This is the documentation for repository-hdfs:
https://www.elastic.co/guide/en/elasticsearch/plugins/master/repository-hdfs-config.html

I can't find any compression-algorithm settings for HDFS when I use snapshots to back up ES data.
I would like to use gzip on HDFS to reduce the size of my backup data, because the "compress" argument of the ES snapshot seems to be unhelpful. With or without it, the snapshot size equals the raw ES index size.
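For reference, this is roughly how the repository is registered (a sketch; the repository name, namenode URI and path are placeholders), and compress is the only compression-related setting I can find:

# Sketch only: registering an HDFS snapshot repository with the "compress" flag.
curl -X PUT "localhost:9200/_snapshot/my_hdfs_repo" -H 'Content-Type: application/json' -d'
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch/snapshots",
    "compress": true
  }
}'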

gzip won't help you much, as the data within the segments is already compressed. The compress flag is also enabled by default and only applies to the metadata.

Have you read the Tune for disk usage documentation, which covers tricks to reduce your overall dataset size?
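For example (just a sketch; "my_index" is a placeholder), one of the tricks from that page is force-merging a read-only index down to a single segment:

# Force merge a read-only index down to a single segment:
curl -X POST "localhost:9200/my_index/_forcemerge?max_num_segments=1"

# Check the resulting segment count and sizes:
curl "localhost:9200/_cat/segments/my_index?v"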

--Alex

I tried dumping data from ES to a text file and comparing the sizes, and found that the text file size equals the ES index size.
Then I tried using Logstash to import data from ES to HDFS (with the compression => "gzip" setting); the gzip file size in HDFS is 12% of the ES index size.
Then I tried reindexing the data from one ES index with the default codec setting into another ES index with the "best_compression" codec setting; the "best_compression" index is 88% of the size of the default-setting index.
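Roughly what the codec test looked like (a sketch; index names are placeholders, and index.codec can only be set when the target index is created):

# Create the target index with the best_compression codec:
curl -X PUT "localhost:9200/target_index" -H 'Content-Type: application/json' -d'
{
  "settings": { "index.codec": "best_compression" }
}'

# Copy the data over with the reindex API:
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "source_index" },
  "dest":   { "index": "target_index" }
}'

# Compare the store sizes of the two indices:
curl "localhost:9200/_cat/indices/source_index,target_index?v&h=index,store.size"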

So I think data in an ES index is organized almost the same way as data in a text file, and segment merging and data/metadata compression don't help much. Only gzip or another compression method can really reduce the size.

A backup (snapshot) and a JSON dump are definitely not the same thing.
You can't compare the two.

With a snapshot, you can restore it and it will be ready to use, whereas a JSON dump requires reindexing everything.

Not the same data structures.

I prefer to use snapshots; the only issue is the backup data size.
I don't want to compare them with a JSON dump, but with a JSON dump I can find a better way to reduce the size.
I don't understand why snapshots don't support compression like gzip; I think size is very important when doing data backups.

Did you try to gzip a snapshot?

I don't know how to do that when I use repository-hdfs to snapshot to HDFS.

Snapshot to a shared FS instead, then gzip the directory.
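Something along these lines, as a sketch (the repository name, mount path and snapshot name are placeholders, and the location has to be listed under path.repo in elasticsearch.yml):

# Register a shared-filesystem repository and take a snapshot into it:
curl -X PUT "localhost:9200/_snapshot/my_fs_repo" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}'
curl -X PUT "localhost:9200/_snapshot/my_fs_repo/snapshot_1?wait_for_completion=true"

# Then gzip the whole repository directory and compare sizes:
du -s /mnt/es_backups
tar czf /tmp/snapshot_1.tar.gz -C /mnt es_backups
du -s /tmp/snapshot_1.tar.gz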

I know how to compress files on a shared FS, but I need to save data on HDFS and I don't know how to compress files there.
I just know data can be compressed while writing to HDFS by setting compression arguments, like compression => "gzip" in the Logstash webhdfs output plugin. The ES repository-hdfs plugin does not support arguments like that.
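For reference, the Logstash pipeline I mean looks roughly like this (a sketch; the ES host, index name, HDFS host, path and user are placeholders):

# Sketch of the ES -> HDFS pipeline; the "compression" option is the relevant part.
cat > es_to_hdfs.conf <<'EOF'
input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my_index"
  }
}
output {
  webhdfs {
    host        => "namenode.example.com"
    port        => 50070
    path        => "/backups/my_index.json.gz"
    user        => "hdfs"
    compression => "gzip"
  }
}
EOF
bin/logstash -f es_to_hdfs.conf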

I just wanted you to test the size difference between a snapshot not compressed with gzip and a snapshot compressed with gzip, to see whether there is a significant reduction, in which case this could be an acceptable feature request.
If the win is minimal, then it's probably not worth implementing such a thing.

The ES index size equals the size of its snapshot if the snapshot is not compressed.
If I use gzip to compress the snapshot, the size is reduced to 15% of the original.

I'm not sure it's worth implementing, but you can open a feature request.

Note that I think you should compress the files not all together but one by one (segment by segment maybe) to see what the exact compression ratio would be.

BTW, did you run your test on indices rebuilt with the best_compression codec?

The best_compression codec setting reduces the index size to 88%, which is still very large.

So you are saying that you did the following:

  • Created an index with best_compression
  • Indexed the documents
  • Merged the segments down to a single one
  • Ran a snapshot to a shared FS
  • Compressed each individual file one by one with gzip

And you went from xxx GB to xxx * 0.15 GB. Is that exactly what you did?
Could you share the output of du -s in both cases?
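For the last two steps and the size comparison, something like this (a sketch; the repository path and snapshot name are placeholders):

# Take the snapshot into the shared-FS repository registered earlier:
curl -X PUT "localhost:9200/_snapshot/my_fs_repo/snapshot_1?wait_for_completion=true"

# Size of the uncompressed snapshot:
du -s /mnt/es_backups

# Copy the repository aside and gzip every file individually:
cp -r /mnt/es_backups /tmp/es_backups_gz
find /tmp/es_backups_gz -type f -exec gzip {} \;

# Size after per-file gzip:
du -s /tmp/es_backups_gz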

I used Logstash to import data from an ES index (without best_compression) to HDFS,
using compression => "gzip" in the Logstash webhdfs output plugin, writing the output to one .gz file.

This is the data size in ES:
[screenshot]

This is the data size on HDFS (the first with gzip compression, the second without compression):

But the question is how Elasticsearch shards will be compressed, not how the source will be compressed.

My question is how to compress snapshot backup data on HDFS.
I know I can use Logstash to import data from ES to HDFS with the gzip argument, or take a snapshot to HDFS and then compress the files on HDFS with the Hadoop API.
But neither of them is graceful.
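The second, snapshot-then-compress option is roughly this (a sketch using the HDFS CLI rather than the Java API; all paths and file names are placeholders):

# Pull a snapshot file out of HDFS, gzip it locally, and push it to an archive path:
hdfs dfs -get /es_snapshots/my_repo/some-snapshot-file /tmp/some-snapshot-file
gzip /tmp/some-snapshot-file
hdfs dfs -put /tmp/some-snapshot-file.gz /es_snapshots_archive/

And the resulting .gz files presumably can't be restored by Elasticsearch without decompressing them first, so it only works as a cold archive.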

Today there is no graceful way to do what you ask, and I think this is because we don't expect significant savings from gzip compression of snapshots: the data that makes up a snapshot is already compressed.

However I do not know of recent validation of this expectation. If you have evidence that it's worth compressing snapshots further then it may be worth us investigating this idea and maybe adding this feature.

However the results you've shown so far don't help. Your 85% compression ratio is for compressing the JSON source, it seems, and that tells us nothing about the effect of compression on snapshots.


But my test result is: index size = snapshot size = size of the JSON text file dumped by elasticdump.