JSON format too bloated

We store fairly large JSON documents in Elasticsearch and will be adding tens of millions of documents every month.

We found that storing 30 million documents eats up 2 TB of disk space.

Since the format is JSON, there is a lot of repetition across documents (every key repeats in every document), so we started shortening the keys and were able to reduce the size to 300 GB.

That's good, but I feel it's still way too much. Without the JSON boilerplate we could easily reduce it by another factor of 10.

On the other hand, I feel that messing around with JSON keys and turning something human-readable like 'baseCurrencyAmount' into something barely readable like 'baCA' is generally the wrong direction.
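To illustrate (baseCurrencyAmount is a real field of ours; the other names and values are made up for the example), a document goes from something like this:

```json
{
  "baseCurrencyAmount": 1999.50,
  "baseCurrencyCode": "EUR",
  "transactionTimestamp": "2023-05-01T12:00:00Z"
}
```

to something like this after shortening the keys:

```json
{
  "baCA": 1999.50,
  "baCC": "EUR",
  "trTS": "2023-05-01T12:00:00Z"
}
```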

Is there any hint you can give?

And a side question: it seems that Elasticsearch compresses each document separately. If it compressed documents jointly, my guess is that the key lengths would no longer play a role.

Welcome to our community! :smiley:

This is really something you cannot escape; there is a cost involved no matter which way you approach it. The cost of increasing compression, for example, is higher CPU use.
Ultimately you need to figure out which of those costs is the most efficient for you to pay.

I don't know enough about the mechanics of compression in Elasticsearch to comment on your last question, sorry. But make sure you are using best_compression and that you are force merging, to optimise resources on that front.
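As a rough sketch (the index name is just a placeholder), the codec is set in the index settings when the index is created, and force merging is a separate API call:

```
PUT my-index
{
  "settings": {
    "index.codec": "best_compression"
  }
}

POST my-index/_forcemerge?max_num_segments=1
```

Note that index.codec can only be set at index creation time (or via a template), and force merging down to a single segment is best done on indices that are no longer being written to.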

Also, have you optimised your mapping, or are you using dynamic mappings?
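For example, an explicit mapping in an index template lets you control the field types instead of relying on the defaults that dynamic mapping picks. This is only a sketch using the composable index template API; the template name and fields are illustrative, with baseCurrencyAmount borrowed from your post:

```
PUT _index_template/my-docs
{
  "index_patterns": ["my-docs-*"],
  "template": {
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "baseCurrencyAmount": { "type": "scaled_float", "scaling_factor": 100 },
        "baseCurrencyCode":   { "type": "keyword" }
      }
    }
  }
}
```

Dynamic mapping, for instance, maps string values to both text and keyword by default, which can noticeably increase index size.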


We are using templates (if this is what you mean).

Yeah, so you're defining your mapping ahead of time.
