Raw logs to data compression ratio

michaelv · May 24, 2022, 2:43am

Hi All,

I've been investigating some compression ratios of my live logs.

Two main categories: auditd logs and other logs like /var/log/messages etc.
Both are ingested via Filebeat 7.11 into a 7.11 cluster.

Here is what I've found (for an average day, averaged over a 7-day period):
Auditd raw logs 3.97G
Index size 13.27G
Ratio about 3.34

Other raw logs 3.61G
Index size 10G
Ratio 2.77

These seem like abnormally high ratios of index size to raw log size. In many places, people talk about 30%-100% of raw log size; I'm getting around 200% of raw log size.

Question:

Can the overhead be reduced? (Overhead being the additional data like Filebeat version, source IP, source hostname, etc.)
I assume it's already using DEFLATE as the compression mode since it's 7.11. Are there ways to improve compression ratios?

Regards,

Michael

warkolm · May 24, 2022, 2:44am

What do the mappings for these look like, have you spent time working on them?

michaelv · May 24, 2022, 2:57am

Hi Mark,

I'm using the standard Filebeat index. /var/log/messages goes into the standard filebeat-7.* index.
The auditd logs go into filebeat-auditd-7.*, which otherwise uses the same Filebeat defaults.

Regards,

Michael

michaelv · May 24, 2022, 2:59am

I've sort of inherited support of this. I'm not really sure whether any testing was done to optimize it.

I understand you can choose to index or not index certain fields, though I'm not sure that will help. It's not like these logs have a huge range of field types.

warkolm · May 24, 2022, 3:02am

The default mappings are not super efficient, so you should really take a look at them.

michaelv · May 24, 2022, 3:19am

This implementation has been running for more than a year. Can index mappings be changed on live indexes, or must the index be trashed and recreated?

Also, best_compression cannot be set on running indexes. Is it possible to set it on the next index created in the sequence?

warkolm · May 24, 2022, 3:23am

Take a look at index templates (or composable templates, as they are known now). Basically the idea is that you set the template for the next set of indices created by the Beats, and it defines the mappings and compression method.

Saves manually doing it.
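As a rough illustration of the template approach described above, here is a minimal Python sketch that builds a settings-only template body. It assumes a legacy-style template (sent with `PUT _template/<name>`) layered on top of the Beats-managed template through a higher `order`, so that only the codec is added and the existing mappings still apply. The function name, the pattern argument, and the `order` value of 10 are illustrative assumptions, not something stated in this thread. Composable templates (`PUT _index_template/<name>`) use a different body shape (a `template` wrapper and `priority`) and take precedence over legacy templates instead of merging with them, so check which kind your Filebeat setup installed before copying this.

```python
import json


def codec_template_body(index_pattern: str, order: int = 10) -> dict:
    """Build a legacy index template body that only sets the codec.

    Assumption: `order` is higher than the order of the Beats-managed
    template, so this setting is merged on top of it for new indices.
    """
    return {
        "index_patterns": [index_pattern],
        "order": order,
        "settings": {"index": {"codec": "best_compression"}},
    }


# Pattern taken from the thread; the request itself is not sent here.
body = codec_template_body("filebeat-auditd-7.*")
print(json.dumps(body, indent=2))
```

The setting only affects indices created after the template exists; indices that are already open keep their current codec.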

michaelv · May 24, 2022, 3:47am

Okay I see what you mean.
If I put "index.codec: best_compression" in elasticsearch.yml, as recommended in "Part 2.0: The true story behind Elasticsearch storage requirements" (Elastic Blog), would that affect all newly created indexes?

Regards,

Michael

warkolm · May 24, 2022, 3:49am

If it's in an appropriate template, yes.

stephenb · May 24, 2022, 4:47am

Also @michaelv, most people mean raw logs to primary storage when referring to the ratio, not total index size, which includes replicas. How many replicas do you have?

BTW, that is a pretty old blog; there is a slightly newer one here

Here are some of the latest

michaelv · May 24, 2022, 5:17am

Hi Stephen,

1 replica only. The blog post was given by ELK support staff.
I'll read the newer blog post and see what it's about.

Regards,

Michael

michaelv · May 24, 2022, 6:25am

One odd question.

Using index.codec: best_compression
I thought 7.10 by default uses deflate as the compression. Why would you need to hardcode best_compression?

I've tried modifying the template, and yes, the new index picks up the best_compression setting.

Regards,

Michael

Christian_Dahlqvist · May 24, 2022, 6:43am

No, it does not.

Best compression is not the default as it adds a significant amount of overhead at indexing time.

michaelv · May 24, 2022, 8:27am

Thanks Christian. They are importing something like 50 million records per day, so I'm not sure having a higher overhead is a good idea.

I've read about using warm nodes to apply higher compression. However, they only have 3 nodes, so that is not really an option, as the same nodes would hold the hot indices.

Christian_Dahlqvist · May 24, 2022, 9:19am

For lower data volumes the overhead of the best compression codec may be irrelevant.

stephenb · May 25, 2022, 8:07pm

50M records / day

~600 Records / sec Avg
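The average rate quoted above follows directly from the daily volume. A minimal sketch of that arithmetic, assuming ingestion is spread evenly over the day (real traffic will have peaks well above the average):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

records_per_day = 50_000_000  # figure given in the thread
avg_records_per_sec = records_per_day / SECONDS_PER_DAY

print(f"{avg_records_per_sec:.0f} records/sec on average")  # prints 579
```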

That is certainly in the realm where best compression has value. I have use cases at 10-20x+ that volume using best compression.

Basically it is a TCO and performance equation.

You might need to trade / add a couple % of CPU etc. for greater storage savings.

I would say test and measure.

So what I was kind of saying before: if your numbers are for total storage, note that most people refer to primary storage when they speak of ratio.

Auditd raw logs 3.97G
Index size 13.27G <--- so primary is about 6.6GB
Ratio about 3.34 <--- so ratio is about 1.7

Other raw logs 3.61G
Index size 10G <--- so primary is about 5GB
Ratio 2.77 <--- so ratio is about 1.4

Not that this suddenly makes your use case better or reduces overall storage, but that is the math most folks mean when talking about ratios.

That said, perhaps tuning your mappings and using best_compression might make a meaningful difference.
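The primary-storage math can be sketched in a few lines of Python. This assumes each replica is a full copy of the primary, so total size is roughly primary size times (1 + replicas); in practice primaries and replicas can differ slightly in size, so treat the results as estimates. The function name is illustrative; the figures are the ones posted in this thread.

```python
def primary_ratio(raw_gb: float, total_index_gb: float, replicas: int):
    """Estimate primary size and the raw-to-primary ratio.

    Assumes total index size is split evenly across the primary
    and each replica copy.
    """
    primary_gb = total_index_gb / (1 + replicas)
    return primary_gb, primary_gb / raw_gb


# (name, raw logs in GB, total index size in GB) from the thread
datasets = [("auditd", 3.97, 13.27), ("other", 3.61, 10.0)]

for name, raw_gb, total_gb in datasets:
    primary_gb, ratio = primary_ratio(raw_gb, total_gb, replicas=1)
    print(f"{name}: primary ~{primary_gb:.1f} GB, ratio ~{ratio:.1f}")
# auditd: primary ~6.6 GB, ratio ~1.7
# other: primary ~5.0 GB, ratio ~1.4
```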

michaelv · May 26, 2022, 9:23am

Thanks Stephen!

I'm planning to set one index at a time to use best_compression and observe how much additional CPU it uses, then move on to the next largest index, and so on, making sure I don't reach so much CPU utilisation that ingestion becomes too slow and I start getting a backlog.

system · June 23, 2022, 9:24am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
