Raw logs to data compression ratio

Hi All,

I've been investigating some compression ratios of my live logs.

Two main categories: auditd logs and other logs such as /var/log/messages.
Both are ingested via Filebeat 7.11 into a cluster running 7.11.

So here is what I've found (per day on average, taken over a 7-day period):
Auditd raw logs 3.97G
Index size 13.27G
Ratio about 3.34

Other raw logs 3.61G
Index size 10G
Ratio 2.77

These seem like abnormally high ratios of raw logs to index size. In many places, people talk about the index being 30%-100% of the raw log size; I'm seeing around 200% overhead on top of the raw log size (roughly three times the raw size).

Questions:

  1. Can the overhead be reduced? (Overhead being the additional data like Filebeat version, source IP, source hostname, etc.)
  2. I assume it's already using deflate as the compression mode since it's 7.11. Are there ways to improve the compression ratio?

Regards,

Michael

What do the mappings for these look like, have you spent time working on them?

Hi Mark,

I'm using the standard Filebeat index. The /var/log/messages logs go into the standard filebeat-7.* index.
The auditd logs go into filebeat-auditd-7.*, which otherwise uses the same standard Filebeat setup.

Regards,

Michael

I've sort of inherited support of this. I'm not really sure whether any testing was done to optimize it.

I understand you can choose whether or not to index certain fields, though I'm not sure that will help. It's not like these logs have a huge range of field types.

The default mappings are not super efficient, so you should really take a look at them.

This implementation has been running for more than a year. Can index mappings be changed on live indexes, or must the index be trashed and recreated?

Also, best_compression cannot be set on running indexes. Is it possible to set it on the next index created in the sequence?

Take a look at index templates (or composable templates, as they are known now). Basically, the idea is that you set the template for the next set of indices created by the Beats, and it defines the mappings and the compression method.

Saves manually doing it.

Okay I see what you mean.
If I put "index.codec: best_compression" in elasticsearch.yml, following the recommendation in "Part 2.0: The true story behind Elasticsearch storage requirements | Elastic Blog", would that affect all newly created indexes?

Regards,

Michael

If it's in an appropriate template, yes.
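As a rough sketch of the template route: the snippet below builds a small legacy index template that only sets `index.codec` and prepares (but does not send) the PUT request. The template name `filebeat-best-compression`, the `order` of 10, the `filebeat-*` pattern and the `http://localhost:9200` address are all assumptions for illustration. It uses the legacy `_template` endpoint on the assumption that Filebeat 7.11 loaded its own mappings as a legacy template, so that the two merge by `order`. In 7.x a matching composable template takes precedence over legacy templates rather than merging with them, so check which kind your cluster uses first. As far as I know, recent Elasticsearch versions also do not accept index-level settings such as `index.codec` in elasticsearch.yml, which is why a template is the usual place for it.

```python
import json
import urllib.request

# Assumed values for illustration only; adjust to your cluster.
ES_URL = "http://localhost:9200"
TEMPLATE_NAME = "filebeat-best-compression"  # hypothetical template name


def build_template(patterns, order=10):
    """Return a legacy index template body that only overrides the codec."""
    return {
        "index_patterns": list(patterns),
        # Assumed to be higher than the order of Filebeat's own template,
        # so this setting wins when the templates are merged.
        "order": order,
        "settings": {"index": {"codec": "best_compression"}},
    }


def build_request(body, es_url=ES_URL, name=TEMPLATE_NAME):
    """Prepare (but do not send) the PUT request for the template."""
    return urllib.request.Request(
        url=f"{es_url}/_template/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


body = build_template(["filebeat-*"])
request = build_request(body)
print(request.get_method(), request.full_url)
print(json.dumps(body, indent=2))
# To apply it for real: urllib.request.urlopen(request)
```

The setting only takes effect on indices created after the template is in place; existing indices keep their current codec.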

Also @michaelv, most people mean raw logs to primary storage when they talk about the ratio, not the total index size, which includes replicas. How many replicas do you have?

BTW that is a pretty old blog; there is a slightly newer one here

Here are some of the latest

Hi Stephen,

1 replica only. The blog post was given by ELK support staff.
I'll read the newer blog post and see what it's about.

Regards,

Michael

One odd question.

About using index.codec: best_compression.
I thought 7.10 uses deflate as the compression by default. Why would you need to set best_compression explicitly?

I've tried modifying the template, and yes, the new index picks up the best_compression setting.

Regards,

Michael

No, it does not.

Best compression is not the default as it adds a significant amount of overhead at indexing time.
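For context, the default codec compresses stored fields with LZ4, while `best_compression` switches to DEFLATE for a better ratio at some CPU cost. One way to confirm which codec an index ended up with is to read it from the index settings (`GET <index>/_settings`). A minimal sketch, assuming the response has the usual `{index: {"settings": {"index": {...}}}}` shape; the sample response and index names below are made up for illustration:

```python
def codec_by_index(settings_response):
    """Map each index name to its configured codec.

    An index with no explicit index.codec is reported as "default",
    since the setting is absent from the response unless it was set.
    """
    result = {}
    for index_name, payload in settings_response.items():
        index_settings = payload.get("settings", {}).get("index", {})
        result[index_name] = index_settings.get("codec", "default")
    return result


# Hypothetical response, trimmed to the relevant keys.
sample = {
    "filebeat-7.11.0-000041": {
        "settings": {"index": {"number_of_replicas": "1"}}
    },
    "filebeat-7.11.0-000042": {
        "settings": {
            "index": {"codec": "best_compression", "number_of_replicas": "1"}
        }
    },
}

for name, codec in sorted(codec_by_index(sample).items()):
    print(f"{name}: {codec}")
```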

Thanks Christian. They are importing something like 50 million records per day, so I'm not sure a higher overhead is a good idea.

I've read about using warm nodes to apply higher compression. However, they only have 3 nodes, so that is not really an option, as the same nodes would also hold the hot indices.

For lower data volumes the overhead of the best compression codec may be irrelevant.

50M records / day

~600 Records / sec Avg

That is certainly in the realm where best compression is of value. I have use cases at 10-20x+ that rate using best compression.

Basically it is a TCO and performance equation.

You might need to trade (add) a couple of percent of CPU, etc., for greater storage savings.

I would say test and measure.
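The roughly 600 records/sec average quoted above follows directly from the daily volume. A quick check of the arithmetic, assuming ingestion is spread evenly over 24 hours (real traffic usually has peaks well above the average):

```python
RECORDS_PER_DAY = 50_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

avg_rate = RECORDS_PER_DAY / SECONDS_PER_DAY
print(f"{avg_rate:.0f} records/sec on average")  # prints "579 records/sec on average"
```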

So, as I was saying before, if your numbers are for total storage: most people refer to primary storage when they speak of the ratio.

Auditd raw logs 3.97G
Index size 13.27G (so primary is about 6.6G)
Ratio about 3.34 (so the primary ratio is about 1.7)

Other raw logs 3.61G
Index size 10G (so primary is about 5G)
Ratio 2.77 (so the primary ratio is about 1.4)

Not that this suddenly makes your use case better or reduces overall storage, but that is the math most folks mean when talking about ratios.

That said, tuning your mappings and enabling best_compression might make a meaningful difference.
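The total-versus-primary math can be recomputed from the figures quoted in the thread. This is only a sketch: it assumes the reported index size includes replicas and that each replica is the same size as its primary, so primary storage is the total divided by (1 + replicas).

```python
def ratios(raw_gb, total_index_gb, replicas=1):
    """Return (total_ratio, primary_ratio) of index size to raw log size."""
    primary_gb = total_index_gb / (1 + replicas)
    return total_index_gb / raw_gb, primary_gb / raw_gb


# Daily figures quoted in the thread, with 1 replica.
for label, raw, total in [("auditd", 3.97, 13.27), ("other", 3.61, 10.0)]:
    total_ratio, primary_ratio = ratios(raw, total, replicas=1)
    print(f"{label}: total {total_ratio:.2f}x, primary {primary_ratio:.2f}x")
# auditd: total 3.34x, primary 1.67x
# other: total 2.77x, primary 1.39x
```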

Thanks Stephen!

I'm planning to set best_compression on one index at a time and observe how much additional CPU it uses, then follow with the next largest index, and so on, making sure I don't reach so much CPU utilisation that ingestion becomes too slow and I start getting a backlog.
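That staged plan could be sketched roughly as below. The function names, the 80% CPU limit and the sample CPU readings are hypothetical; how CPU and backlog are actually measured (node stats, Filebeat monitoring, etc.) is left out.

```python
def rollout_order(index_sizes_gb):
    """Return index names sorted from largest to smallest daily size."""
    return sorted(index_sizes_gb, key=index_sizes_gb.get, reverse=True)


def safe_to_continue(cpu_percent_samples, cpu_limit=80.0, backlog_events=0):
    """Decide whether to enable best_compression on the next index.

    Stops the rollout if average CPU exceeds the limit or a backlog has formed.
    """
    if not cpu_percent_samples:
        return False  # no measurements yet, so do not proceed blindly
    average_cpu = sum(cpu_percent_samples) / len(cpu_percent_samples)
    return average_cpu <= cpu_limit and backlog_events == 0


# Daily index sizes from the thread; CPU readings are invented examples.
sizes = {"filebeat-7.*": 10.0, "filebeat-auditd-7.*": 13.27}
print(rollout_order(sizes))
print(safe_to_continue([55.0, 60.0, 58.0]))
print(safe_to_continue([85.0, 90.0], backlog_events=1200))
```

Starting with the largest index gives the clearest signal of the extra CPU cost early on; starting with the smallest would be the more cautious alternative.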
