I've been investigating some compression ratios of my live logs.
There are two main categories: auditd logs, and other logs like /var/log/messages etc.
Both are ingested via Filebeat 7.11 into a 7.11 cluster.
Here's what I found on an average day (totals divided over a 7-day period):
Auditd: raw logs 3.97 GB, index size 13.27 GB, ratio about 3.34
Other: raw logs 3.61 GB, index size 10 GB, ratio about 2.77
These seem like abnormally high ratios of index size to raw log size. In many places people talk about 30%-100% of the raw log size, whereas I'm getting closer to 300% of the raw log size.
Questions:
Can the overhead be reduced? (By overhead I mean the additional data Filebeat adds, like the Filebeat version, source IP, source hostname, etc.)
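For example, could some of those fields be dropped at the Filebeat end with a drop_fields processor? A rough sketch of what I mean (the field list below is only an example, I'd have to check what is actually unused in our documents):

```yaml
# filebeat.yml (sketch) - drop Beats metadata fields we never query on.
processors:
  - drop_fields:
      fields: ["agent.ephemeral_id", "agent.id", "agent.version", "ecs.version", "input.type"]
      ignore_missing: true
```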
I assume it's already using DEFLATE as the compression codec since it's 7.11. Are there ways to improve the compression ratios?
I'm using the standard Filebeat index; /var/log/messages goes into the standard filebeat-7.* index.
The auditd logs go into filebeat-auditd-7.*, but otherwise it's the same standard Filebeat setup.
I've sort of inherited support of this, so I'm not really sure whether any testing was done to optimize it.
I understand you can choose to index or not index certain fields, though I'm not sure that will help; it's not like these logs have a huge range of field types.
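Something like this, if I understand the mapping settings right (the field name here is just an example, and as far as I know it would only apply to newly created indices):

```json
{
  "mappings": {
    "properties": {
      "agent": {
        "properties": {
          "ephemeral_id": {
            "type": "keyword",
            "index": false,
            "doc_values": false
          }
        }
      }
    }
  }
}
```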
This implementation has been running for more than a year. Can index mappings be changed on live indexes, or must the index be trashed and recreated?
Also, best_compression cannot be set on running indexes. Is it possible to set it on the next index created in the sequence?
Take a look at index templates (or composable templates, as they are known now). The basic idea is that you set the template for the next set of indices created by the Beats, and the template defines the mappings and compression method.
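A minimal sketch of what that could look like for future indices (the template name and priority are just examples, and you'd want to check how it interacts with the existing Beats-managed template on your cluster before relying on it):

```
PUT _index_template/filebeat-best-compression
{
  "index_patterns": ["filebeat-7.*", "filebeat-auditd-7.*"],
  "priority": 250,
  "template": {
    "settings": {
      "index.codec": "best_compression"
    }
  }
}
```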
Also @michaelv, most people mean raw logs to primary storage when referring to the ratio, not total index size, which includes replicas. How many replicas do you have?
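You can compare the primary-only size with the total size (including replicas) with something like:

```
GET _cat/indices/filebeat-*?v&h=index,pri,rep,pri.store.size,store.size&s=store.size:desc
```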
BTW, that is a pretty old blog post; there is a slightly newer one here.
Thanks Christian. They are importing something like 50 million records per day, so I'm not sure the higher overhead is a good idea.
I've read about using warm nodes to apply higher compression; however, they only have 3 nodes, so that isn't really an option since the same nodes hold the hot indices.
I'm planning to enable best_compression one index at a time and observe how much additional CPU it uses, then do the next largest index, and so on, making sure I don't reach so much CPU utilisation that ingestion becomes too slow and I start getting a backlog.
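Roughly this, per index (the index name is just an example; as I understand it the codec is a static setting, so the index has to be closed to change it, and a force merge is needed to rewrite the existing segments with the new codec, which is itself heavy on CPU and IO):

```
POST filebeat-7.11.0-2021.05.01/_close

PUT filebeat-7.11.0-2021.05.01/_settings
{
  "index.codec": "best_compression"
}

POST filebeat-7.11.0-2021.05.01/_open

POST filebeat-7.11.0-2021.05.01/_forcemerge?max_num_segments=1
```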