Logs are getting big in Elasticsearch

Hey folks, I have a running ELK stack with Filebeat, Logstash and Kibana.

It started to receive logs, especially Apache access and Apache error logs. Unfortunately the indexed logs are getting really big, e.g.:

12GB of access logs become 21GB in Elasticsearch
109MB of access logs become 255MB in Elasticsearch
3.9MB of error logs become 32MB in Elasticsearch

After every "benchmark" I deleted the Elasticsearch data completely and ran:
sudo du -sch /var/lib/elasticsearch

So I googled a lot and figured out that setting "index.codec: best_compression" in elasticsearch.yml could help a bit. Trying this, I got different results:

109MB of access logs were 384MB in Elasticsearch; after some time this shrank to 246MB
3.9MB of error logs were 32MB in Elasticsearch; after some time this shrank to 14MB

The first question is: is this normal?
The second question is: what did I do wrong?
The third question is: how can I easily reduce the amount of disk space used?
I know it can be done by dropping the _source field or by cutting away part of the message, but I don't know exactly how to do that.

Here is the use case:

We receive Apache access and Apache error logs from webservers; in total that is about 800MB of log files every day. They should be stored for roughly 6-8 weeks. Elasticsearch doesn't need to be very fast; high compression is important.

Here is the elasticsearch.yml:
cluster.name: mura_test
node.name: Ashigaru
network.host: localhost
index.codec: best_compression

All other settings are the defaults.

Logstash conf:
    filter {
        if [type] == "apache-access" {
            grok {
                match => { "message" => "%{COMBINEDAPACHELOG}" }
            }
            date {
                match => [ "timestamp", "dd/MMM/YYYY:HH:mm:ss Z" ]
            }
        }
        if [type] == "apache-error" {
            grok {
                patterns_dir => ["/etc/logstash/patterns"]
                match => { "message" => "\[(?<timestamp>%{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{YEAR})\]\s\[.*:%{LOGLEVEL:loglevel}\]\s\[\w+\s%{NUMBER:pid}\]\s(?<issue>\(\d+\)\D*:\s)?(\[client\s%{HOSTPORT:client}\])%{GREEDYDATA:message}" }
            }
            date {
                match => [ "timestamp", "EEE MMM dd HH:mm:ss.SSSSSS YYYY" ]
            }
            mutate {
                remove_field => ["timestamp", "[fields][env]", "[beat][version]", "[beat][name]", "input_type"]
            }
        }
    }

    output {
        stdout { codec => rubydebug }
        elasticsearch {
            hosts => "localhost:9200"
        }
    }

PS: I am thinking of creating an index pattern for the log files, but this seems complicated.

The size your data takes up on disk depends a lot on the extent to which you enrich it, e.g. by adding fields and information through the grok filter, and on which mappings you are using. By default Logstash and Elasticsearch provide mappings that give a lot of flexibility in order to make it easy to get started. This however results in many fields being indexed in multiple ways, which takes up disk space. This blog post is getting a bit old, but it talks about the effect different types of mappings can have on storage space.

It is also worth noting that very small shards tend to compress less well than larger shards, so try to avoid having too many small shards, as this can be inefficient. A shard size ranging from a few GB to a few tens of GB is common and likely a good target to aim for.
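As a rough sketch of what that can look like in practice: an index template along these lines applies best_compression per index and maps new string fields as keyword only, instead of indexing them both as analysed text and as a keyword sub-field. The template name, the single primary shard and the logstash-* pattern are just examples, and the syntax assumes Elasticsearch 5.x (on 2.x you would use a not_analyzed string instead of keyword):

    PUT _template/apache_logs
    {
      "template": "logstash-*",
      "settings": {
        "index.codec": "best_compression",
        "index.number_of_shards": 1
      },
      "mappings": {
        "_default_": {
          "dynamic_templates": [
            {
              "strings_as_keyword": {
                "match_mapping_type": "string",
                "mapping": { "type": "keyword" }
              }
            }
          ]
        }
      }
    }

With around 800MB of logs per day, a single primary shard per daily index also keeps you away from the many-small-shards problem mentioned above.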

So instead of splitting a log message up into client IP, referrer, request, etc. (in my understanding every field enriches the original log and this costs more space), is it wiser to keep one bigger field so that it compresses better?

You want to parse the log file and split it into fields, as this allows you to analyse the data in Kibana. This parsing is not what I mean by enrichment.

Some Logstash plugins, e.g. the geoip and user agent filters, add additional information and fields based on fields in your data, and this will increase the number of fields and size of the document. It is possible that not all these fields are required for your use case, and in that case you can remove some of these to save space.
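As an illustration (the geoip filter is not part of your config, so the field names here are only examples): many of these filters let you restrict what they add in the first place, which is often easier than removing fields afterwards:

    geoip {
        source => "clientip"
        # only add the location fields that are actually queried
        fields => ["country_name", "city_name", "location"]
    }

Anything else that you never query can be dropped with a mutate { remove_field => [...] }, as you already do for the Filebeat metadata in your apache-error block.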

Also try to make sure that your mappings match how you will query the data, as this can affect the size the data takes up on disk as described in the blog post I linked to.

Are you using replicas? They will probably increase the disk being used, since you keep a copy of your data.
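If you don't need them, e.g. while you are still testing on a single node, something like this switches them off for the existing indices (assuming the default logstash-* index names; on a single node the replica copies stay unassigned anyway, so this mainly matters once you add more nodes):

    PUT logstash-*/_settings
    {
      "index": {
        "number_of_replicas": 0
      }
    }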

Besides that, as mentioned above, you could remove some fields: based on your config file you are currently storing the raw log line plus the fields it was split into.
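For example, something along these lines in your apache-access block would drop the raw line once it has been parsed (just a sketch; remove_field on a filter is only applied when the filter succeeds, so events that fail to parse keep their original message for troubleshooting):

    grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
        # only applied on a successful match, so unparsed events
        # keep their raw message
        remove_field => ["message"]
    }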

I'd also suggest watching this great talk from Elastic{ON} 2016, where they cover, for example, how to improve compression ratios: https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

Hope it helps!

Regards,

Well, I didn't change anything else in the elasticsearch.yml, so I'm using the "default" number of replicas.
Thanks for the link, I will look into that.

By "raw log", you mean my "message" field, correct?
