ELK cluster disk space usage optimization

Hello,

As a newcomer to Elasticsearch, I'm wondering what tips you have for optimizing my indices in order to reduce their size on disk.

My setup:
3 nodes, 3 shards, 1 replica.

At the moment I'm indexing around 6 million documents per day, which costs me 6 to 8 GB of disk space (it varies). I've used the Logstash mutate filter to remove some unneeded fields from the Apache logs I store, like:
remove_field => [ "@message", "@source", "ident", "auth", "ZONE" ]

Is there anything I can put in elasticsearch.yml that would help me reduce disk usage? I set index.compress.stored: true, but I can't see any dramatic change.

Thanks in advance.

The amount of space the data takes up on disk once indexed depends on the fields you have in the documents as well as your mappings. You can optimise the mappings used by Logstash to reduce the size, and this blog post contains a discussion of what can be done and what the tradeoffs are.
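
To put the numbers from the question in perspective, here is a rough back-of-the-envelope calculation of the average on-disk size per document. It assumes decimal gigabytes (10^9 bytes) and takes the 6 to 8 GB per day at face value; the thread does not say whether that figure includes the replica copy, so treat the result as an estimate only. The function name is made up for this sketch.

```python
def bytes_per_doc(gb_per_day, docs_per_day):
    """Average on-disk bytes per document, assuming 1 GB = 10**9 bytes."""
    return gb_per_day * 10**9 / docs_per_day


DOCS_PER_DAY = 6_000_000  # "around 6 million documents per day" from the question

for gb in (6, 7, 8):
    size = bytes_per_doc(gb, DOCS_PER_DAY)
    print(f"{gb} GB/day -> about {size:.0f} bytes per document")
```

That works out to roughly 1 to 1.3 KB per Apache log event, so how each field is mapped has a noticeable effect on the total.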

remove_field => [ "@message", "@source", "ident", "auth", "ZONE" ]

Do you really have @message and @source fields? Which version of Logstash are you running?

Yep, is that wrong?

It seems you're running a really really old version of Logstash (1.1 or something). It doesn't matter for your disk space, but I think you should look into upgrading.

How did you conclude that I'm using 1.1?

My version is 1.5.4...

That's weird. @message was a standard field in old Logstash releases but the field was renamed to message. Same thing with @source IIRC. Anyway, this is unrelated to your question.

Actually, I just figured that out myself. Now my filter looks like this:

remove_field => [ "message", "_source", "@source", "ident", "auth", "ZONE" ]

but for some reason I am still seeing the _source field :(

Any ideas why?

The '_source' field has to be disabled in the index mapping. '_source' is useful to have, so before you remove it I would recommend looking at how you map the fields you are actually indexing and also consider whether you need the '_all' field or not. This is described in the blog post I linked to earlier.
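
As a rough illustration of the mapping changes mentioned above, the sketch below builds an index template body that disables '_all' and maps a few fields explicitly. It assumes the Elasticsearch 1.x-era template syntax that matches Logstash 1.5, and the index pattern and field names (clientip, response, bytes) are assumptions based on typical Apache log parsing rather than anything taken from this thread. '_source' is left enabled here, in line with the recommendation above.

```python
import json

# Hypothetical index template body (Elasticsearch 1.x-style syntax assumed).
# The "logstash-*" pattern and the field names below are illustrative only.
template = {
    "template": "logstash-*",
    "mappings": {
        "_default_": {
            # Skip the catch-all field if you never search across all fields.
            "_all": {"enabled": False},
            "properties": {
                # Store exact values only; no analyzed copy of the string.
                "clientip": {"type": "string", "index": "not_analyzed"},
                # Numeric types are usually more compact than strings.
                "response": {"type": "short"},
                "bytes": {"type": "long"},
            },
        }
    },
}

# Print the JSON body that would be sent to the cluster as an index template.
print(json.dumps(template, indent=2))
```

A template only affects indices created after it is installed, so with daily indices the change would show up from the next day's index onwards.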