Avoiding too many useless fields in Elasticsearch / Kibana


I am pretty new to this, but I have set up an Elastic stack to start centralizing logs.

I have just tested with Nginx logs and syslog.
My problem is the number of fields created (many of them useless and conflicting), even though I use the Logstash Nginx module. I have 41 fields for my Nginx logs when only 5 would be enough to map all the data.

If I'm not wrong, this is going to take more disk space than we need.

I was thinking modules would be my solution, but apparently they are not. Is the only solution to implement a grok filter for each type of log?

If you know you only need a limited set of fields, then map them using an index template and turn off dynamic mapping.
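As a sketch, such a template could look like this in Kibana Dev Tools (the template name, index pattern, and the five fields are placeholders — adapt them to your actual Nginx fields). With `"dynamic": false`, unmapped fields are still kept in `_source` but are not indexed; `"dynamic": "strict"` would reject documents containing them instead:

```
PUT _index_template/nginx-logs
{
  "index_patterns": ["nginx-*"],
  "template": {
    "mappings": {
      "dynamic": false,
      "properties": {
        "@timestamp": { "type": "date" },
        "message":    { "type": "text" },
        "clientip":   { "type": "ip" },
        "response":   { "type": "short" },
        "bytes":      { "type": "long" }
      }
    }
  }
}
```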

Thank you for your response.

If I go this way, I will need to create an index for each type of log, right? I read that creating too many indexes can decrease performance. I think I am going to have about 15 different types of logs (taking into consideration that they will be organised by date, so typically one index per day for syslog, stored for 2 weeks). Do you think it will be a problem (we are going to reach maybe 2000 indexes)?

I do not run elasticsearch these days; I used to run it a couple of years ago. There are a lot of folks who know vastly more than me about elasticsearch performance. That said...

Increasing the number of indexes does impact performance. (I think it is actually the number of shards more than the number of indexes.) Whether the performance impact is too much is something only you can answer. To do that you have to run performance tests with different configurations and see whether the performance meets your needs.

Years ago there were performance issues with sparse indexes (indexes that have many fields mapped, but few fields on each document). This was fixed in Elasticsearch 6.0 by upgrading to Lucene 7.0, so it is no longer a problem.

If you have two types of logs where there are conflicts between fields then they must be in separate indexes. For example, if one type of log comes through beats and has a [host][name] field, and another type of log comes in through tcp and has a [host] field which is a string then you will get mapping exceptions if you try to put them in the same index. But if the sets of fields on the two types of logs overlap but do not conflict then they can go in the same index.
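If you want to see such a conflict for yourself, here is a quick Dev Tools sketch (the index name is made up). The second request should fail with a mapper_parsing_exception, because [host] is already mapped as an object and cannot then accept a plain string:

```
PUT conflict-demo/_doc/1
{ "host": { "name": "web01" } }

PUT conflict-demo/_doc/2
{ "host": "web01" }
```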

When the ELK stack first came out lots of folks posted helpful blog posts and suchlike explaining how to get started. Most of those have not moved forward as the Elastic stack has advanced. The typical way to do index lifecycle management was to do daily indexes and purge them after a number of days with a tool like curator.

The default value for the index option is still logstash-%{+yyyy.MM.dd}, but by default that option is ignored now. Instead indexes are managed using elasticsearch's ILM. By default the index is rolled every 50 GB or 30 days, whichever comes first.
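As a sketch, an ILM policy matching that default rollover plus a two-week retention could look like this (the policy name is a placeholder; note that min_age in the delete phase counts from rollover, not from when a document was written):

```
PUT _ilm/policy/logs-two-weeks
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "30d" }
        }
      },
      "delete": {
        "min_age": "14d",
        "actions": { "delete": {} }
      }
    }
  }
}
```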

Do you care how long your data is kept provided it is kept at least two weeks?

Badger is right: too many shards is a problem, and shards that are too small are a problem too.

I would suggest keeping it all default.
As you said, you are using Logstash to ingest the records, so drop the particular records that you don't need. Filebeat will put all logs into one index named filebeat-version-date (you can change this name), but I would suggest keeping the version-date naming as is.
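For the dropping part, a minimal Logstash filter sketch (the condition and field names are examples only, and the prune filter comes from the logstash-filter-prune plugin, which may need to be installed separately):

```
filter {
  # drop whole records you do not need at all (example condition)
  if "debug" in [message] {
    drop { }
  }

  # keep only a whitelist of fields, removing everything else
  prune {
    whitelist_names => [ "^@timestamp$", "^message$", "^host$" ]
  }
}
```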

Then ILM will take care of the rest, like Badger said. You can adjust the ILM policy once it is created.
You can also control how many shards you need, etc., via the Filebeat configuration.

For my system logs, yes. I will have just one log type that I need to keep for at least 1 year (the proxy logs).

So your advice is to let dynamic mapping do the job, even if it creates more fields than I need?

Those extra fields are not using any space; they are just mappings.
Only the data you actually write uses space.

What this means is that if you go into your index pattern you may see 2000 fields but be using only 20 of them, and hence only those 20 use space.
The other fields are just defined and ready to be used, that is all.
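If you want to verify where the space actually goes, recent Elasticsearch versions (7.15+, as a technical preview) have an analyze disk usage API that reports per-field storage (the index name here is a placeholder):

```
POST /my-index-000001/_disk_usage?run_expensive_tasks=true
```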

I did some tests and I can't agree.
For a 20 GB raw log file, with dynamic mapping by default, I get a 65 GB index.
If I keep only the most important fields, with the same log file I get an 11 GB index.

That is true, and I never said otherwise.

I am talking about a mapping which has 3000 fields, for example: if you write data to only 100 of those fields, then it only uses space for those 100.

If you write everything from your log file, then it will definitely use more space than when you write your data in a structured manner.

That's true, but beyond that I think that if we use Elastic for basic full-text search in logs, mapping only a message field containing the full log line is a good idea: we keep only the timestamp and message fields, which results in less disk and memory usage.
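A sketch of that minimal two-field template (the template name and index pattern are placeholders): with dynamic mapping off and only @timestamp and message mapped, everything else stays in _source without being indexed:

```
PUT _index_template/raw-logs
{
  "index_patterns": ["raw-logs-*"],
  "template": {
    "mappings": {
      "dynamic": false,
      "properties": {
        "@timestamp": { "type": "date" },
        "message":    { "type": "text" }
      }
    }
  }
}
```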