I am pretty new to this, but I have set up an Elastic stack to start centralizing logs.
So far I have only tested it with Nginx logs and syslog.
My problem is the number of fields that get created (many of them useless or conflicting), even when I use the Nginx module of Logstash. I get 41 fields for my Nginx logs, when only 5 fields would be enough to map all the data.
If I'm not wrong, this is going to take more disk space than we need.
I was thinking modules would be my solution, but apparently they are not. Is the only solution to implement a grok filter for each type of log?
If I go this way, I will need to create an index for each type of log, right? I read that creating too many indices can decrease performance. I think I will have about 15 different types of logs (taking into consideration that they will be organised by date, so typically one index per day for syslog, kept for 2 weeks). Do you think that will be a problem (we could reach maybe 2000 indices)?
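To be clear, this is roughly what I have in mind (a sketch only; it assumes each input sets a type field, and the patterns and index names are placeholders, not a tested config):

    filter {
      if [type] == "nginx_access" {
        grok {
          # nginx's default access log format is close to the Apache combined format
          match => { "message" => "%{COMBINEDAPACHELOG}" }
        }
      } else if [type] == "syslog" {
        grok {
          match => { "message" => "%{SYSLOGLINE}" }
        }
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        # one index per log type and per day (index names must be lowercase)
        index => "%{type}-%{+yyyy.MM.dd}"
      }
    }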
I do not run elasticsearch nowadays; I used to run it a couple of years ago. There are a lot of folks who know vastly more than me about elasticsearch performance. That said...
Increasing the number of indexes does impact performance (I think it is actually the number of shards more than the number of indexes). Whether the performance impact is too much is something only you can answer. To do that you have to run performance tests with different configurations and see whether the performance meets your needs.
Years ago there were performance issues with sparse indexes (indexes that have many fields mapped, but few fields on each document). This was fixed in Elasticsearch 6.0 by upgrading to Lucene 7.0, so it is no longer a problem.
If you have two types of logs where there are conflicts between fields then they must be in separate indexes. For example, if one type of log comes through beats and has a [host][name] field, and another type of log comes in through tcp and has a [host] field which is a string then you will get mapping exceptions if you try to put them in the same index. But if the sets of fields on the two types of logs overlap but do not conflict then they can go in the same index.
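To make that concrete, this is roughly what happens if you try it by hand against a test index (index name and values made up):

    PUT conflict-demo/_doc/1
    {
      "host": { "name": "web01" }
    }

    PUT conflict-demo/_doc/2
    {
      "host": "10.0.0.1"
    }

The first document gets [host] mapped as an object, so the second one is rejected with a mapping exception, because [host] cannot be both an object and a plain value in the same index.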
When the ELK stack first came out lots of folks posted helpful blog posts and suchlike explaining how to get started. Most of those have not moved forward as the Elastic stack has advanced. The typical way to do index lifecycle management was to do daily indexes and purge them after a number of days with a tool like curator.
The default value for the index option is still logstash-%{+yyyy.MM.dd}, but by default that option is ignored now. Instead, indexes are managed using elasticsearch's ILM (index lifecycle management). By default the index is rolled over every 50 GB or 30 days, whichever comes first.
Do you care how long your data is kept provided it is kept at least two weeks?
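If two weeks is all you need, that fits in a single small ILM policy. A rough sketch (the policy name and the delete phase timing are mine, adjust to taste):

    PUT _ilm/policy/my-logs-policy
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": { "max_size": "50gb", "max_age": "30d" }
            }
          },
          "delete": {
            "min_age": "14d",
            "actions": { "delete": {} }
          }
        }
      }
    }

Note that min_age in the delete phase is measured from rollover, so data is kept at least that long after the index stops being written to.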
Badger is right: too many shards is a problem, and shards that are too small are also a problem.
I would suggest keeping everything at the defaults.
As you said, you are using Logstash to ingest the records, so drop the particular records that you don't need there. Filebeat will put all logs into one index, filebeat-<version>-<date> (you can change this name, but I would suggest keeping the version-date naming as is).
Then ILM will take care of the rest, like Badger said. You can adjust the ILM policy once it is created.
You can also control how many shards you need, etc., via the Filebeat configuration.
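For example, something like this in filebeat.yml (the numbers are only an illustration, not a recommendation, and the exact ILM options depend on your Filebeat version):

    # filebeat.yml (sketch)
    setup.template.settings:
      index.number_of_shards: 1
      index.number_of_replicas: 1
    # let ILM handle rollover and retention, as discussed above
    setup.ilm.enabled: true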
Those extra fields are not using any space; they are just mappings.
Whatever data you actually write is the only thing that uses space.
What this means is that if you look at your index pattern you might see 2000 fields but only use 20 of them, and then space is only used by those 20.
The 2000 fields are defined and ready to be used, that is all.
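If you want to check that on your own data, recent Elasticsearch versions (7.15+) have an analyze disk usage API that reports how much disk each field actually uses; something like (index name is a placeholder):

    POST /your-index/_disk_usage?run_expensive_tasks=true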
I did some tests and I can't agree.
For 20 GB of raw logs, by default with dynamic mappings I get a 65 GB index.
If I keep only the most important fields, with the same log file I get an 11 GB index.
It's true, but beyond that I think that if we use Elastic for basic text search in logs, only mapping the message field with the full log line is a good idea: we keep just the timestamp and message fields, which results in less disk and memory usage.
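Roughly what I mean, expressed as an index template (a sketch; the template name and pattern are placeholders):

    PUT _index_template/minimal-logs
    {
      "index_patterns": ["minimal-logs-*"],
      "template": {
        "mappings": {
          "dynamic": false,
          "properties": {
            "@timestamp": { "type": "date" },
            "message":    { "type": "text" }
          }
        }
      }
    }

With dynamic set to false, the other fields still end up in _source (so you can still read the raw document) but they are not indexed, which is where most of the disk saving comes from.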