Apologies if this is a redundant topic; I've searched far and wide and can't find a clear answer. And, if it isn't obvious, I'm new to the Elastic ecosystem.
I have many different log formats that need to be ingested into data stream(s) via Logstash; the formats are distinguished by their log.file.path value. The reason I am hesitant to use one big data stream for all of them is that the number and names of fields vary wildly between formats, and I read somewhere that it's considered bad practice to have a large number of fields in a single mapping.
I have a working Logstash configuration that routes to a different data stream based on the logfile path (sketched below), but I've also read that it's considered bad practice to have a bunch of "small" (low-document-count) data streams. So, at this point, I'm unsure what the "correct" way to proceed is. Do I:
Route all logs (with varying formats) to the same data stream, or
Separate the data streams by log format, and tune an index template for each?
Furthermore, if I do use a single data stream, is there some kind of ILM magic I can use to route documents to different backing indices based on their log.file.path value (in order to separate index templates by log format)?
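For reference, the path-based routing I have now looks roughly like this (the paths, dataset names, and host below are placeholders, not my real config):

```
output {
  # Route each known log format to its own data stream based on
  # log.file.path. Paths and dataset names here are made up.
  if [log][file][path] =~ /nginx/ {
    elasticsearch {
      hosts                 => ["https://localhost:9200"]
      data_stream           => "true"
      data_stream_type      => "logs"
      data_stream_dataset   => "nginx"
      data_stream_namespace => "default"
    }
  } else if [log][file][path] =~ /myapp/ {
    elasticsearch {
      hosts                 => ["https://localhost:9200"]
      data_stream           => "true"
      data_stream_type      => "logs"
      data_stream_dataset   => "myapp"
      data_stream_namespace => "default"
    }
  } else {
    # Fallback stream for anything unrecognized
    elasticsearch {
      hosts               => ["https://localhost:9200"]
      data_stream         => "true"
      data_stream_dataset => "generic"
    }
  }
}
```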
Thanks for your response.
The sizes also vary wildly. In terms of quantity, some logfiles produce > 100k entries/day, while others produce ~20/day. Most log entries are somewhere between 512 B and 1 KB.
Without actually checking, there should be somewhere between 30 and 50 unique logfiles. Most will be big (> 2k entries per day at ~1 KB each); I can only think of a small handful (maybe 5-10) with a significantly smaller footprint.
In terms of actual hardware, we're just starting to test this out and it's likely to change as needs develop; we're currently on 1 node with 8 GB RAM. Furthermore, we don't expect to use ELK for analysis of logs < 30 days old.
If the log formats are significantly different and you expect a high volume of logs, separating them into different data streams is generally better. This allows you to optimize mappings and index settings for each log type, which can lead to better performance and easier management in the long run. Use Logstash's conditional logic to route logs accordingly and manage each data stream's lifecycle with ILM policies tailored to the characteristics of the logs it contains.
I would start by separating each log source into its own data stream.
If you have fewer than 100 data streams, you have nothing to worry about, even on a single node.
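If the list of sources keeps growing, one way to avoid a long if/else chain is to derive the dataset name from the path and let the output route on it. This is just a sketch, not your exact setup: the grok pattern and host are assumptions, and it relies on the elasticsearch output's data_stream_auto_routing (on by default), which uses event-level data_stream.* fields to build the stream name.

```
filter {
  # Illustrative only: pull a dataset name out of the filename stem,
  # e.g. /var/log/nginx/access.log -> "access". Adjust the pattern
  # to whatever your real paths look like.
  grok {
    match => { "[log][file][path]" => ".*/%{DATA:[data_stream][dataset]}\.log" }
  }
  mutate {
    add_field => {
      "[data_stream][type]"      => "logs"
      "[data_stream][namespace]" => "default"
    }
  }
}

output {
  elasticsearch {
    hosts       => ["https://localhost:9200"]  # placeholder
    data_stream => "true"                      # routes on the fields set above
  }
}
```

With 30-50 logfiles, that still lands you comfortably under the ~100-data-stream figure above.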