Best practice for managing many log formats with data streams

Apologies if this is a redundant topic; I've searched far and wide and can't find a clear answer. And, if it isn't obvious, I'm new to the Elastic ecosystem.

I have many different log formats that need to be ingested into data stream(s) via Logstash, with each format distinguished by its log.file.path value. The reason I am hesitant to use one big data stream for all of them is that the number and names of fields vary wildly between formats, and I read somewhere that it's considered bad practice to have a large number of fields in a single index.
I have a working Logstash configuration that selects a different data stream based on the logfile path (see the sketch at the end of this post), but I also read that it's considered bad practice to have a bunch of 'small' (low document count) data streams. So, at this point, I'm unsure what is considered the "correct" way to proceed. Do I:

  • route all logs (with varying formats) to the same data stream, or
  • separate the data streams based on the format of the logs within them, and tune the index template?

Furthermore, if I am to use the same data stream, is there some kind of ILM magic I can use to route documents to different backing indices based on their log.file.path value (in order to separate index templates by log format)?
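
For context, here's a minimal sketch of the kind of path-based routing I have working today, assuming Filebeat-style events where the source file is in [log][file][path]; the path pattern and dataset names are made up:

    filter {
      # derive a dataset name from the source file path (patterns are hypothetical)
      if [log][file][path] =~ /nginx/ {
        mutate {
          add_field => {
            "[data_stream][type]"      => "logs"
            "[data_stream][dataset]"   => "nginx"
            "[data_stream][namespace]" => "default"
          }
        }
      } else {
        mutate {
          add_field => {
            "[data_stream][type]"      => "logs"
            "[data_stream][dataset]"   => "generic"
            "[data_stream][namespace]" => "default"
          }
        }
      }
    }

    output {
      elasticsearch {
        hosts       => ["https://localhost:9200"]
        data_stream => "true"
        # with data_stream_auto_routing (the default), each event lands in the
        # stream named by its data_stream.* fields, e.g. "logs-nginx-default"
      }
    }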

What sizes are we talking about?
For some people "small" means a few MB; for others it means several GB.

  • How many nodes are in your cluster?
  • How much memory do they have?
  • How many different data sources do you have?
  • How big do you expect each one to be?

The answer is most likely to be:

separate the data streams based on the format of the logs within them, and tune the index template

But if we're talking about a thousand data streams, each with only a dozen documents, then the answer will change.
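
To make "tune the index template" concrete, here's a purely illustrative template for a hypothetical nginx stream; the name, pattern, and mapping are placeholders:

    PUT _index_template/logs-nginx
    {
      "index_patterns": ["logs-nginx-*"],
      "data_stream": {},
      "priority": 200,
      "template": {
        "mappings": {
          "properties": {
            "http": {
              "properties": {
                "response": {
                  "properties": {
                    "status_code": { "type": "short" }
                  }
                }
              }
            }
          }
        }
      }
    }

Each format's template only needs to declare the fields that format actually produces, which keeps per-stream mappings small.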

Thanks for your response.
The sizes also vary wildly. In terms of quantity, for some logfiles we expect > 100k logs/day; for others, ~20/day. Most logs are somewhere between 512 B and 1 KB, so the biggest sources work out to roughly 50-100 MB/day.
Without actually checking, there should be somewhere between 30 and 50 unique logfiles. Most will be big (> 2k logs/day at ~1 KB each); I can only think of a small handful (maybe 5-10) that will have a significantly smaller footprint.

In terms of the actual hardware, we're just starting to test this out and it's likely to change as needs develop: we're currently on 1 node with 8 GB RAM. Furthermore, we don't expect to use ELK for analysis of logs less than 30 days old.


If the log formats are significantly different and you expect a high volume of logs, separating them into different data streams is generally better. This allows you to optimize mappings and index settings for each log type, which can lead to better performance and easier management in the long run. Use Logstash's conditional logic to route logs accordingly and manage each data stream's lifecycle with ILM policies tailored to the characteristics of the logs it contains.
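
As a sketch, a hypothetical ILM policy for one of the higher-volume streams might look like this (the thresholds are illustrative, not recommendations):

    PUT _ilm/policy/logs-highvolume
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": {
                "max_primary_shard_size": "50gb",
                "max_age": "30d"
              }
            }
          },
          "delete": {
            "min_age": "30d",
            "actions": { "delete": {} }
          }
        }
      }
    }

The policy gets attached to a data stream through its index template, via the index.lifecycle.name index setting.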

I would start by separating each log source into its own data stream.
If you have < 100 data streams, you have nothing to worry about, even on a single node.
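
If you want to sanity-check the count, you can list every data stream (or a subset by pattern) with:

    GET _data_stream
    GET _data_stream/logs-*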

Thanks @TimV and @nazgul2 for the information and useful advice. I'll move forward with separating each log format into its own data stream.