Performance implications of multiple grok patterns for a single log file

Hello! We're discussing within our team whether

  1. it's OK to log two sets of log entries to a single file and then have the Logstash config specify the two patterns,
    (OR)
  2. it makes sense to direct each set of entries to its own log file and have the Logstash config specify one pattern per log file.

Both seem to work from a functional perspective, but is it true that with option #1 Logstash ends up trying each pattern in turn (in the order specified in the config) until one matches, resulting in processing overhead?
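
For concreteness, option #1 would be a single grok filter carrying both patterns, roughly like this (a sketch; the patterns and field names are placeholders, not our real ones):

```
filter {
  grok {
    # Grok tries the patterns in order and stops at the first match,
    # so the second pattern is only attempted when the first fails.
    match => {
      "message" => [
        "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}",
        "%{IP:client} %{WORD:verb} %{URIPATHPARAM:request}"
      ]
    }
  }
}
```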

Yes, there is overhead to "falling through" a list of patterns, but there is also overhead in maintaining separate pipelines. Depending on your patterns, though, a lot of this can be minimised.

  1. Make each pattern fail as quickly as possible on mismatched input. This includes anchoring the pattern to the beginning of your input (^) so the regex engine fails immediately instead of retrying the pattern at every offset in the string (see the sketch after this list).
  2. If your inputs share a common prefix format, decode in phases: one grok filter extracts the common bits and saves the remainder to a temporary field (e.g., [@metadata][rest]), and a second picks up that field to parse it further. That way the engine doesn't have to start over and re-parse the common prefix with each subsequent pattern.
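
Putting both together, a sketch (the prefix format and the two payload patterns here are hypothetical):

```
filter {
  grok {
    # Phase 1: an anchored pattern extracts the shared prefix once.
    match => {
      "message" => "^%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:[@metadata][rest]}"
    }
  }
  grok {
    # Phase 2: each alternative is also anchored, so a mismatch fails
    # immediately rather than being retried at every offset.
    match => {
      "[@metadata][rest]" => [
        "^order=%{WORD:order_id} status=%{WORD:status}",
        "^user=%{WORD:user} action=%{WORD:action}"
      ]
    }
  }
}
```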

Thank you, @yaauie. Can you please elaborate on the overhead involved in maintaining a separate pipeline?

The idea of parsing the common bits once to reduce repeated parsing is nice, but the contents of our two sets of entries have nothing in common; it's as if the entries belong in their own log files.

The overhead of a separate pipeline is relatively low, but non-zero.

  • Depending on your queue configuration (in-memory or persisted), multiple pipelines can mean more data structures in memory or on disk. If your pipelines are consistently able to "keep up" with inbound load, this should be negligible.
  • If one pipeline is significantly more complex than the others, you may need to manually tune the number of workers per pipeline to reduce resource contention (see the pipelines.yml sketch below).
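
For reference, per-pipeline workers and queue settings live in pipelines.yml. A minimal sketch, with hypothetical pipeline ids, paths, and worker counts:

```
# pipelines.yml — the ids, paths, and counts below are illustrative.
- pipeline.id: app-access
  path.config: "/etc/logstash/conf.d/app-access.conf"
  pipeline.workers: 2    # a simple pipeline needs fewer workers
- pipeline.id: app-audit
  path.config: "/etc/logstash/conf.d/app-audit.conf"
  pipeline.workers: 6    # the heavier parsing pipeline gets more
  queue.type: persisted  # persisted queues trade disk I/O for durability
```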

If they are drastically different though, I would advise separate output from your applications leading to separate pipelines. The cognitive overhead of writing a single pipeline to do two drastically different things increases the likelihood of mistakes, and in my book that far outweighs any marginal performance difference.

I would also advise starting with the Dissect filter instead of Grok, and only using Grok where your input is too complex for Dissect. It is significantly easier to get started with from a development perspective, the patterns are simpler to maintain, and as a bonus it often ends up being significantly less CPU-intensive.
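
As an illustration, a dissect mapping for a hypothetical fixed-layout line such as `2024-01-15 12:00:01 INFO [worker-1] job started` could look like this:

```
filter {
  dissect {
    # Hypothetical layout: "2024-01-15 12:00:01 INFO [worker-1] job started"
    # Dissect splits on the literal delimiters; no regex engine is involved.
    mapping => {
      "message" => "%{date} %{time} %{level} [%{thread}] %{msg}"
    }
  }
}
```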

Thank you, @yaauie. The Dissect filter seems much simpler than Grok, especially when the content is predictable!
