Well, we're talking micro-optimizations here. No idea how big an impact this will have with all filebeat machinery in place.
Using exclude_lines, the patterns are executed one after another. Putting all patterns into one big regular expression might leave some room for improvement. On the other hand, if you can simplify the patterns to be mostly plain strings, some other improvements in 5.3 might kick in, replacing an O(n*n) algorithm with O(n*log(n)). This PR introduces a string matcher that tries to optimize some common patterns (unfortunately your patterns don't fit into the 'easy, common' patterns supported yet). See the commit message for some details. After changing pattern 2 to 'POST /_bulk HTTP/1.1" 200 .* "-" "Go-http-client/1\.1', patterns 2 and 3 might be potential targets for additional matcher optimizations (not yet implemented, feel free to open an enhancement request). That is, long-term, using arrays might be the better option.
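To make the two styles concrete, here is a rough Go sketch of "array of patterns, tried one after another" versus "one big alternation". This is not filebeat's actual code: the helper names (compileAll, matchAny, combine) and the sample log lines are made up for illustration, and only the stdlib regexp package is used.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// The three patterns from this thread, already without the surrounding '.*'.
var patterns = []string{
	`\d{2}\[(?:KNL|IKE|MGR|NET)\]`,
	`POST /_bulk HTTP/1\.1" 200 [0-9]* "-" "Go-http-client/1\.1`,
	`level=debug.*io\.rancher`,
}

// Hypothetical log lines, invented for this example.
var sampleLines = []string{
	`Mar  1 10:00:00 host charon: 05[IKE] sending keep alive`,
	`10.0.0.1 - - "POST /_bulk HTTP/1.1" 200 123 "-" "Go-http-client/1.1"`,
	`time="2017-03-01" level=debug msg="sync" pkg=io.rancher.agent`,
	`time="2017-03-01" level=info msg="started"`,
}

// compileAll compiles every pattern separately (array style).
func compileAll(ps []string) ([]*regexp.Regexp, error) {
	res := make([]*regexp.Regexp, 0, len(ps))
	for _, p := range ps {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, err
		}
		res = append(res, re)
	}
	return res, nil
}

// matchAny tries the patterns one after another and stops at the first hit.
func matchAny(res []*regexp.Regexp, line string) bool {
	for _, re := range res {
		if re.MatchString(line) {
			return true
		}
	}
	return false
}

// combine builds one big alternation, wrapping each pattern in a
// non-capturing group so that '|' inside a pattern keeps its scope.
func combine(ps []string) (*regexp.Regexp, error) {
	parts := make([]string, len(ps))
	for i, p := range ps {
		parts[i] = "(?:" + p + ")"
	}
	return regexp.Compile(strings.Join(parts, "|"))
}

func main() {
	res, err := compileAll(patterns)
	if err != nil {
		panic(err)
	}
	combined, err := combine(patterns)
	if err != nil {
		panic(err)
	}
	for _, line := range sampleLines {
		fmt.Printf("array=%v combined=%v  %s\n",
			matchAny(res, line), combined.MatchString(line), line)
	}
}
```

Both variants should agree on every line. Which one is faster depends on the patterns and the input, so it is worth benchmarking with real log data before committing to either.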
Some notes on your regular expressions:
- When string-matching, regular expressions already search for a sub-string in the input stream. Do not use .* at the beginning or end of a regular expression, as this increases the search space required for backtracking. The result will be the same.
E.g. see this micro-benchmark result (searching for substring 'PATTERN') from the mentioned PR:
BenchmarkPatterns/Name=contains_'PATTERN'_with_'.*,_Matcher=Regex,_Content=mixed-4 50000 39312 ns/op
BenchmarkPatterns/Name=contains_'PATTERN',_Matcher=Regex,_Content=mixed-4 1000000 1660 ns/op
- Using () introduces a capture group. As we do not want to capture any content via the regular expression, use a non-capturing group (?:<regex term>) instead.
- The stdlib regex engine (well, go1.8 at least) tries to drive all 'NFA threads' of execution in parallel over the input stream. As none of the regular expressions share a common prefix, I'd assume both styles, the array style and one big regexp using |, make no real difference (same big O). Line filtering happens on an already buffered (in-memory) line, and the regex automaton needs to allocate some 'thread state' (well, it's using a memory pool, I think) per | sub-term.
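The first two notes can be checked quickly with the stdlib regexp package. The snippet below is a small sketch; the helper names (sameResult, groups) and the test lines are made up for illustration. It shows that padding a pattern with .* does not change whether a line matches, and that () adds a capture group while (?:) does not.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	padded = regexp.MustCompile(`.*PATTERN.*`) // with leading and trailing '.*'
	plain  = regexp.MustCompile(`PATTERN`)     // plain sub-string search
)

// sameResult reports whether both variants agree on the given line.
func sameResult(line string) bool {
	return padded.MatchString(line) == plain.MatchString(line)
}

// groups returns the number of capture groups in a pattern.
func groups(pattern string) int {
	return regexp.MustCompile(pattern).NumSubexp()
}

func main() {
	for _, line := range []string{"xx PATTERN yy", "PATTERN", "no hit here"} {
		fmt.Printf("%-15q same result: %v\n", line, sameResult(line))
	}
	fmt.Println("capture groups with ():  ", groups(`(KNL|IKE|MGR|NET)`))
	fmt.Println("capture groups with (?:):", groups(`(?:KNL|IKE|MGR|NET)`))
}
```

MatchString is unanchored, which is why the padded and the plain pattern give the same answer. Only the amount of work done by the engine differs.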
Optimizations (1) and (2) will be automatically available in the upcoming 5.3 release.
I have a tool to convert a regular expression (after parsing + default optimizations in stdlib) to graphviz for visual inspection: https://github.com/urso/anareg
Applying this to your regex:
./anareg '.*(([0-9]{2}\[(KNL|IKE|MGR|NET)\])|(POST /_bulk HTTP/1.1" 200 [0-9]* "-" "Go-http-client/1.1)|(level=debug.*io\.rancher)).*' | dot -Tpng | imgcat
I get:
With some minor optimizations (automatically applied in the 5.3 release), this becomes:
./anareg '(?:\d{2}\[(?:KNL|IKE|MGR|NET)\])|(?:POST /_bulk HTTP/1\.1" 200 [0-9]* "-" "Go-http-client/1\.1)|(?:level=debug.*io\.rancher)' | dot -Tpng | ./imgcat
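If graphviz is not at hand, a rough text-only comparison of the two expressions is possible with the stdlib regexp/syntax package. The sketch below is not how anareg works internally; the inspect helper is made up for illustration. It parses each pattern, counts its capture groups, and counts the instructions in the compiled program.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

const (
	original  = `.*(([0-9]{2}\[(KNL|IKE|MGR|NET)\])|(POST /_bulk HTTP/1.1" 200 [0-9]* "-" "Go-http-client/1.1)|(level=debug.*io\.rancher)).*`
	optimized = `(?:\d{2}\[(?:KNL|IKE|MGR|NET)\])|(?:POST /_bulk HTTP/1\.1" 200 [0-9]* "-" "Go-http-client/1\.1)|(?:level=debug.*io\.rancher)`
)

// inspect parses a pattern with the same flags regexp.Compile uses and
// returns the number of capture groups and of compiled instructions.
func inspect(pattern string) (captures, instructions int, err error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return 0, 0, err
	}
	captures = re.MaxCap()
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		return 0, 0, err
	}
	return captures, len(prog.Inst), nil
}

func main() {
	for _, p := range []string{original, optimized} {
		captures, instructions, err := inspect(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("captures=%d instructions=%d\n", captures, instructions)
	}
}
```

The instruction count is only a crude proxy for matching cost, but it makes the effect of dropping the .* padding and the capture groups visible without rendering the graph.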