Array support for YML environment values

PicoCreator · February 27, 2017, 4:38pm

For example, the environment value of

LOG_EXCLUDE_LINES='["^A", "^B"]'

With yml configuration line

exclude_lines: "${LOG_EXCLUDE_LINES}"

This is currently not possible, as the value is substituted as a quoted string rather than an array.

Though there probably needs to be a very specific new syntax for such a scenario.

steffens · February 28, 2017, 12:02am

filebeat version?

these should work:

if you set exclude_lines: ${LOG_EXCLUDE_LINES} (without the double quotes) then you can configure the array in your environment variable using a 'superset' of json. e.g.

LOG_EXCLUDE_LINES='^A, ^B'

When inserting environment variables, the input value is constructed from the environment variable plus the surrounding string, and parsed afterwards. By using quotes in your configuration file you forced the setting to be a string value, not an array.

The configuration file format, plus environment variable usage, is documented here. Unfortunately the docs are missing the syntax support for 'advanced' objects via environment variables or command line (-E). The best documentation I can find is in the code. You can basically pass JSON via environment variables (or CLI) with small "enhancements":

strings can be unquoted, single-quoted or double-quoted (being lax on quotation makes it easier to handle quotation requirements on shell)
arrays at top-level do not require [], just use , to separate the elements.

Note: do not use double-quotes " for regular expressions, as \ will be interpreted as escape character.
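To make the two rules above concrete, here is a minimal Go sketch of how a lax, top-level list such as '^A, ^B' could be turned into an array. The helper name splitLaxList is hypothetical, and this is not the parser Filebeat actually uses: it ignores commas inside quotes, nested objects, and the escape handling mentioned in the note.

```go
package main

import (
	"fmt"
	"strings"
)

// splitLaxList is a hypothetical helper for illustration only.
// It splits a top-level, comma-separated value into elements and
// strips one pair of matching single or double quotes per element.
// It does not handle commas inside quotes, nesting, or escapes.
func splitLaxList(raw string) []string {
	parts := strings.Split(raw, ",")
	out := make([]string, 0, len(parts))
	for _, p := range parts {
		p = strings.TrimSpace(p)
		if len(p) >= 2 {
			first, last := p[0], p[len(p)-1]
			if (first == '\'' || first == '"') && first == last {
				p = p[1 : len(p)-1]
			}
		}
		if p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Unquoted, single-quoted and double-quoted elements are all accepted.
	for _, raw := range []string{`^A, ^B`, `'^A', '^B'`, `"^A", "^B"`} {
		fmt.Printf("%q -> %q\n", raw, splitLaxList(raw))
	}
}
```

All three spellings yield the same two elements in this sketch, which is the point of being lax about quotation: the shell can consume one level of quoting without changing the result.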

New github issue: https://github.com/elastic/beats/issues/3686

PicoCreator · February 28, 2017, 2:39am

Currently deploying: 5.2.1

But that makes sense. I was getting the errors from the "[]" array brackets, which I had copied directly from previous examples. This should work. Cheers!

PicoCreator · March 1, 2017, 3:07am

Slightly off topic: does Filebeat optimise the array of exclude regexes so they are evaluated in a single pass, or does it run an individual pass for each one?

Currently in my docker FileBeat container
https://hub.docker.com/r/picoded/docker-filebeat/

I worked around the array problem, by doing the following.

exclude_lines: [".*(([0-9]{2}\[(KNL|IKE|MGR|NET)\])|(POST /_bulk HTTP/1.1\" 200 [0-9]* \"-\" \"Go-http-client/1.1)|(level\=debug.*io\.rancher)).*"]

Breaking the regex down so it's more readable gives this:

.*(
     ([0-9]{2}\[(KNL|IKE|MGR|NET)\])|
     (POST /_bulk HTTP/1.1\" 200 [0-9]* \"-\" \"Go-http-client/1.1)|
     (level\=debug.*io\.rancher)
).*

Originally I was hoping to set it up as the following array, but I ran into the YML array setup issue with docker environment variables.

[
   ".*[0-9]{2}\[(KNL|IKE|MGR|NET)\].*",
   ".*(POST /_bulk HTTP/1.1\" 200 [0-9]* \"-\" \"Go-http-client/1.1).*",
   ".*(level\=debug.*io\.rancher).*"
]

Subsequently, I tried both. But at my scale, or perhaps at any scale, I couldn't reliably measure any difference.

But the question remains: is each pattern in the array evaluated separately, or are they consolidated and evaluated in a single pass?

If they are evaluated separately, I will probably keep my regex combined for a single pass.

steffens · March 1, 2017, 5:47pm

Well, we're talking micro-optimizations here. No idea how big an impact this will have with all filebeat machinery in place.

Using exclude_lines, the patterns are executed one after another. By putting all patterns into one big regular expression, there might be some room for improvement. On the other hand, if you can simplify the patterns to be mostly strings, with 5.3 some other improvements might kick in, replacing an O(n*n) algorithm with O(n*log(n)). This PR introduces a string matcher that tries to optimize some common patterns (unfortunately your patterns don't fit into the 'easy, common' patterns supported yet). See the commit message for some details. If you change pattern 2 to 'POST /_bulk HTTP/1.1" 200 .* "-" "Go-http-client/1\.1', patterns 2 and 3 might be potential targets for additional matcher optimizations (not yet implemented, feel free to open an enhancement request). That is, long-term, using arrays might be the better option.
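As a rough illustration of the two styles being compared, the following Go sketch checks the same lines against an array of patterns tried one after another and against a single combined alternation. It only uses the stdlib regexp package and is not Filebeat's matcher code; the function names and the sample log lines are made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// The three exclude patterns from this thread, without the surrounding .*
// and with non-capturing groups.
var patterns = []string{
	`\d{2}\[(?:KNL|IKE|MGR|NET)\]`,
	`POST /_bulk HTTP/1\.1" 200 [0-9]* "-" "Go-http-client/1\.1`,
	`level=debug.*io\.rancher`,
}

// compileEach compiles one regexp per pattern (the array style).
func compileEach(ps []string) []*regexp.Regexp {
	res := make([]*regexp.Regexp, len(ps))
	for i, p := range ps {
		res[i] = regexp.MustCompile(p)
	}
	return res
}

// compileCombined joins all patterns into a single alternation.
func compileCombined(ps []string) *regexp.Regexp {
	wrapped := make([]string, len(ps))
	for i, p := range ps {
		wrapped[i] = "(?:" + p + ")"
	}
	return regexp.MustCompile(strings.Join(wrapped, "|"))
}

// matchesAny tries the patterns one after another and stops at the first hit.
func matchesAny(res []*regexp.Regexp, line string) bool {
	for _, re := range res {
		if re.MatchString(line) {
			return true
		}
	}
	return false
}

func main() {
	each := compileEach(patterns)
	combined := compileCombined(patterns)

	// Made-up sample lines, for illustration only.
	lines := []string{
		`12[KNL] starting`,
		`"POST /_bulk HTTP/1.1" 200 123 "-" "Go-http-client/1.1"`,
		`level=debug msg="sync" pkg=io.rancher.agent`,
		`level=info msg="started"`,
	}
	for _, l := range lines {
		fmt.Printf("array=%v combined=%v  %s\n",
			matchesAny(each, l), combined.MatchString(l), l)
	}
}
```

Both styles give the same exclude decision for every line; what differs is only how the work is organised, which is the performance question raised above.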

Some notes on your regular expressions:

(1) When string-matching, the regular expressions already search for a sub-string in the input stream. Do not use .* at the beginning or end of a regular expression, as this increases the search space required for back-tracking. The result will be the same.
E.g. see this micro-benchmark result (searching for substring 'PATTERN') from the mentioned PR:

BenchmarkPatterns/Name=contains_'PATTERN'_with_'.*,_Matcher=Regex,_Content=mixed-4                  	   50000	     39312 ns/op
BenchmarkPatterns/Name=contains_'PATTERN',_Matcher=Regex,_Content=mixed-4                           	 1000000	      1660 ns/op

(2) Using () introduces a capture group. As we do not want to capture any content via the regular expression, use a non-capturing group (?:<regex term>).
(3) The stdlib regex engine (well, go1.8 at least) tries to drive all 'NFA threads' of execution in parallel on the stream of input. As none of the regular expressions share a common prefix, I'd assume the two styles, array style and one big regexp via |, make no real difference (same big O). Line filtering happens on an already buffered (in-memory) line, and the regex automaton needs to allocate some 'thread state' (well, it's using a memory pool I think) per | sub-term.

Optimizations (1) and (2) will be automatically available in the upcoming 5.3 release.
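Points (1) and (2) can be checked with the stdlib regexp package alone. The sketch below is not the benchmark from the PR: it uses testing.Benchmark from a plain main package, and the sample line is made up. It shows that '.*PATTERN.*' and 'PATTERN' give the same match verdict, that () adds a capture group while (?:) does not, and it prints timings that will vary by machine and Go version.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

var (
	withDotStar  = regexp.MustCompile(`.*PATTERN.*`)
	plain        = regexp.MustCompile(`PATTERN`)
	capturing    = regexp.MustCompile(`(KNL|IKE|MGR|NET)`)
	nonCapturing = regexp.MustCompile(`(?:KNL|IKE|MGR|NET)`)
)

// sameVerdict reports whether both forms agree on whether line matches.
// MatchString is an unanchored search, so the surrounding .* adds nothing.
func sameVerdict(line string) bool {
	return withDotStar.MatchString(line) == plain.MatchString(line)
}

// bench times re.MatchString(line) using the stdlib benchmark runner.
func bench(re *regexp.Regexp, line string) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			re.MatchString(line)
		}
	})
}

func main() {
	// Made-up sample line, for illustration only.
	line := "some log line that contains PATTERN near the end"

	fmt.Println("same verdict:", sameVerdict(line), sameVerdict("no match here"))
	fmt.Println("capture groups:", capturing.NumSubexp(), "vs", nonCapturing.NumSubexp())

	// Timings depend on hardware and Go version; no numbers are claimed here.
	fmt.Println("with .*:", bench(withDotStar, line))
	fmt.Println("plain:  ", bench(plain, line))
}
```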

I have a tool to convert a regular expression (after parsing + default optimizations in stdlib) to graphviz for visual inspection: https://github.com/urso/anareg

Applying this to your regex:

./anareg '.*(([0-9]{2}\[(KNL|IKE|MGR|NET)\])|(POST /_bulk HTTP/1.1" 200 [0-9]* "-" "Go-http-client/1.1)|(level=debug.*io\.rancher)).*' | dot -Tpng | imgcat

I get:

With some minor optimizations (automatically applied in the 5.3 release) this becomes:

./anareg '(?:\d{2}\[(?:KNL|IKE|MGR|NET)\])|(?:POST /_bulk HTTP/1\.1" 200 [0-9]* "-" "Go-http-client/1\.1)|(?:level=debug.*io\.rancher)' | dot -Tpng | ./imgcat
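For readers without graphviz at hand, a similar comparison can be sketched with the stdlib regexp/syntax package. This is not how anareg works internally (that is not shown in this thread); the function name inspect is hypothetical. It parses each regex, counts capture groups, and counts the instructions of the compiled program.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// The two regexes from the anareg invocations in this thread.
const original = `.*(([0-9]{2}\[(KNL|IKE|MGR|NET)\])|(POST /_bulk HTTP/1.1" 200 [0-9]* "-" "Go-http-client/1.1)|(level=debug.*io\.rancher)).*`
const optimized = `(?:\d{2}\[(?:KNL|IKE|MGR|NET)\])|(?:POST /_bulk HTTP/1\.1" 200 [0-9]* "-" "Go-http-client/1\.1)|(?:level=debug.*io\.rancher)`

// inspect parses pattern with Perl-style syntax, then returns the number
// of capture groups and the number of instructions in the compiled program.
func inspect(pattern string) (groups, insts int, err error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return 0, 0, err
	}
	groups = re.MaxCap()
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		return 0, 0, err
	}
	return groups, len(prog.Inst), nil
}

func main() {
	for _, p := range []string{original, optimized} {
		groups, insts, err := inspect(p)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("groups=%d instructions=%d\n", groups, insts)
	}
}
```

The first regex reports capture groups and a larger program; the second reports none and a smaller one, which is the textual counterpart of the simpler graph.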

PicoCreator · March 3, 2017, 3:49pm

Wow, I am really glad that automated optimisations for (1) and (2) are already happening.

As much as I would gladly use non-capturing groups, I worry that the next programmer after me wouldn't understand them properly. The automatic optimisation would actually make the code much easier to maintain.

Additionally, I had forgotten how heavily Go is optimised for regex, which literally makes this a micro-optimisation.

In that sense my experience with early-day Java regex libraries betrayed me, because back then there were measurable differences between the performance of a single huge regex and that of many smaller ones.

Cheers
