Why not always use %{DATA} as the semantic in the initial grok match?

From the grok page, the suggested matching for log entries might look like this.

%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

In other words, the suggestion is to use syntax matches as often as possible.

However, what if I also have a field called true-client-ip that may contain garbage or may contain a real IP?
I don't want my grok parse to fail if the value in the true-client-ip field does not look like an ip.

So, I'm tempted to use %{DATA} for almost all my fields, and then to add extra decoration if I can grok the field using the hoped-for syntax.

For example, I am proposing an initial grok that uses %{DATA} to avoid grok parse failures, and then a second grok filter that tries to match the value of the true-client-ip field and, on a successful match, adds a new field like valid-true-client-ip.

filter {
  grok {
    match => { "true-client-ip" => "%{IP:valid-true-client-ip}" }
  }
}

Your proposed solution will work given the data content. There's definitely no reason not to use it!

So, I'm tempted to use %{DATA} for almost all my fields, and then to add extra decoration if I can grok the field using the hoped-for syntax.

The DATA pattern matches any run of characters (it is the non-greedy regexp .*?, and GREEDYDATA is the greedy .*), so the results might surprise you. I've seen a number of cases where people have used more than one DATA or GREEDYDATA pattern in the same expression and, for some types of input, got really weird results because one of the patterns matched too much.
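
As a hypothetical illustration (the sample message and the field names first and second are made up), here is how two wildcard patterns in one expression can split a line in ways that are easy to misread. It assumes the stock pattern definitions, where DATA is the lazy .*? and GREEDYDATA is the greedy .*, and an expression that is not anchored with ^ and $.

```
filter {
  grok {
    # Sample message (made up): "alpha beta gamma"
    #
    # "%{DATA:first} %{DATA:second}"
    #   first  matches "alpha"
    #   second matches the empty string: it is lazy and nothing after it
    #   forces it to grow. grok drops empty captures by default
    #   (keep_empty_captures), so no "second" field appears at all.
    #
    # "%{GREEDYDATA:first} %{GREEDYDATA:second}"
    #   first  matches "alpha beta" (it takes as much as it can)
    #   second matches "gamma"
    match => { "message" => "%{DATA:first} %{DATA:second}" }
  }
}
```

Neither split is wrong from the regexp engine's point of view, which is why a more specific pattern is usually the safer choice.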

In this particular case I'd use NOTSPACE instead of DATA, at least if the garbage IP address doesn't contain spaces. That'll also perform better than a DATA pattern that the regexp engine might need to backtrack from.
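
A sketch of how the two-stage approach might look with NOTSPACE. The log layout here (true-client-ip as the second space-separated token of message) is assumed for illustration only; adjust it to your real format. tag_on_failure is a standard grok option, and setting it to an empty list keeps the second filter from adding _grokparsefailure when the value turns out not to be an IP.

```
filter {
  # Stage 1: tolerant parse. NOTSPACE accepts any non-blank token,
  # so a garbage true-client-ip value does not fail the whole match.
  grok {
    match => { "message" => "%{IP:client} %{NOTSPACE:true-client-ip} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }

  # Stage 2: optional decoration. Adds valid-true-client-ip only when
  # the whole value is an IP; otherwise the event is left untouched.
  grok {
    match => { "true-client-ip" => "^%{IP:valid-true-client-ip}$" }
    tag_on_failure => []
  }
}
```

The ^ and $ anchors in the second filter are there so that a value which merely contains an IP somewhere inside it is not reported as valid.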


Ohh, nice! I'll use NOTSPACE instead of DATA and gain both performance and better discrimination. Thanks.