Grok pattern performance


(Horst Birne) #1

Hey folks,

There was a nice article on the Elastic homepage a while ago about the performance of grok filters, so I set up this script to measure throughput for the filter.

As I understand it, using %{DATA:sth} is mostly a bad habit and results in slow filters, so I tried getting rid of those and using the built-in grok patterns instead. However, this did not result in higher throughput in the script.

An example:
The log looks like this:

2017-02-14 14:33:22\ttimezone:+1\tservername(program)\t[11902] PASS username@domain IP.Ad.re.ss blagroup filterbla http://www.example.com/j/guet.gif GET

Our current/previous pattern was this:

%{DATA:date}\ttimezone:%{DATA:timezone}\t%{DATA:ddd}\(%{DATA:program}\)\t\[[0-9]{1,7}\][ ]{1,2}%{DATA:action}[ ]{1,2}%{DATA:username}[ ]{1,30}%{DATA:src}[ ]{1,9}%{DATA:usergroup}[ ]{1,16}%{DATA:filtergroup}[ ]{1,30}%{DATA:url}[ ]{1,7}%{DATA:method}$

With this I get about 150k/s throughput with the provided script.

I then replaced the %{DATA} parts with built-in patterns:

%{TIMESTAMP_ISO8601:date}\ttimezone:%{NOTSPACE:timezone}\t%{NOTSPACE:ddd}\(%{NOTSPACE:program}\)\t\[[0-9]{1,7}\][ ]{1,2}%{WORD:action}[ ]{1,2}%{NOTSPACE:username}[ ]%{IPV4:src}[ ]{1,9}%{NOTSPACE:usergroup}[ ]{1,16}%{WORD:filtergroup}[ ]{1,30}%{URI:url}%{WORD:method}

This, however, resulted in only 90k/s throughput.

So my question: is my assumption right that %{DATA} is not fast? Is there any mistake I made with the patterns?

Thank you for your response


(Paris Mermigkas) #2

%{DATA} is indeed bad practice for the most part, as it involves a lot of backtracking (how much depends on its position inside the actual string).
%{NOTSPACE}, at least for arbitrary strings like in your second example, is usually better.
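
To make the difference concrete, here is a minimal Python sketch. Grok's DATA is the lazy `.*?` and NOTSPACE is `\S+`; everything else here (the shortened sample line, the made-up IP address, the `build` and `throughput` helpers) is an assumption for illustration. Python's `re` engine is not the regex engine Logstash uses, so the numbers it prints only hint at the trend and will not match Logstash throughput.

```python
import re
import timeit

# Grok's DATA is the lazy `.*?`; NOTSPACE is `\S+`.
DATA = r".*?"
NOTSPACE = r"\S+"

# Hypothetical, shortened sample line (IP address made up for illustration).
LINE = "PASS username@domain 10.0.0.1 blagroup"


def build(field):
    """Compile a four-field line pattern that uses `field` for every capture."""
    names = ("action", "username", "src", "usergroup")
    body = "[ ]{1,2}".join(f"(?P<{name}>{field})" for name in names)
    return re.compile(body + "$")


def throughput(pattern, line, n=20000):
    """Return a rough matches-per-second estimate for `pattern` on `line`."""
    seconds = timeit.timeit(lambda: pattern.match(line), number=n)
    return n / seconds


data_pat = build(DATA)
notspace_pat = build(NOTSPACE)

# Both variants extract the same fields; only the matching work differs.
print(data_pat.match(LINE).groupdict() == notspace_pat.match(LINE).groupdict())
print(f"DATA:     {throughput(data_pat, LINE):,.0f} matches/s")
print(f"NOTSPACE: {throughput(notspace_pat, LINE):,.0f} matches/s")
```

The lazy version has to try growing each capture one character at a time and re-test the separator after every step, while `\S+` can consume a whole token in one go.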

Another possible source of unnecessary overhead is the IPV4 pattern, as it actually checks for valid IP addresses.
Here's what it expands to:

(?<![0-9])(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2}))(?![0-9])

If you know beforehand that the IPs you will get are valid, you can substitute it with a much simpler pattern, like [0-9.]+ which could be 3-5x quicker for that part.
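
The trade-off can be sketched in Python. The strict pattern below is the IPV4 expansion quoted above, rebuilt from one repeated octet fragment; the `accepts` helper and the sample addresses are hypothetical. The point is only what each pattern accepts, not how fast it runs in Logstash.

```python
import re

# One octet of the IPV4 expansion quoted above.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})"

# Strict: four valid octets separated by dots, not glued to other digits.
STRICT_IPV4 = re.compile(
    r"(?<![0-9])(?:" + r"[.]".join([OCTET] * 4) + r")(?![0-9])"
)

# Loose: any run of digits and dots, with no validation at all.
LOOSE_IP = re.compile(r"[0-9.]+")


def accepts(pattern, text):
    """Return True if `pattern` matches the whole of `text`."""
    return pattern.fullmatch(text) is not None


# Hypothetical sample tokens: one valid address and two malformed ones.
for candidate in ("192.168.0.1", "999.1.1.1", "1..2"):
    print(candidate, accepts(STRICT_IPV4, candidate), accepts(LOOSE_IP, candidate))
```

The loose pattern accepts all three tokens, while the strict one accepts only the first, so the substitution is only safe when something upstream already guarantees well-formed addresses.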

URI is also expanded to a very long pattern, so if you want to skip URI validation as well, %{NOTSPACE} should work equally well.


(Horst Birne) #3

Thank you very much.

Your explanation makes sense. After switching out the IPV4, URI and timestamp patterns I got 165k/s throughput, so it is faster than the DATA-based pattern.

I will try to avoid long and expensive patterns when cleaning up my configs now :)
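
For anyone finding this later: the exact replacement pattern was not posted in this thread, so the following is only a guess at its general shape, written as a Python regex rather than grok so it can be run directly. The assumptions are: `[0-9.]+` for the IP, `\S+` (NOTSPACE) for the URL, a loose digit-and-separator match for the timestamp, literal parentheses and brackets around the program name and PID as in the sample log line, and a made-up IP address in place of the placeholder.

```python
import re

# Assumed simplified line pattern; each piece avoids validation on purpose.
LINE_RE = re.compile(
    r"(?P<date>[0-9-]+ [0-9:]+)"                    # loose stand-in for the timestamp
    r"\ttimezone:(?P<timezone>\S+)"
    r"\t(?P<ddd>[^\s(]+)\((?P<program>[^)]+)\)"     # servername(program)
    r"\t\[[0-9]{1,7}\][ ]{1,2}(?P<action>\S+)"      # [pid] action
    r"[ ]{1,2}(?P<username>\S+)"
    r"[ ]{1,30}(?P<src>[0-9.]+)"                    # unvalidated IP
    r"[ ]{1,9}(?P<usergroup>\S+)"
    r"[ ]{1,16}(?P<filtergroup>\S+)"
    r"[ ]{1,30}(?P<url>\S+)"                        # unvalidated URL
    r"[ ]{1,7}(?P<method>\S+)$"
)

# The sample line from the first post, with real tabs and a hypothetical IP.
SAMPLE = (
    "2017-02-14 14:33:22\ttimezone:+1\tservername(program)\t[11902] PASS "
    "username@domain 10.0.0.1 blagroup filterbla "
    "http://www.example.com/j/guet.gif GET"
)

fields = LINE_RE.match(SAMPLE).groupdict()
for name, value in fields.items():
    print(f"{name}: {value}")
```

Because none of the fields are validated, a malformed line can still match and produce nonsense values, which is the price of the extra throughput.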

