Implement "racheting" behaviour in Ingest Pipeline

I have some log lines that have contextually appended segments. Unfortunately, we're not using a structured logging format, so there's no simple way to parse them out.

My logs have one of these forms:

 2019-01-01 00:00:00:000 GMT [pool-12-thread-34] INFO Segment1 Segment2 Segment3 Segment4 com.company.class.name - Actual log message text here
 2019-01-01 00:00:00:000 GMT [pool-12-thread-34] INFO Segment1 Segment2 Segment3 com.company.class.name - Actual log message text here
 2019-01-01 00:00:00:000 GMT [pool-12-thread-34] INFO Segment1 - Actual log message text here
 2019-01-01 00:00:00:000 GMT [pool-12-thread-34] INFO - Actual log message text here
  1. That is to say, it always starts with a timestamp, pool info, and log level.
  2. Optionally, there's the text of segment 1.
  3. Optionally, there's the text of segment 2, but only if segment 1 existed.
  4. Optionally, there's the text of segment 3, but only if segment 2 existed.
  5. Finally, there's the message.

The grok processor supports an "if" condition, but not in the version of ES available to me (6.4.1). Is there a better way to do this that's still performant?

I could do this in one giant regex, but I don't know how to say "if no match, don't populate the field at all". Additionally, I fear that an ill-designed regex would test for segment 2 even after seeing that segment 1 didn't exist (which is wasteful).

Do you have any suggestions for how to implement something like this?

Currently, I'm using a regex along these lines:

 ^%{TimeStamp}%{SPACE}%{ThreadInfo}%{SPACE}(%{A:a}%{SPACE}(%{B:b}%{SPACE}(%{C:c}%{SPACE})?)?)? - %{text:message}$

It works... but it's pretty nasty, and I'm not sure about how much more efficient it could get if implemented better.
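
If it helps, grok seems to only populate a field when its named group actually participates in the match, so with nested optional groups the unmatched segment fields simply stay absent, and the nesting already keeps segment 2 from being tested when segment 1 is missing. A concrete version of that pattern might look like this (LOGTS, the seg* field names, and the NOTSPACE sub-patterns are illustrative assumptions, not the real pattern names):

 ^%{LOGTS:timestamp} %{WORD:tz} \[%{DATA:thread}\] %{LOGLEVEL:level} (%{NOTSPACE:seg1} (%{NOTSPACE:seg2} (%{NOTSPACE:seg3} )?)?)?- %{GREEDYDATA:msg}$

Here LOGTS would be defined via pattern_definitions, e.g. %{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND}:%{INT} to cover the colon-separated milliseconds.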

Create one grok or dissect expression that parses out the fields that are common to all logs and stores the remainder in a separate field. Then specify a second grok processor with a list of patterns that match the different formats. The grok processor will try them in sequence (put the most common first) and stop once a match has been found. Doing this in two steps avoids reparsing the initial fields multiple times and allows the remaining patterns to fail quickly, which is good for performance.
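
A sketch of what that two-step pipeline could look like on 6.4 (the pipeline name, the temporary rest field, the LOGTS definition, and the NOTSPACE segment patterns are illustrative assumptions; the first grok could equally be a dissect):

 PUT _ingest/pipeline/app-logs
 {
   "processors": [
     {
       "grok": {
         "field": "message",
         "patterns": [
           "^%{LOGTS:timestamp} %{WORD:tz} \\[%{DATA:thread}\\] %{LOGLEVEL:level} %{GREEDYDATA:rest}$"
         ],
         "pattern_definitions": {
           "LOGTS": "%{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND}:%{INT}"
         }
       }
     },
     {
       "grok": {
         "field": "rest",
         "patterns": [
           "^- %{GREEDYDATA:msg}$",
           "^%{NOTSPACE:seg1} - %{GREEDYDATA:msg}$",
           "^%{NOTSPACE:seg1} %{NOTSPACE:seg2} - %{GREEDYDATA:msg}$",
           "^%{NOTSPACE:seg1} %{NOTSPACE:seg2} %{NOTSPACE:seg3} - %{GREEDYDATA:msg}$"
         ]
       }
     },
     { "remove": { "field": "rest" } }
   ]
 }

The patterns list of a single grok processor is tried in order and stops at the first expression that matches, so put the most common shape first; extend the list for the longer four-segment-plus-class-name form, which I've left out for brevity.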

@Christian_Dahlqvist What's the mechanism that implements this "fail fast" behavior in the ingest pipeline? Is it that a grok processor fails when its field doesn't exist? Do further grok processors get evaluated after that?

Do you think this would be faster than the current a(b(c(d)?)?)? expression? One concern about "storing the remainder in a separate field" (even if that field is in the _ingest namespace and doesn't land in the final index document) is that it would cause a lot of string copies.
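
One way to probe this empirically: the simulate API's verbose flag shows each processor's output, and the grok processor's trace_match option adds _ingest._grok_match_index so you can see which pattern in the list fired. As far as I can tell, the first-match-wins part happens inside a single grok processor's patterns array, while a processor that fails outright aborts the whole pipeline unless ignore_failure or an on_failure block is set. A minimal check, reusing the assumed rest field from the sketch above:

 POST _ingest/pipeline/_simulate?verbose=true
 {
   "pipeline": {
     "processors": [
       {
         "grok": {
           "field": "rest",
           "patterns": [
             "^- %{GREEDYDATA:msg}$",
             "^%{NOTSPACE:seg1} - %{GREEDYDATA:msg}$"
           ],
           "trace_match": true
         }
       }
     ]
   },
   "docs": [
     { "_source": { "rest": "Segment1 - Actual log message text here" } }
   ]
 }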
