Implement "ratcheting" behaviour in Ingest Pipeline

amomchilov · March 8, 2019, 9:45pm

I have some log lines that have contextually appended segments. Unfortunately, we're not using a structured logging format, so there's no simple way to parse them out.

My logs have one of these forms:

 2019-01-01 00:00:00:000 GMT [pool-12-thread-34] INFO Segment1 Segment2 Segment3 Segment4 com.company.class.name - Actual log message text here
 2019-01-01 00:00:00:000 GMT [pool-12-thread-34] INFO Segment1 Segment2 Segment3 com.company.class.name - Actual log message text here
 2019-01-01 00:00:00:000 GMT [pool-12-thread-34] INFO Segment1 - Actual log message text here
 2019-01-01 00:00:00:000 GMT [pool-12-thread-34] INFO - Actual log message text here

That is to say, it always starts with a timestamp, pool info, and log level.
Optionally, there's the text of segment 1.
Optionally, there's the text of segment 2, but only if segment 1 existed.
Optionally, there's the text of segment 3, but only if segment 2 existed.
Finally, there's the message.

The grok processor supports if, but not in the version of ES available to me (6.4.1). Is there a better way to do this, in a performant way?

I could do this in one giant regex, but IDK how to say "if no match, don't populate the field at all". Additionally, I fear that an ill-designed regex would still test for segment 2 after seeing that segment 1 didn't exist (which is wasteful).

Do you guys have any suggestions for how to implement something like this?

Currently, I'm using a regex along these lines:

 ^%{TimeStamp}%{SPACE}%{ThreadInfo}%{SPACE}(%{A:a}%{SPACE}(%{B:b}%{SPACE}(%{C:c}%{SPACE})?)?)? - %{text:message}$

It works... but it's pretty nasty, and I'm not sure about how much more efficient it could get if implemented better.
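For readers who want to see the nested-optional ("ratchet") idea in isolation, here is a minimal Python sketch. It is not ingest pipeline configuration and it does not use grok; the `LINE` regex, the group names `a`, `b`, `c`, and the `parse` helper are all hypothetical stand-ins. It assumes each segment is a run of non-space characters that does not start with `-`, and it handles at most three segments, so it does not cover the class-name token in the first two sample lines. Groups that did not take part in the match are dropped, so the corresponding fields are simply absent.

```python
import re

# Hypothetical stand-in for the grok expression. A later segment is only
# attempted if the earlier one matched, because it sits inside the earlier
# segment's optional group.
LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}:\d{3} \w+) "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<level>[A-Z]+) "
    r"(?:(?P<a>[^\s-]\S*) (?:(?P<b>[^\s-]\S*) (?:(?P<c>[^\s-]\S*) )?)?)?"
    r"- (?P<message>.*)$"
)


def parse(line):
    """Return a dict of the fields that matched, or None if the line does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    # Optional groups that did not participate come back as None; drop them
    # so the field is not populated at all.
    return {k: v for k, v in m.groupdict().items() if v is not None}


if __name__ == "__main__":
    prefix = "2019-01-01 00:00:00:000 GMT [pool-12-thread-34] INFO "
    print(parse(prefix + "- Actual log message text here"))
    print(parse(prefix + "Segment1 - Actual log message text here"))
```

Excluding a leading `-` from a segment is a design choice for this sketch: it stops the segment pattern from consuming the ` - ` separator and then backtracking.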

Christian_Dahlqvist · March 10, 2019, 11:00am

Create one grok or dissect expression that parses out the fields that are common to all logs and stores the remainder in a separate field. Then specify a second grok processor with a list of patterns that match the different formats. The grok processor will try them in sequence (put the most common first) and stop once a match has been found. Doing this in two steps avoids reparsing the initial fields multiple times and allows the remaining patterns to fail quickly, which is good for performance.
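The two-step approach described above can be sketched outside Elasticsearch in plain Python. This is only an illustration of the control flow, not pipeline configuration: `HEADER`, `REST_PATTERNS`, `SEG`, and `parse_two_step` are hypothetical names, the ordering of the patterns is an assumption about which format is most common, and segments are assumed to be non-space tokens that do not start with `-`.

```python
import re

# Step 1: parse the fields common to every line and keep the remainder.
HEADER = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}:\d{3} \w+) "
    r"\[(?P<thread>[^\]]+)\] (?P<level>[A-Z]+) (?P<rest>.*)$"
)

# Assumed shape of one segment: a non-space token not starting with "-".
SEG = r"[^\s-]\S*"

# Step 2: anchored patterns for the remainder, tried in order.
# The order here (fewest segments first) is an assumption for the sketch.
REST_PATTERNS = [
    re.compile(r"^- (?P<message>.*)$"),
    re.compile(rf"^(?P<a>{SEG}) - (?P<message>.*)$"),
    re.compile(rf"^(?P<a>{SEG}) (?P<b>{SEG}) - (?P<message>.*)$"),
    re.compile(rf"^(?P<a>{SEG}) (?P<b>{SEG}) (?P<c>{SEG}) - (?P<message>.*)$"),
]


def parse_two_step(line):
    """Parse the common header once, then stop at the first remainder pattern that matches."""
    head = HEADER.match(line)
    if head is None:
        return None
    fields = head.groupdict()
    rest = fields.pop("rest")  # the temporary remainder field is not kept
    for pattern in REST_PATTERNS:
        m = pattern.match(rest)
        if m is not None:
            fields.update(m.groupdict())
            return fields
    return None  # no remainder pattern matched


if __name__ == "__main__":
    prefix = "2019-01-01 00:00:00:000 GMT [pool-12-thread-34] INFO "
    print(parse_two_step(prefix + "Segment1 - Actual log message text here"))
```

Because every remainder pattern is anchored at the start and a segment cannot contain spaces, a pattern that does not fit is rejected after examining only the first few tokens of the remainder.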

amomchilov · March 11, 2019, 6:16pm

@Christian_Dahlqvist What's the mechanism that implements this "fail fast" behavior in the ingest pipeline? Is it when a grok filter's field doesn't exist? Do further grok filters get evaluated after that?

Do you think this would be faster than the current a(b(c(d)?)?)? expression? One concern about "storing the remainder in a separate field" (even if that field is in the _ingest namespace and doesn't land in the final index document) is that it would cause a lot of string copies.

system · April 8, 2019, 6:16pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
