Ingest pipeline grok pattern with field name containing spaces

I am using an ingest pipeline to parse a tab-separated log message coming from Filebeat. One of the fields can contain spaces. In the example below, "Gui Process" should be parsed into the sourceName field. However, what happens is "Gui" gets mapped to sourceName and "Process" gets mapped to logType. I tried a custom regex (?<sourceName>[^)]+)\s+- instead of WORD for sourceName, but it didn't help. It seems like something very simple; any help would be great. I also tried Dissect but couldn't get it to work with tabs either.

Log line:
2020-12-23T00:00:02.183-08:00 7520977794441 0x000a ABC.Laptop. Gui Process Information GDIObjects: 2078, USERHandles: 5826

Grok Pattern:
%{TIMESTAMP_ISO8601:timestamp}%{SPACE}%{NUMBER:relativeTime}%{SPACE}%{WORD:thread}%{SPACE}%{HOSTNAME:processName}%{SPACE}%{WORD:sourceName}%{SPACE}%{WORD:logType}%{SPACE}%{GREEDYDATA:message}
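
For reference, the grok processor in my ingest pipeline looks roughly like this (the pipeline name here is made up):

```json
PUT _ingest/pipeline/tab_log_pipeline
{
  "description": "parse tab separated log line",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{TIMESTAMP_ISO8601:timestamp}%{SPACE}%{NUMBER:relativeTime}%{SPACE}%{WORD:thread}%{SPACE}%{HOSTNAME:processName}%{SPACE}%{WORD:sourceName}%{SPACE}%{WORD:logType}%{SPACE}%{GREEDYDATA:message}"
        ]
      }
    }
  ]
}
```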

Expected
timestamp: 2020-12-23T00:00:02.183-08:00
relativeTime: 7520977794441
thread: 0x000a
processName: ABC.Laptop.
sourceName: Gui Process
logType: Information
message: GDIObjects: 2078, USERHandles: 5826

but get
timestamp: 2020-12-23T00:00:02.183-08:00
relativeTime: 7520977794441
thread: 0x000a
processName: ABC.Laptop.
sourceName: Gui Process
logType: Process
message: Information\tGDIObjects: 2078, USERHandles: 5826

Hi,

Doing it with dissect is not very hard.

filter {
   dissect {
     mapping => {
       "message" => "%{timestamp} %{relativeTime} %{thread} %{proccessName} %{sourceName} %{+sourceName} %{logType} %{message}"
     }
   }
}

That will give you this result:

{
         "message" => "GDIObjects: 2078, USERHandles: 5826",
         "logType" => "Information",
    "relativeTime" => "7520977794441",
      "sourceName" => "Gui Process",
        "@version" => "1",
       "timestamp" => "2020-12-23T00:00:02.183-08:00",
          "thread" => "0x000a",
    "proccessName" => "ABC.Laptop.",
      "@timestamp" => 2020-12-30T07:26:10.581Z
}

However, because you say "One of the fields can have spaces", it becomes more complicated.

Now, assuming that

2020-12-23T00:00:02.183-08:00 7520977794441 0x000a ABC.Laptop.

and

Information GDIObjects: 2078, USERHandles: 5826

are always built up the same way, and only the part where you currently have "Gui Process" can differ, you could decompose the log event in three stages.

Dissect the first part:

%{timestamp} %{relativeTime} %{thread} %{message}

Create a custom regex that captures everything up to the word "Information" and stores the rest in message.
Something like this:

(?<processName>\S+)\s(?<sourceName>.+?)\s(?<message>Information\s.+)

Then dissect the remaining message.
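
Putting the three stages together, a sketch could look like this (field names come from your pattern; anchoring on the literal "Information" is just illustrative and assumes the log type always follows the source name):

```
filter {
  # Stage 1: split off the fixed prefix, keep the rest in "message"
  dissect {
    mapping => {
      "message" => "%{timestamp} %{relativeTime} %{thread} %{message}"
    }
  }
  # Stage 2: capture everything up to the word "Information"
  grok {
    match => {
      "message" => "(?<processName>\S+)\s(?<sourceName>.+?)\s(?<message>Information\s.+)"
    }
    overwrite => [ "message" ]
  }
  # Stage 3: split the log type from the rest of the message
  dissect {
    mapping => {
      "message" => "%{logType} %{message}"
    }
  }
}
```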

Not sure if this will help but it is a solution to the problem at hand.

Good luck,
Paul.

So, I was looking at some grok filters I use myself, where I combine raw regex with grok patterns, like this:

filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{NUMBER:relativeTime} %{WORD:thread} %{HOSTNAME:processName} (?<sourceName>\w+\s\w+) %{WORD:logType} %{GREEDYDATA:message}"
    }
  }
}

This gives me the following output.

{
         "logType" => "Information",
        "@version" => "1",
      "@timestamp" => 2020-12-30T07:48:57.042Z,
      "sourceName" => "Gui Process",
       "timestamp" => "2020-12-23T00:00:02.183-08:00",
     "processName" => "ABC.Laptop.",
          "thread" => "0x000a",
    "relativeTime" => "7520977794441",
         "message" => [
        [0] "2020-12-23T00:00:02.183-08:00 7520977794441 0x000a ABC.Laptop. Gui Process Information GDIObjects: 2078, USERHandles: 5826",
        [1] "GDIObjects: 2078, USERHandles: 5826"
    ]
}
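
Note that message ends up as an array because grok, unlike dissect, appends to an existing field instead of replacing it. If you only want the parsed tail of the line, grok's overwrite option replaces the original value:

```
filter {
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:timestamp} %{NUMBER:relativeTime} %{WORD:thread} %{HOSTNAME:processName} (?<sourceName>\w+\s\w+) %{WORD:logType} %{GREEDYDATA:message}"
    }
    # replace the original event text with the captured remainder
    overwrite => [ "message" ]
  }
}
```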

Hope this helps as well.

Paul.

Perfect, exactly what I needed. Thanks a lot for the explanation. I should have posted this two days ago while I was struggling to figure it out. Do you recommend Dissect or Grok? My log lines can also be multiline. I also came up with a solution using CSV with a tab separator; that worked too, but I don't think it handles multiline.

I prefer dissect; I find it easier to read in the long run. I do not know whether it is faster than grok, but I like to believe it is :slight_smile:

In regard to multiline: I noticed you send your events through Filebeat. You might want to do the multiline handling there; it is much easier to configure, as the events are still in order as they pass through Filebeat anyway.

Have a look here for multiline examples
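
As an example, a timestamp-anchored multiline config in filebeat.yml might look like this (the path is just a placeholder):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/tool/*.log   # placeholder path
    # any line that does not start with a date is appended to the previous line
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'
    multiline.negate: true
    multiline.match: after
```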

Regards,
Paul.

Thanks, I changed to Dissect and configured Filebeat. By the way, I did get an error when I tried the grok parser (ELK 7.9.1), although it worked fine in the Grok Debugger. It doesn't matter since I am not using it :). Just FYI.

Hi Paul

I actually changed to use the csv processor with \t as the separator. This works great but fails when the message portion contains a newline character. I added the following to filebeat.yml, but it hasn't helped. Log lines start with a timestamp like 2020-12-29T08:25:01.971....

Any thoughts?

filebeat.yml
multiline.type: pattern
multiline.pattern: '^20'
multiline.match: after
multiline.negate: true

pipeline def

  "pipeline_tab" : {
    "description" : "tab pattern",
    "processors" : [
      {
        "csv" : {
          "field" : "message",
          "target_fields" : [
            "timestamp",
            "relativeTime",
            "thread",
            "processName",
            "sourceName",
            "logType",
            "logMessage"
          ],
          "separator" : "\t"
        }
      }
    ]
  }
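
For what it is worth, the pipeline can be tried out against a sample line with the Simulate Pipeline API:

```json
POST _ingest/pipeline/pipeline_tab/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "2020-12-29T08:25:01.971-08:00\t69207946792\t0x0017\tTool.\tBrooksRobot\tEntryExit\tExiting RobotCommLib.GetReferenceStatusCommand  after 100 ms"
      }
    }
  ]
}
```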

Hi,

Sorry for the delay. Would it be possible to share a couple of examples? It would be hard to tell otherwise.

Paul.

Here you go

2020-12-29T08:25:01.971-08:00	69207946792	0x0017	Tool.	BrooksRobot	EntryExit	Exiting RobotCommLib.GetReferenceStatusCommand  after 100 ms 
2020-12-29T08:25:02.071-08:00	69208046761	0x002f	Tool.	BrooksRobot	EntryExit	Entering RobotCommLib.GetCurrentThetaRZCommand 
2020-12-29T08:25:02.145-08:00	69208120079	0x0013	Tool.	OptoEvents	Background	OptoMessageReceived Enter 
 RxMsg 008,01004040020A0000,0000000A,08:25:01,
2020-12-29T08:25:02.145-08:00	69208120092	0x0013	Tool.	OptoEvents	Background	SendOptoAcknowledgement Enter 
 AckMsg = 008,01004040020A0000,0000000A,08:25:01,
2020-12-29T08:25:02.145-08:00	69208120124	0x0013	Tool.	OptoEvents	Background	OptoEventAcknowledged: eDigitalPointPushEventID
2020-12-29T08:25:02.145-08:00	69208120159	0x0014	Tool.	OptoEvents	Background	OptoEvent received message: 008,01004040020A0000,0000000A,08:25:01,

Lines 3 and 4 are the multiline ones.

Hi,

Your multiline pattern is not "^20" but "^ ", as the continuation lines start with a space and belong to the previous line.

With your example, this multiline config works for me.

multiline.pattern: '^ '
multiline.negate: false
multiline.match: after

I set it as below

multiline.type: pattern
multiline.pattern: '^20'
multiline.match: after
multiline.negate: true

because I want to treat all lines starting with 20* as log lines. That is why I set multiline.negate: true, meaning any line that does not start with 20 should be considered part of the previous line, and multiline.match: after, meaning those lines are appended after the line starting with 20*. I don't necessarily want to say that any line starting with a blank is a continuation line. If that is the only way to do it, I guess I have no choice. Any idea why the negate option with ^20 wouldn't work?