Parsing mixed plain-text and JSON log lines in Logstash

Hello,
I have log lines consisting of two parts - plain text and JSON. Example line below:

Jun 13 07:58:00 c4e-gen1 c4edlog[555007]: {"level":"info","commit":"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D","time":"2022-06-13T07:58:00Z","message":"commit synced"}

I need to parse the following information from the plain-text part:

hostname - c4e-gen1
process name - c4edlog
pid - 555007
date field - can be skipped

From the json part I need to parse all fields.

I split the message into two parts - plain_prefix and json_segment - using grok, and I have all the fields from the JSON captured as desired, but I don't know how to extract the data from plain_prefix.

filter {
	grok {
		match => {
			"message" => [ "(?<plain_prefix>^.*?) (?<json_segment>{.*$)" ]
		}
	}


	json {
		source => "json_segment"
	}


	mutate {
		remove_field => [ "json_segment" ]
	}

}

The result is:

{
           "level" => "info",
       "@metadata" => {
        "input" => {
            "http" => {
                "request" => {
                    "headers" => {
                         "content_length" => "390",
                        "http_user_agent" => "PostmanRuntime/7.29.0",
                            "http_accept" => "*/*",
                           "content_type" => "text/plain",
                         "request_method" => "PUT",
                           "request_path" => "/",
                              "http_host" => "localhost:8080",
                        "accept_encoding" => "gzip, deflate, br",
                           "http_version" => "HTTP/1.1",
                          "postman_token" => "4a6d8105-d613-4450-979c-20b50740acd8",
                             "connection" => "keep-alive"
                    }
                }
            }
        }
    },
          "commit" => "436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D",
         "message" => "commit synced",
             "url" => {
          "path" => "/",
          "port" => 8080,
        "domain" => "localhost"
    },
      "@timestamp" => 2022-06-13T11:41:05.022105Z,
        "@version" => "1",
            "host" => {
        "ip" => "0:0:0:0:0:0:0:1"
    },
    "plain_prefix" => "Jun 13 07:58:00 c4e-gen1 c4edlog[555007]:",
            "http" => {
        "request" => {
            "mime_type" => "text/plain",
                 "body" => {
                "bytes" => "390"
            }
        },
         "method" => "PUT",
        "version" => "HTTP/1.1"
    },
            "time" => "2022-06-13T07:58:00Z",
           "event" => {
        "original" => "Jun 13 07:58:00 c4e-gen1 c4edlog[555007]: {\"level\":\"info\",\"commit\":\"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D\",\"time\":\"2022-06-13T07:58:00Z\",\"message\":\"commit synced\"}"
    },
      "user_agent" => {
        "original" => "PostmanRuntime/7.29.0"
    }
}

I didn't find an example in existing topics that I could adapt to my case.
I would be grateful for suggestions.

Hi,

You should use grok!

Try this one :wink:

%{MONTH}%{SPACE}%{MONTHDAY}%{SPACE}%{TIME}%{SPACE}(?<hostname>(\w+\S+))%{SPACE}(?<process>(\w+\S+))(\[)(?<pid>([0-9]+))(\])(\:)%{SPACE}(?<json_message>(.+))

You could also use the Split filter plugin | Logstash Reference [8.2] | Elastic

Split by spaces, use the different array indices of the result, and put them into other fields.

e.g

split plain_prefix by spaces and use plain_prefix[3] for hostname.

There are a few ways of doing this.
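For example, the mutate filter's split option (as opposed to the split filter plugin, which emits one event per part) turns the field into an array on the same event - a sketch, untested against this exact log:

mutate {
	# turns "Jun 13 07:58:00 c4e-gen1 c4edlog[555007]:" into an array,
	# so [plain_prefix][3] is "c4e-gen1" and [plain_prefix][4] is "c4edlog[555007]:"
	split => { "plain_prefix" => " " }
}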


Hello Grumo35!

Thank you for the fast response :slight_smile:
I'm trying to follow your suggestion with the split filter. Here's what I have:

Filter config:

input {

	http {
	}

}



filter {
	grok {
		match => {
			"message" => [ "(?<plain_prefix>^.*?) (?<json_segment>{.*$)" ]
		}
	}
	
	split {
		field => "plain_prefix"
		terminator => " "
	}


	json {
		source => "json_segment"
	}


	mutate {
		remove_field => [ "json_segment" ]
	}

}

output {
	stdout { codec => rubydebug { metadata => true } }

}

With this config, each split result is a complete copy of the event, with only the current split section of the given field changed.
So I have 5 copies of the event with these plain_prefix values:

  • Jun,
  • 13,
  • 07:58:00,
  • c4e-gen1,
  • and c4edlog[555007]

Now I want to:

  1. register these values as new fields: hostname and process name with pid,
  2. remove redundant copies of event.

I'm not sure if my workflow scenario is good - maybe it's not necessary to remove the redundant copies of the event - but it looks a bit messy in the console output.

The first three fields with date and time can be omitted because the same values are already contained in json_segment.

To achieve this I tried the add_field option, but I don't know how to reference these values properly - e.g. plain_prefix[3] for hostname.

My config:

[...]

filter {
	grok {
		match => {
			"message" => [ "(?<plain_prefix>^.*?) (?<json_segment>{.*$)" ]
		}
	}
	
	split {
		add_field => { "hostname1" => "[plain_prefix][3]" }
		add_field => { "process_pid" => "[plain_prefix][4]" }
		terminator => " "
	}


	json {
		source => "json_segment"
	}


	mutate {
		remove_field => [ "json_segment" ]
	}

}

... and as a result I get the literal string "[plain_prefix][3]" under the hostname1 field instead of the proper values :frowning:

Sorry for such basic questions, but I'm very new to ELK - a week ago I didn't even know what it was.

Perhaps you want

split {
    field => "plain_prefix"
    add_field => {
        "hostname1" => "%{[plain_prefix][3]}"
        "process_pid" => "%{[plain_prefix][4]}"
    }
    terminator => " "
}

That is called a sprintf reference.


Hello Badger!

No, pasting this code gives me these literal strings in the result:

[...]
 "process_pid" => "%{[plain_prefix][4]}",
         "message" => "commit synced",
             "url" => {
          "path" => "/",
          "port" => 8080,
        "domain" => "localhost"
    },
      "@timestamp" => 2022-06-14T06:14:02.515077Z,
       "hostname1" => "%{[plain_prefix][3]}",
    "plain_prefix" => "07:58:00",
            "host" => {
        "ip" => "0:0:0:0:0:0:0:1"
    },
[...]

instead of their values :frowning:

What's the raw JSON of the document you obtained with this config?

You have to check the correct values inside the document; try opening it in a JSON parser for more clarity.

No, it's not JSON. It's the plain-text part of the message from post #1 - the plain_prefix field.
I tried to follow your instruction from post #3:

split plain_prefix by spaces and use plain_prefix[3] for hostname.

... and split this plain_prefix part into multiple parts with the split filter and extract the values into new fields using the add_field option.

How to extract these values after splitting?

I understand the plain_prefix part, but in order to use sprintf references you need to look at the "document" (JSON) structure inside Elasticsearch or in the logstash debug output.

If the split is well configured, you'll see an array of plain_prefix values.

So I have the split filter misconfigured - because instead of one array with 5 values I get 5 copies of the event, each with a "plain_prefix" field holding a subsequent value.

Example event:
Jun 13 07:58:00 c4e-gen1 c4edlog[555007]: {"level":"info","commit":"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D","time":"2022-06-13T07:58:00Z","message":"commit synced"}

Config:

input {
	http {
	}
}


filter {
	grok {
		match => {
			"message" => [ "(?<plain_prefix>^.*?) (?<json_segment>{.*$)" ]
		}
	}

	split {
		field => "plain_prefix"
		terminator => " "
	}

	json {
		source => "json_segment"
	}

	mutate {
		remove_field => [ "json_segment" ]
	}
}

output {
	stdout { codec => rubydebug }
}

And logstash debug result:

{
            "host" => {
        "ip" => "0:0:0:0:0:0:0:1"
    },
            "http" => {
        "request" => {
            "mime_type" => "text/plain",
                 "body" => {
                "bytes" => "390"
            }
        },
         "method" => "PUT",
        "version" => "HTTP/1.1"
    },
        "@version" => "1",
          "commit" => "436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D",
           "level" => "info",
      "@timestamp" => 2022-06-14T08:35:11.341181Z,
    "plain_prefix" => "Jun",
           "event" => {
        "original" => "Jun 13 07:58:00 c4e-gen1 c4edlog[555007]: {\"level\":\"info\",\"commit\":\"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D\",\"time\":\"2022-06-13T07:58:00Z\",\"message\":\"commit synced\"}"
    },
             "url" => {
          "port" => 8080,
          "path" => "/",
        "domain" => "localhost"
    },
         "message" => "commit synced",
            "time" => "2022-06-13T07:58:00Z",
      "user_agent" => {
        "original" => "PostmanRuntime/7.29.0"
    }
}
{
            "host" => {
        "ip" => "0:0:0:0:0:0:0:1"
    },
            "http" => {
        "request" => {
            "mime_type" => "text/plain",
                 "body" => {
                "bytes" => "390"
            }
        },
         "method" => "PUT",
        "version" => "HTTP/1.1"
    },
        "@version" => "1",
          "commit" => "436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D",
           "level" => "info",
      "@timestamp" => 2022-06-14T08:35:11.341181Z,
    "plain_prefix" => "13",
           "event" => {
        "original" => "Jun 13 07:58:00 c4e-gen1 c4edlog[555007]: {\"level\":\"info\",\"commit\":\"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D\",\"time\":\"2022-06-13T07:58:00Z\",\"message\":\"commit synced\"}"
    },
             "url" => {
          "port" => 8080,
          "path" => "/",
        "domain" => "localhost"
    },
         "message" => "commit synced",
            "time" => "2022-06-13T07:58:00Z",
      "user_agent" => {
        "original" => "PostmanRuntime/7.29.0"
    }
}
{
            "host" => {
        "ip" => "0:0:0:0:0:0:0:1"
    },
            "http" => {
        "request" => {
            "mime_type" => "text/plain",
                 "body" => {
                "bytes" => "390"
            }
        },
         "method" => "PUT",
        "version" => "HTTP/1.1"
    },
        "@version" => "1",
          "commit" => "436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D",
           "level" => "info",
      "@timestamp" => 2022-06-14T08:35:11.341181Z,
    "plain_prefix" => "07:58:00",
           "event" => {
        "original" => "Jun 13 07:58:00 c4e-gen1 c4edlog[555007]: {\"level\":\"info\",\"commit\":\"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D\",\"time\":\"2022-06-13T07:58:00Z\",\"message\":\"commit synced\"}"
    },
             "url" => {
          "port" => 8080,
          "path" => "/",
        "domain" => "localhost"
    },
         "message" => "commit synced",
            "time" => "2022-06-13T07:58:00Z",
      "user_agent" => {
        "original" => "PostmanRuntime/7.29.0"
    }
}
{
            "host" => {
        "ip" => "0:0:0:0:0:0:0:1"
    },
            "http" => {
        "request" => {
            "mime_type" => "text/plain",
                 "body" => {
                "bytes" => "390"
            }
        },
         "method" => "PUT",
        "version" => "HTTP/1.1"
    },
        "@version" => "1",
          "commit" => "436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D",
           "level" => "info",
      "@timestamp" => 2022-06-14T08:35:11.341181Z,
    "plain_prefix" => "c4e-gen1",
           "event" => {
        "original" => "Jun 13 07:58:00 c4e-gen1 c4edlog[555007]: {\"level\":\"info\",\"commit\":\"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D\",\"time\":\"2022-06-13T07:58:00Z\",\"message\":\"commit synced\"}"
    },
             "url" => {
          "port" => 8080,
          "path" => "/",
        "domain" => "localhost"
    },
         "message" => "commit synced",
            "time" => "2022-06-13T07:58:00Z",
      "user_agent" => {
        "original" => "PostmanRuntime/7.29.0"
    }
}
{
            "host" => {
        "ip" => "0:0:0:0:0:0:0:1"
    },
            "http" => {
        "request" => {
            "mime_type" => "text/plain",
                 "body" => {
                "bytes" => "390"
            }
        },
         "method" => "PUT",
        "version" => "HTTP/1.1"
    },
        "@version" => "1",
          "commit" => "436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D",
           "level" => "info",
      "@timestamp" => 2022-06-14T08:35:11.341181Z,
    "plain_prefix" => "c4edlog[555007]:",
           "event" => {
        "original" => "Jun 13 07:58:00 c4e-gen1 c4edlog[555007]: {\"level\":\"info\",\"commit\":\"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D\",\"time\":\"2022-06-13T07:58:00Z\",\"message\":\"commit synced\"}"
    },
             "url" => {
          "port" => 8080,
          "path" => "/",
        "domain" => "localhost"
    },
         "message" => "commit synced",
            "time" => "2022-06-13T07:58:00Z",
      "user_agent" => {
        "original" => "PostmanRuntime/7.29.0"
    }
}

I also tried the second approach - with the grok pattern from post #2.

First grok the plain-text part, and then parse the JSON:

input {
	http {
	}
}


filter {

	grok {
		match => {
			"message" => "%{MONTH}%{SPACE}%{MONTHDAY}%{SPACE}%{TIME}%{SPACE}(?<hostname>(\w+\S+))%{SPACE}(?<process>(\w+\S+))(\[)(?<pid>([0-9]+))(\])(\:)%{SPACE}(?<json_message>(.+))}"
		}
	}		



	json {
		source => "json_message"
	}

}

output {
	stdout { codec => rubydebug }

}

but with this config, parsing of the JSON part does not work.
Result:

{
            "http" => {
        "request" => {
            "mime_type" => "text/plain",
                 "body" => {
                "bytes" => "390"
            }
        },
         "method" => "PUT",
        "version" => "HTTP/1.1"
    },
             "pid" => "555007",
         "process" => "c4edlog",
    "json_message" => "{\"level\":\"info\",\"commit\":\"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D\",\"time\":\"2022-06-13T07:58:00Z\",\"message\":\"commit synced\"",
             "url" => {
          "path" => "/",
        "domain" => "localhost",
          "port" => 8080
    },
        "@version" => "1",
         "message" => "Jun 13 07:58:00 c4e-gen1 c4edlog[555007]: {\"level\":\"info\",\"commit\":\"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D\",\"time\":\"2022-06-13T07:58:00Z\",\"message\":\"commit synced\"}",
            "host" => {
        "ip" => "0:0:0:0:0:0:0:1"
    },
        "hostname" => "c4e-gen1",
      "@timestamp" => 2022-06-14T10:03:46.429197Z,
           "event" => {
        "original" => "Jun 13 07:58:00 c4e-gen1 c4edlog[555007]: {\"level\":\"info\",\"commit\":\"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D\",\"time\":\"2022-06-13T07:58:00Z\",\"message\":\"commit synced\"}"
    },
      "user_agent" => {
        "original" => "PostmanRuntime/7.29.0"
    },
            "tags" => [
        [0] "_jsonparsefailure"
    ]
}


I see that the reason for the _jsonparsefailure is the missing trailing curly bracket at the end of json_message:

"json_message" => "{\"level\":\"info\",\"commit\":\"436F6D6D697449447B5B3135362031303720362036362032353120323331203133302035322032382038382036352038322035302031313820313330203133392031303420353720323136203234362037302032313920313835203233302031393120323137203135382031313220313137203137302036392031355D3A34353244417D\",\"time\":\"2022-06-13T07:58:00Z\",\"message\":\"commit synced\"",

It should be:
...3244417D\",\"time\":\"2022-06-13T07:58:00Z\",\"message\":\"commit synced\"}",

How do I modify the grok pattern to include the last character - } ?

Hey buddy!

You can try this in Kibana with your sample logs using "Dev Tools" > "Grok Debugger".

I edited the pattern to include the last curly bracket - a little mishap on my part :wink:

%{MONTH}%{SPACE}%{MONTHDAY}%{SPACE}%{TIME}%{SPACE}(?<hostname>(\w+\S+))%{SPACE}(?<process>(\w+\S+))(\[)(?<pid>([0-9]+))(\])(\:)%{SPACE}(?<json_message>(.+))
mutate {
	split => { "plain_prefix" => " " }
	# if you want to add the value using " " separator
	add_field => { "hostname" => "%{[plain_prefix][3]}" }
}


You can also parse the whole log without grok this way.

If you want to isolate the PID, you'll have to gsub both brackets to spaces in order to use the split again.
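That gsub step might look something like this - a sketch, assuming the process_pid field from your earlier attempt holds something like "c4edlog[555007]:" (process_name is just an illustrative field name):

mutate {
	# replace "[", "]" and ":" with spaces so the field can be split on spaces again
	gsub => [ "process_pid", "[\[\]:]", " " ]
}
mutate {
	split => { "process_pid" => " " }
	add_field => {
		"process_name" => "%{[process_pid][0]}"
		"pid"          => "%{[process_pid][1]}"
	}
}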

Do you get me?

Be careful of the order of execution of mutate plugins :wink:

You should be good to go
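For reference, the two pieces combined - the corrected grok pattern followed by the json filter - might look like this (a sketch assembled from the snippets in this thread, not a tested config):

filter {
	grok {
		match => {
			# json_message now captures through the final "}" because there is
			# no literal "}" after the capture group
			"message" => "%{MONTH}%{SPACE}%{MONTHDAY}%{SPACE}%{TIME}%{SPACE}(?<hostname>(\w+\S+))%{SPACE}(?<process>(\w+\S+))(\[)(?<pid>([0-9]+))(\])(\:)%{SPACE}(?<json_message>(.+))"
		}
	}

	json {
		source => "json_message"
		# drop the raw JSON string once it has been parsed
		remove_field => [ "json_message" ]
	}
}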

Hello!
With the corrected grok pattern it works like a charm :). Thank you!
Anyway - I'll try the split filter over the weekend.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.