Regex to match a specific key value pair in Logstash

sandeepkanabar · May 28, 2019, 6:52pm

I'm looking to remove specific key=value pairs that are inside a string.

Say the input event is as follows:

{
	"ABC": "10119707",
	"Request_StartTime": "1558952196175",
	"Severity": "INFO",
	"UUID": "481e8cfa-399c-4996-a4d3-7e9b7ec866fa",
	"Src_LogMsg": "type=abc, user=abc.def, vid=1111, api=fooapi, email=abc.def@gmail.com, cat=1",
	"@version": "1",
	"@timestamp": "2019-05-27T10:16:36.180Z",
	"Src_Host": "Hostname",
	"Request_IpAddress": "1.1.1.1"
}

I want to remove the key=value pair user=abc.def from the Src_LogMsg string. The following works:

filter {
  if [Src_LogMsg] =~ /.+/ {
    mutate {
      gsub => ["Src_LogMsg", "(user=(.+?)\s)", ""]
    }
  }
}

But if user=abc.def is at the end of Src_LogMsg rather than in the middle, the above doesn't work. Please see the screenshots below:

Here user=abc.def is in the middle with cat=1 being the last k=v pair

Here user=abc.def is at the end of Src_LogMsg string. It's not removed.
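
The difference between the two cases can be reproduced outside Logstash. The following is only a sketch using Python's re module (an assumption: Logstash uses a Ruby regex engine, though this particular pattern should behave the same way in both). It suggests the cause is the trailing \s, which needs a whitespace character after the value, and there is none when the pair comes last.

```python
import re

# Same pattern as in the filter above; \s requires whitespace after the value.
PATTERN = re.compile(r"(user=(.+?)\s)")

middle = "type=abc, user=abc.def, vid=1111, cat=1"
end = "type=abc, vid=1111, cat=1, user=abc.def"

removed_middle = PATTERN.sub("", middle)
removed_end = PATTERN.sub("", end)

print(removed_middle)  # type=abc, vid=1111, cat=1
print(removed_end)     # unchanged: nothing follows "abc.def", so \s cannot match
```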

Test string from which user=abc.def is successfully removed:

{"ABC": "10119707", "Request_StartTime": "1558952196175", "Severity": "INFO", "UUID": "481e8cfa-399c-4996-a4d3-7e9b7ec866fa", "Src_LogMsg": "type=abc, user=abc.def, vid=1111, api=fooapi, email=abc.def@gmail.com, cat=1", "@version": "1", "@timestamp": "2019-05-27T10:16:36.180Z", "Src_Host": "Hostname","Request_IpAddress": "1.1.1.1"}

Test string from which user=abc.def is NOT removed:

{"ABC": "10119707", "Request_StartTime": "1558952196175", "Severity": "INFO", "UUID": "481e8cfa-399c-4996-a4d3-7e9b7ec866fa", "Src_LogMsg": "type=abc, vid=1111, api=fooapi, email=abc.def@gmail.com, cat=1, user=abc.def", "@version": "1", "@timestamp": "2019-05-27T10:16:36.180Z", "Src_Host": "Hostname","Request_IpAddress": "1.1.1.1"}

Can someone please help me form the correct regex that will remove the user=abc.def k=v pair irrespective of its location within the Src_LogMsg field?

Logstash.conf:

input {
  stdin {
    codec => json
  }
}

filter {
  if [Src_LogMsg] =~ /.+/ {
    mutate {
      gsub =>  ["Src_LogMsg","(user=(.+?)\s)",""]
    }
  }
}

output {
  stdout { codec => rubydebug { metadata => true } }
}

Logstash Version: 5.5.1

Badger · May 28, 2019, 9:54pm

mutate { gsub => [ "Src_LogMsg", "user=[^,]+(, |$)", "" ] }

sandeepkanabar · May 29, 2019, 7:13pm

Thank you Badger. Very helpful. However, with this, an extra comma and space are left at the end in the second test case. Please see the screenshot below.

Test Case:

{"ABC": "10119707", "Request_StartTime": "1558952196175", "Severity": "INFO", "UUID": "481e8cfa-399c-4996-a4d3-7e9b7ec866fa", "Src_LogMsg": "type=abc, vid=1111, api=fooapi, email=abc.def@gmail.com, cat=1, user=abc.def", "@version": "1", "@timestamp": "2019-05-27T10:16:36.180Z", "Src_Host": "Hostname","Request_IpAddress": "1.1.1.1"}

In all, I need to handle 3 cases:

user=abc.def is at start of Src_LogMsg string
user=abc.def is in middle of Src_LogMsg string
user=abc.def is at end of Src_LogMsg string

Badger · May 29, 2019, 8:14pm

You can use a second regexp to remove the trailing comma and space.

mutate { gsub => [ "Src_LogMsg", "user=[^,]+(, |$)", "", "Src_LogMsg", ", $", "" ] }
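
To see how the two passes cover the start, middle, and end cases, here is a rough equivalent in Python's re module. This is a sketch rather than Logstash itself (Logstash uses a Ruby regex engine), and remove_user is a hypothetical helper name used only for illustration.

```python
import re

def remove_user(msg):
    # Pass 1: drop the pair together with its trailing ", ",
    # or with nothing after it when it sits at the end of the string.
    msg = re.sub(r"user=[^,]+(, |$)", "", msg)
    # Pass 2: drop the dangling ", " that pass 1 leaves when the pair was last.
    return re.sub(r", $", "", msg)

start = remove_user("user=abc.def, type=abc, cat=1")
middle = remove_user("type=abc, user=abc.def, cat=1")
end = remove_user("type=abc, cat=1, user=abc.def")

print(start)   # type=abc, cat=1
print(middle)  # type=abc, cat=1
print(end)     # type=abc, cat=1
```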

Please do not post pictures of text. Just post the text. Thanks!

yaauie · May 29, 2019, 8:39pm

It'd be easier to simply overwrite the value with the text REDACTED or something, instead of doing multiple passes and accounting for all of the edge-cases.

filter {
  mutate {
    gsub => ["Src_LogMsg", "(?<=\buser=)[^,]+", "REDACTED"]
  }
}

The pattern (?<=\buser=)[^,]+ literally means "any string of non-comma characters that is immediately preceded by (a word boundary (\b) followed by the character sequence user=)"
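
A quick way to check the lookbehind and the word boundary is to run the same pattern through Python's re module. This is a sketch only (Logstash uses a Ruby regex engine), and the superuser=root pair is made up for illustration to show what \b excludes.

```python
import re

# Lookbehind: match the value only, not the "user=" prefix itself.
REDACT = re.compile(r"(?<=\buser=)[^,]+")

msg = "type=abc, user=abc.def, vid=1111, superuser=root"
redacted = REDACT.sub("REDACTED", msg)
redacted_end = REDACT.sub("REDACTED", "cat=1, user=abc.def")

print(redacted)      # type=abc, user=REDACTED, vid=1111, superuser=root
print(redacted_end)  # cat=1, user=REDACTED
```

Because only the value is replaced, the separators stay intact and no second pass is needed, wherever the pair sits in the string.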

sandeepkanabar · May 29, 2019, 8:44pm

Excellent suggestion, and thanks for the working example. We did think about that at the start, but it means storing dummy values in ES for billions of records. Since this is not a field, we can't remove it using prune. Thoughts?

sandeepkanabar · May 29, 2019, 8:46pm

Excellent. Thank you Badger. I also want to remove the email=abc.def@gmail.com pair, so I did the following:

 gsub => [ "Src_LogMsg", "(email=[^,]+(, |$))|(user=[^,]+(, |$))", "", "Src_LogMsg", ", $", "" ]

Not sure if this is the most efficient way to do it.

And thank you for the note that "text" is better. Agree.

yaauie · May 29, 2019, 11:34pm

This pattern is a little more succinct (and formatted to show the two passes separately):

gsub => [
  "Src_LogMsg", "(\b(email|user)=[^,]+(, |$))", "",
  "Src_LogMsg", ", $", ""
]
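
To experiment with adding more keys, the same two passes can be sketched in Python's re module. This is an approximation rather than Logstash itself (which uses a Ruby regex engine); strip_pairs and its keys parameter are hypothetical names, not anything Logstash provides.

```python
import re

def strip_pairs(msg, keys):
    # Build the (email|user) alternation from the list of keys to remove.
    alternation = "|".join(re.escape(k) for k in keys)
    # Pass 1: remove each matching pair plus its trailing ", " (or end of string).
    msg = re.sub(rf"\b(?:{alternation})=[^,]+(?:, |$)", "", msg)
    # Pass 2: remove a dangling ", " left when the last pair was removed.
    return re.sub(r", $", "", msg)

msg = "type=abc, user=abc.def, vid=1111, api=fooapi, email=abc.def@gmail.com, cat=1"
cleaned = strip_pairs(msg, ["email", "user"])

print(cleaned)  # type=abc, vid=1111, api=fooapi, cat=1
```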

sandeepkanabar · May 30, 2019, 12:56pm

This is great. Thank you very much! It makes the code much more succinct and easier to read, and more fields can easily be added.

system · June 27, 2019, 12:56pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
