Logstash grok filter unexpected duplication of values


(Michael Green) #1

I have a simple custom grok pattern to "normalise" version numbers from an input to capture just major and minor release version information from a full version field. Some examples to give you an idea:

1.2(3)U2 => 1.2
1.2.3.12056 => 1.2
7.1 => 7.1
7 => 7.0

I have a test configuration that I run from the command line; it reads from stdin, applies the filter, and writes to stdout. This works as expected. The full Logstash config is here:

input {
        stdin { }
}

filter {
        grok {
                match => { "message" => [
                        "(?<version_normalised>^\d+$)",
                        "(?<version_normalised>^\d+\.\d+)" ]
                }
        }
        if [version_normalised] =~ /^\d+$/ {
                mutate {
                        replace => { "version_normalised" => "%{version_normalised}.0" }
                }
        }
}

output {
        stdout {
                codec => rubydebug
        }
}

Example run:
1.2(3)U2
{
            "@timestamp" => 2017-09-14T05:41:26.562Z,
    "version_normalised" => "1.2",
              "@version" => "1",
                  "host" => "ccbu-reporting",
               "message" => "1.2(3)U2"
}
7.1
{
            "@timestamp" => 2017-09-14T05:41:36.944Z,
    "version_normalised" => "7.1",
              "@version" => "1",
                  "host" => "ccbu-reporting",
               "message" => "7.1"
}
7
{
            "@timestamp" => 2017-09-14T05:41:38.412Z,
    "version_normalised" => "7.0",
              "@version" => "1",
                  "host" => "ccbu-reporting",
               "message" => "7"
}

Then I use this working filter in my main Logstash config. The main differences are that the input comes from the jdbc plugin (via a database query) and the output goes to the elasticsearch plugin. The filter section is otherwise the same:

input {
	jdbc {
		# JDBC connection details omitted
		# Each row of results will include a column with name "software_versions"
	}
}
filter {
	grok {
		match => { "software_versions" => [
			"(?<version_normalised>^\d+$)",
			"(?<version_normalised>^\d+\.\d+)" ]
		}
	}
	if [version_normalised] =~ /^\d+$/ {
		mutate {
			update => { "version_normalised" => "%{version_normalised}.0" }
		}
	}
}
output {
	# ElasticSearch output plugin
	elasticsearch {
		hosts => ["http://x.x.x.x:9200"] # IP address omitted
		id => "logstash-evt"
		index => "logstash-evt-%{+xxxx.ww}"
		document_type => "evt"
		document_id => "%{evt_id}"
	}
}

However, within Elasticsearch the "version_normalised" field has the correct value, but it is duplicated in an array. For example, here is the raw JSON snippet of the event (relevant fields only) as seen from Kibana:

"software_versions": "1.2(3)",
"version_normalised": [
  "1.2",
  "1.2"
],

I am not sure why this is happening because the commandline test works as expected. Any suggestions would be appreciated, thank you.


(Christian Dahlqvist) #2

What happens if you switch the order (most specific pattern first) in your list of grok patterns and remove the mutate filter, which should not be needed?


(Michael Green) #3

I reversed the order and commented out the entire 'if' block in the filter as a test, but got the same result :frowning:

FYI the mutate is required to append a ".0" in case just a single number is provided, such as the following:
1 => 1.0
13 => 13.0


(Michael Green) #4

Christian,

For reference, below is what I tried, but the results were the same (apart from ".0" no longer being appended when the input was a single number).

filter  {
        grok {
                match => { "software_versions" => [
                        "(?<version_normalised>^\d+\.\d+)",
                        "(?<version_normalised>^\d+$)" ]
                }
        }
        #if [version_normalised] =~ /^\d+$/ {
        #       mutate {
        #               replace => { "version_normalised" => "%{version_normalised}.0" }
        #       }
        #}
}

I don't understand why the filter works fine when I'm just using stdin/stdout, but as soon as I switch to the jdbc input and elasticsearch output the field becomes an array. The filter itself hasn't changed, so I don't see why its behaviour would change just because the input/output plugins do.

Any suggestions from anyone would be greatly appreciated; I've been at this for a while now. Thanks.


(Christian Dahlqvist) #6

I would recommend outputting the event to stdout with the rubydebug codec and comparing the structure when using the jdbc input against the stdin input.


(Michael Green) #7

I think I have narrowed it down, but I'm still not sure of the cause yet.

I kept adding to the test configuration until it pretty much matched production, and at each step it still worked. When I went back to the original config, after deleting the indices and doing a fresh import, it still worked. However, the problem came back when I added another configuration file (which imports events from a different data source into the same indices). The events from that config never overlap with this one, so I don't understand how it is relevant, but I am looking further into this for now.


(Christian Dahlqvist) #8

Are you using conditionals to fully separate the flows? Logstash will concatenate all available configuration files, so sections not covered by conditionals will apply to all events.
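
For illustration, a minimal sketch of that separation (the type values "old_system" and "new_system" here are hypothetical examples, not from the original configs): each file's input sets its own type, and each file's filter section only runs for events of that type:

input {
	jdbc {
		# ... connection details for this data source ...
		type => "old_system"
	}
}
filter {
	if [type] == "old_system" {
		grok {
			match => { "software_versions" => [
				"(?<version_normalised>^\d+\.\d+)",
				"(?<version_normalised>^\d+$)" ]
			}
		}
	}
}

With both files wrapped this way, concatenating them no longer causes any event to pass through the same grok match twice.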


(Michael Green) #9

Okay, this I didn't realise, and it could explain things. Essentially I have similar events pulled from different data sources (an old system and a new system), and some of the filters (like this version normalisation) apply to both data sources. To try to keep things clean I separated them into 2 different configuration files.

If Logstash combines them into a single config then I guess it would contain duplicate grok blocks, so the match would be run on the same field twice, which could explain the result.

When you say to use conditionals to separate the flows, I'm guessing you mean put the filter in an if block. Can you give me an example of what expression you would check in that if block so it's unique between the 2 configuration files?

Thank you for the insight this seems very likely to be the problem (because I'm testing with the other configuration file in isolation and the problem didn't present itself).


(Christian Dahlqvist) #10

Typically you add a tag or type in each input and then use this to guide the processing flow.
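
As a sketch of the tag approach (the tag name "from_new_system" is a made-up example), each input adds its own tag and the filters test for it:

input {
	jdbc {
		# ... connection details ...
		tags => ["from_new_system"]
	}
}
filter {
	if "from_new_system" in [tags] {
		# filters that should only apply to this data source
	}
}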


(Michael Green) #11

Thanks. This really does sound like the problem.

I'm in the process of testing this out but running into some problems. I was setting the type mapping within the elasticsearch output plugin, so I'm trying to move it to the input plugin so I can use it for the conditional tests within the filter block, but I'm now getting some type mapping errors. Anyway, if I can sort these out I think it should fix the duplication problem, and I will let you know. I didn't realise the type could be set on the input plugin, but I seem to be having some issues changing it there.


(Michael Green) #12

Sorry getting a little off-topic but almost there :slight_smile:

So if I add the following on my input plugin...

type => "mytype"

I have my type mapping set to strict, so I think the error I'm getting is because it is trying to add the type field to the event itself and failing due to the strict mapping. In the Logstash logs I'm getting "strict_dynamic_mapping_exception" due to trying to dynamically add [type] within [mytype].

So I guess I can fix this by adding a field to my template mapping. According to the docs, type is of value type "string", but in the template mapping "string" is no longer used; I can only use "text" or "keyword".
(per: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#plugins-inputs-jdbc-type )

Am I heading in the right direction or missing something here? If the value type is "string", should I be using the text or keyword type? I thought string couldn't be used in recent versions.


(Christian Dahlqvist) #13

You can use a tag, or simply add a field through add_field instead if that makes it easier. If necessary, this can live under @metadata so it does not end up in the stored events.
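
A minimal sketch of the @metadata variant (the field name [@metadata][source] and the value "new_system" are example choices): fields under @metadata are visible to the filter section but are not written to the output by default, so nothing extra lands in Elasticsearch:

input {
	jdbc {
		# ... connection details ...
		add_field => { "[@metadata][source]" => "new_system" }
	}
}
filter {
	if [@metadata][source] == "new_system" {
		# filters for this data source only
	}
}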


(Michael Green) #14

Thank you for your help Christian, you pointed me in the right direction and I was able to resolve the duplication issue using @metadata as you mentioned. The following link, which has a good example, was helpful to me in case it is useful to anyone else in the future:

The type mapping errors I was getting are actually an unrelated issue I just noticed them while troubleshooting this issue and thought they were linked, so I will investigate separately.

Again, thanks for your help!


(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.