Logstash grok filter unexpected duplication of values


(Michael Green) #1

I have a simple custom grok pattern to "normalise" version numbers from an input to capture just major and minor release version information from a full version field. Some examples to give you an idea:

1.2(3)U2 => 1.2
1.2.3.12056 => 1.2
7.1 => 7.1
7 => 7.0

I have a test configuration that I run from the command line; it reads from stdin, applies the filter, and writes to stdout. This works as expected. The full Logstash config is here:

input {
        stdin { }
}

filter {
        grok {
                match => { "message" => [
                        "(?<version_normalised>^\d+$)",
                        "(?<version_normalised>^\d+\.\d+)" ]
                }
        }
        if [version_normalised] =~ /^\d+$/ {
                mutate {
                        replace => { "version_normalised" => "%{version_normalised}.0" }
                }
        }
}

output {
        stdout {
                codec => rubydebug
        }
}

Example run:
1.2(3)U2
{
            "@timestamp" => 2017-09-14T05:41:26.562Z,
    "version_normalised" => "1.2",
              "@version" => "1",
                  "host" => "ccbu-reporting",
               "message" => "1.2(3)U2"
}
7.1
{
            "@timestamp" => 2017-09-14T05:41:36.944Z,
    "version_normalised" => "7.1",
              "@version" => "1",
                  "host" => "ccbu-reporting",
               "message" => "7.1"
}
7
{
            "@timestamp" => 2017-09-14T05:41:38.412Z,
    "version_normalised" => "7.0",
              "@version" => "1",
                  "host" => "ccbu-reporting",
               "message" => "7"
}

Then I use this working filter in my main Logstash config. The main differences are that the input comes from the jdbc plugin (via a database query) and the output goes to the elasticsearch plugin. The filter section is otherwise the same:

input {
	jdbc {
		# JDBC connection details omitted
		# Each row of results will include a column with name "software_versions"
	}
}
filter {
	grok {
		match => { "software_versions" => [
			"(?<version_normalised>^\d+$)",
			"(?<version_normalised>^\d+\.\d+)" ]
		}
	}
	if [version_normalised] =~ /^\d+$/ {
		mutate {
			update => { "version_normalised" => "%{version_normalised}.0" }
		}
	}
}
output {
	# ElasticSearch output plugin
	elasticsearch {
		hosts => ["http://x.x.x.x:9200"] # IP address omitted
		id => "logstash-evt"
		index => "logstash-evt-%{+xxxx.ww}"
		document_type => "evt"
		document_id => "%{evt_id}"
	}
}

However, within Elasticsearch the "version_normalised" field has the correct value, but it is duplicated in an array. For example, here is the raw JSON snippet of the event (relevant fields only) as seen from Kibana:

"software_versions": "1.2(3)",
"version_normalised": [
  "1.2",
  "1.2"
],

I am not sure why this is happening because the commandline test works as expected. Any suggestions would be appreciated, thank you.


(Christian Dahlqvist) #2

What happens if you switch the order (most specific pattern first) in your list of grok patterns and remove the mutate filter, which should not be needed?


(Michael Green) #3

I reversed the order and commented out the entire 'if' block in the filter as a test, but got the same result :frowning:

FYI the mutate is required to append a ".0" in case just a single number is provided, such as the following:
1 => 1.0
13 => 13.0


(Michael Green) #4

Christian,

For reference, below is what I tried, but the results were the same (apart from ".0" no longer being appended when the input was a single number).

filter  {
        grok {
                match => { "software_versions" => [
                        "(?<version_normalised>^\d+\.\d+)",
                        "(?<version_normalised>^\d+$)" ]
                }
        }
        #if [version_normalised] =~ /^\d+$/ {
        #       mutate {
        #               replace => { "version_normalised" => "%{version_normalised}.0" }
        #       }
        #}
}

I don't understand why the filter works fine when I'm just using stdin/stdout, but as soon as I switch to the jdbc input and elasticsearch output the field becomes an array. The filter itself hasn't changed, so I don't see why its behaviour would change just because the input/output plugins do.

Any suggestions from anyone would be greatly appreciated; I've been at this for a while now. Thanks.


(Christian Dahlqvist) #6

I would recommend outputting the event to stdout with the rubydebug codec and comparing the structure when using the jdbc input against the stdin input.


(Michael Green) #7

I think I have narrowed it down, but I'm still not sure of the cause yet.

I kept adding to the test configuration until it pretty much matched production, and at each step it still worked. When I went back to the original config, after deleting the indices and doing a fresh import, it still worked. However, the problem came back when I added another configuration file (which imports events from a different data source into the same indices). The events from that config never overlap with this one, so I don't understand how it is relevant, but I am looking further into this for now.


(Christian Dahlqvist) #8

Are you using conditionals to fully separate the flows? Logstash will concatenate all available configuration files, so sections not covered by conditionals will apply to all events.
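
For illustration, a minimal sketch of that separation (the type values "old_system" and "new_system" here are hypothetical examples, not from the original configs): each file's input sets its own type, and each file's filter section only runs for events of that type:

input {
	jdbc {
		# ... connection details for this data source ...
		type => "old_system"
	}
}
filter {
	if [type] == "old_system" {
		grok {
			match => { "software_versions" => [
				"(?<version_normalised>^\d+\.\d+)",
				"(?<version_normalised>^\d+$)" ]
			}
		}
	}
}

With both files wrapped this way, concatenating them no longer causes any event to pass through the same grok match twice.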


(Michael Green) #9

Okay, this I didn't realise, and it could explain things. Essentially I have similar events pulled from different data sources (an old system and a new system), and some of the filters (like this version normalisation) apply to both data sources. To try to keep things clean I separated them into 2 different configuration files.

If Logstash combines them into a single config then I guess it would contain duplicate grok blocks, so the match would be run on the same field twice, which could explain the result.

When you say to use conditionals to separate the flows, I'm guessing you mean put the filter in an if block. Can you give me an example of what expression you would check in that if block so it's unique between the 2 configuration files?

Thank you for the insight this seems very likely to be the problem (because I'm testing with the other configuration file in isolation and the problem didn't present itself).


(Christian Dahlqvist) #10

Typically you add a tag or type in each input and then use this to guide the processing flow.
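
As a sketch of the tag approach (the tag name "from_new_system" is a made-up example), each input adds its own tag and the filters test for it:

input {
	jdbc {
		# ... connection details ...
		tags => ["from_new_system"]
	}
}
filter {
	if "from_new_system" in [tags] {
		# filters that should only apply to this data source
	}
}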


(Michael Green) #11

Thanks. This really does sound like the problem.

I'm in the process of testing this out but running into some problems. I was setting the type mapping within the elasticsearch output plugin, so I'm trying to move it to the input plugin so I can use it for the conditional tests within the filter block, but I'm now getting some type mapping errors. Anyway, if I can sort these out I think it should fix the duplication problem, and I will let you know. I didn't realise the type could be set on the input plugin, but I seem to be having some issues changing it there.


(Michael Green) #12

Sorry getting a little off-topic but almost there :slight_smile:

So if I add the following on my input plugin...

type => "mytype"

I have my type mapping set to strict, so I think the error I'm getting is because it is trying to add the type field to the event itself and failing due to the strict mapping. In the Logstash logs I'm getting "strict_dynamic_mapping_exception" due to trying to dynamically add [type] within [mytype].

So I guess I can fix this by adding a field to my template mapping. According to the docs, type is of value type "string", but in the template mapping "string" is no longer used; I can only use "text" or "keyword".
(per: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#plugins-inputs-jdbc-type )

Am I heading in the right direction or missing something here? If the value type is "string", should I be using the text or keyword type? I thought string couldn't be used in recent versions.


(Christian Dahlqvist) #13

You can use a tag, or simply add a field through add_field instead if that makes it easier. If necessary, this can live under @metadata so it does not end up in the stored events.
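
A minimal sketch of the @metadata variant (the field name [@metadata][source] and the value "new_system" are example choices): fields under @metadata are visible to the filter section but are not written to the output by default, so nothing extra lands in Elasticsearch:

input {
	jdbc {
		# ... connection details ...
		add_field => { "[@metadata][source]" => "new_system" }
	}
}
filter {
	if [@metadata][source] == "new_system" {
		# filters for this data source only
	}
}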


(Michael Green) #14

Thank you for your help Christian, you pointed me in the right direction and I was able to resolve the duplication issue using @metadata as you mentioned. The following link, which has a good example, was helpful to me in case it is useful to anyone else in the future:

The type mapping errors I was getting are actually an unrelated issue I just noticed them while troubleshooting this issue and thought they were linked, so I will investigate separately.

Again, thanks for your help!


(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.