JSON filter wipes out existing objects in memory

I'd like to share this behaviour to find out whether it is expected or desired.

Problem:

  • A field exists in memory (e.g. log.syslog.hostname: ZEUS1)
  • A JSON message arrives that includes a sub-field of the existing object (e.g. log.level: INFO)
  • The previously existing field disappears and is not present in the output

Expected behaviour:

Both fields (the existing one and the new one) are present in the output.

To reproduce:

Logstash 8.3.3

input {
  generator {
    message => '{ "log": { "level": "INFO" } }'
    count => 1
  }
}

filter {

  mutate { add_field => { "[log][syslog][hostname]" => "ZEUS1" } }

  json {
    source => "message"
    remove_field => [ "message" ]
  }

}

output {
  stdout {}
}

Output:

{
         "event" => {
        "original" => "{ \"log\": { \"level\": \"INFO\" } }",
        "sequence" => 0
    },
    "@timestamp" => 2022-08-13T06:54:52.835787Z,
           "log" => {
        "level" => "INFO"
    },
          "host" => {
        "name" => "local"
    },
      "@version" => "1"
}

This is especially impactful when the source data is not under one's control and can change over time.
A way to mitigate this is to unpack the JSON message under a target and rename trusted fields back to the root, but this is unsustainable when dealing with large and changing schemas.
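
For reference, this is roughly what that mitigation looks like (the json_tmp field name and the single rename are just illustrative):

filter {

  mutate { add_field => { "[log][syslog][hostname]" => "ZEUS1" } }

  # unpack under a temporary target so the existing [log] object is untouched
  json {
    source => "message"
    target => "json_tmp"
    remove_field => [ "message" ]
  }

  # move only the trusted keys back to the root, then drop the temporary object
  mutate {
    rename => { "[json_tmp][log][level]" => "[log][level]" }
    remove_field => [ "json_tmp" ]
  }

}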

Should this behaviour change?

Thanks

This is expected and documented.

In the documentation for the json filter, there is this note on the target option.

if the target field already exists, it will be overwritten!

Since you didn't set a target field, the JSON will be expanded at the root of the document, which is also documented.

If this setting is omitted, the JSON data will be stored at the root (top level) of the event.

So, if you create a json object named log and the JSON message that you parse also has an object named log, it will overwrite the existing object.

To solve issues like this, parse your message before adding any fields to the document; if you put your mutate filter after your json filter, it will work.

I would say that the first thing you should do in the filter block of any logstash pipeline is to parse the source message.
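
Applied to the pipeline from the first post, that would look roughly like this:

filter {

  # parse the source message first, before adding any fields of our own
  json {
    source => "message"
    remove_field => [ "message" ]
  }

  # adding a nested field here does not clobber the parsed [log] object
  mutate { add_field => { "[log][syslog][hostname]" => "ZEUS1" } }

}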

Thank you for your answer but I beg to differ.

The fact that a behaviour is documented does not necessarily make it right, in my opinion.

The behaviour that the JSON filter exhibits goes against the general Logstash behaviour.

For example, the mutate filter, in a situation like this one:

mutate { add_field => { "[log][level]" => "INFO" } }
mutate { add_field => { "[log][syslog][hostname]" => "ZEUS1" } }

won't overwrite any field. Both will be present at the output.

The same goes for the grok filter:

input {
  generator {
    message => 'INFO'
    count => 1
  }
}

filter {

  mutate { add_field => { "[log][syslog][hostname]" => "ZEUS1" } }

  grok { match => { "message" => [ "%{LOGLEVEL:[log][level]}" ] } }

}

output {
  stdout {}
}

No overwrite:

       "log" => {
         "level" => "INFO",
        "syslog" => {
            "hostname" => "ZEUS1"
        }

and I'm sure there are other examples.

This overwrite "feature" implies that we need to know beforehand the fields that will come over JSON. This filter does not provide an exclude_keys option, like the kv filter does, that could be used to protect the root of the event from harmful keys.
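
For comparison, this is roughly how the kv filter lets you shield existing fields (the excluded key names here are just illustrative):

filter {
  kv {
    source => "message"
    # keys listed here are never written to the event,
    # so they cannot clobber fields we already manage
    exclude_keys => [ "log", "host" ]
  }
}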

You are right, there is the target option. But that also implies that we know the present and future keys that our developers will throw at us, and that we put in place (and maintain) a number of renames to move those keys to the root of the document.

And in real-life scenarios, when the configuration gets big, it is not always an option to put our JSON logic at the top.

I know that this has been like this for ages and that many of us have been working around the issue, but when you realise that in my example log.level and log.syslog.hostname are two different fields, it is hard to justify that this filter is doing the right thing by silently deleting data.

Your examples are adding nested fields to a top-level field; if you try to do any operation on the top-level field itself, you may get errors or no change at all.

For example, if you try this:

  mutate { add_field => { "[log][syslog][hostname]" => "ZEUS1" } }
  grok { match => { "message" => [ "%{LOGLEVEL:[log]}" ] } }

You will see that your grok filter will do nothing, because the top-level field log already exists and is an object.

But if you try this:

  grok { match => { "message" => [ "%{LOGLEVEL:[log]}" ] } }
  mutate { add_field => { "[log][syslog][hostname]" => "ZEUS1" } }

You will get a mutate error: mutate cannot add the field because the log field already exists and is not an object.

Suppose now that you are adding this field in your pipeline:

  mutate { add_field => { "[log][syslog][hostname]" => "ZEUS1" } }

And further down in your pipeline you have the following json filter:

json {
    source => "message"
}

And for some reason your message is { "log": "some message" }. What should Logstash do in this case? You already have the top-level field log as a json object, but the json filter wants to unpack another field named log, this time as a string.

If the json filter tried to merge the fields instead of overwriting, you would get an error, since you can't have the top-level field as a json object and also as a string.
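
A small reproduction of that clash, reusing the generator setup from the first post:

input {
  generator {
    message => '{ "log": "some message" }'
    count => 1
  }
}

filter {

  mutate { add_field => { "[log][syslog][hostname]" => "ZEUS1" } }

  # [log] already exists as an object, but the parsed JSON wants it to be a string;
  # with the current behaviour the string simply replaces the object
  json {
    source => "message"
    remove_field => [ "message" ]
  }

}

output {
  stdout {}
}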

Now suppose that you are adding this field:

  mutate { add_field => { "[log][severity]" => 1 } }

This would make log.severity a numeric field, which would be mapped as a numeric type in Elasticsearch.

If the json filter merged the fields and you got a message with { "log": { "severity": "low" } }, you would end up with this in Logstash:

log.severity: [1 , "low"]

But this is not supported in Elasticsearch, since log.severity has a numeric type, so the entire document would be rejected at indexing time.
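
In pipeline terms the clash looks like this (the merged result in the comment is hypothetical, since the json filter does not merge today):

filter {

  # creates a numeric log.severity
  mutate { add_field => { "[log][severity]" => 1 } }

  # incoming message: { "log": { "severity": "low" } }
  # a hypothetical merge would give log.severity => [1, "low"], which a
  # numeric-typed field in Elasticsearch would reject; the current behaviour
  # replaces [log] entirely, so log.severity ends up as "low"
  json {
    source => "message"
  }

}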

You are thinking of scenarios where both the existing field and the json source message are objects, but you need to consider cases where the types conflict.

The main issue is: what should Logstash do if the same top-level field name arrives with different types? Which field should prevail, the existing one or the new one?

For me, this alone justifies the json filter overwriting the target field.

