Bug in decode_json_fields: data loss, type conflict

I think the following behaviour is a bug because it causes some data loss.

Please assume the following log file /tmp/sample_json.log:

{"someDate":"2016-09-28T01:40:26.760+0000", "someNumberAsString": "1475026826760", "someNumber": 1475026826760, "someString": "foobar", "someString2": "2017 is awesome"}

Now take the following filebeat 5.2 configuration:

filebeat.prospectors:
- input_type: log
  paths:
    - /tmp/sample_json.log

processors:
- decode_json_fields:
    fields: ["message"]
    target: "foobar"
    max_depth: 10

output.logstash:
  hosts: ["localhost:5044"]

This is the decoded event in logstash's rubydebug output:

{
       "message" => "{\"someDate\":\"2016-09-28T01:40:26.760+0000\", \"someNumberAsString\": \"1475026826760\", \"someNumber\": 1475026826760, \"someString\": \"foobar\", \"someString2\": \"2017 is awesome\"}",
      "@version" => "1",
    "@timestamp" => "2017-02-03T08:53:57.579Z",
          "beat" => {
            "name" => "fabien1",
        "hostname" => "fabien1",
         "version" => "5.2.0"
    },
        "source" => "/tmp/sample_json.log",
        "offset" => 170,
        "foobar" => {
                  "someDate" => 2016,
                "someNumber" => 1475026826760,
        "someNumberAsString" => 1475026826760,
                "someString" => "foobar",
               "someString2" => 2017
    },
          "type" => "log",
    "input_type" => "log",
          "host" => "fabien1",
          "tags" => [
        [0] "beats_input_codec_plain_applied"
    ]
}

The problem here is that we lose the fact that someString2 was not a number but a full string:

  • if for other records the field was "What an awesome year!", then we would have a type conflict: sometimes an int, sometimes a string
  • the same goes for someDate, which becomes 2016, with a lot of data loss

In conclusion, this parsing breaks the schema of the data: only JSON fields that were numbers should be converted to numbers. A likely mechanism for the truncation is sketched below.
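
This truncation pattern matches Go's streaming JSON decoder, which reads the first JSON value from its input and silently ignores whatever follows. A minimal Go sketch (a plausible reconstruction of the conversion step, not necessarily the actual beats code) reproduces the exact values above:

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// tryConvert mimics a prefix-based string-to-number conversion:
// json.Decoder.Decode consumes only the first JSON value in the
// stream and ignores any trailing bytes.
func tryConvert(s string) interface{} {
	var n json.Number
	dec := json.NewDecoder(strings.NewReader(s))
	if err := dec.Decode(&n); err == nil {
		return n // "conversion" succeeded; the rest of the string is lost
	}
	return s // not number-like at all; keep the original string
}

func main() {
	for _, s := range []string{
		"2016-09-28T01:40:26.760+0000", // someDate           -> 2016
		"1475026826760",                // someNumberAsString -> 1475026826760
		"foobar",                       // someString         -> "foobar"
		"2017 is awesome",              // someString2        -> 2017
	} {
		fmt.Printf("%-30q => %v\n", s, tryConvert(s))
	}
}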

I agree that changing this behaviour might break compatibility for people who assume they get integers whenever possible, but the default behaviour is unusable in most cases.

@fld That definitely doesn't look correct. If you have just "2017" it should convert it to an integer, but not if you have "2017 hello world". I didn't test this yet on my side, but could you open a GitHub issue for that?

Done -> https://github.com/elastic/beats/issues/3534

I am actually not sure about this:

If you have just "2017" it should convert it to an integer

Indeed, JSON makes a distinction between strings and numbers, and filebeat should not try to do any further type conversion.

For example, one may have the following records:

{"date": "20170206T141514", "message": "foo"}
{"date": "20170206T141516", "message": "2017"}
{"date": "20170206T141516", "message": "bar"}

and converting the second record's message to an integer will cause type inconsistencies (and, for example, make ES complain when indexing them); see the sketch below.
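
As a concrete illustration: a plain json.Unmarshal in Go keeps "2017" a string, because the JSON grammar itself carries the type. A minimal sketch using the records above:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	records := []string{
		`{"date": "20170206T141514", "message": "foo"}`,
		`{"date": "20170206T141516", "message": "2017"}`,
		`{"date": "20170206T141516", "message": "bar"}`,
	}
	for _, rec := range records {
		var doc map[string]interface{}
		if err := json.Unmarshal([]byte(rec), &doc); err != nil {
			panic(err)
		}
		// "2017" stays a string across all three records, so the
		// schema remains consistent without any extra conversion.
		fmt.Printf("message=%q (%T)\n", doc["message"], doc["message"])
	}
}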

That is an interesting case. We actually implemented some auto detection to improve the way the data is stored, but looking at your example, this could also cause trouble. Perhaps we should have a config option to disable it?

IMHO, filebeat is "just" a tool to pump some files and push them over the network to XXX (in our case kafka, before being consumed by other services, including Logstash to push to ES, but it can also push directly to ES), and that tool should not alter the data beyond the JSON parsing it is configured to do. Its mission is "read one json object per line and push it with some metadata to XXX". With that vision, having it change data types for memory optimizations is unexpected.

And when it actually causes trouble (as in the example above), then yes, there should be a way to disable such optimizations. The problem is making that option intelligible to people who don't have this context; something like "convert strings to numbers when possible" should do the trick.
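
For illustration only, such an opt-out could look like the snippet below; convert_strings_to_numbers is a hypothetical option name, not something filebeat 5.2 actually supports:

processors:
- decode_json_fields:
    fields: ["message"]
    target: "foobar"
    max_depth: 10
    # Hypothetical flag (does not exist in filebeat 5.2): keep JSON
    # string values as strings instead of auto-converting them to numbers.
    convert_strings_to_numbers: false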
