Bug in decode_json_fields: data loss, type conflict

I think the following behaviour is a bug because it causes some data loss.

Please assume the following log file /tmp/sample_json.log:

{"someDate":"2016-09-28T01:40:26.760+0000", "someNumberAsString": "1475026826760", "someNumber": 1475026826760, "someString": "foobar", "someString2": "2017 is awesome"}

Now take the following filebeat 5.2 configuration:

filebeat.prospectors:
- input_type: log
  paths:
    - /tmp/sample_json.log

processors:
- decode_json_fields:
    fields: ["message"]
    target: "foobar"
    max_depth: 10

output.logstash:
  hosts: ["localhost:5044"]

This is the decoded event in logstash's rubydebug output:

{
       "message" => "{\"someDate\":\"2016-09-28T01:40:26.760+0000\", \"someNumberAsString\": \"1475026826760\", \"someNumber\": 1475026826760, \"someString\": \"foobar\", \"someString2\": \"2017 is awesome\"}",
      "@version" => "1",
    "@timestamp" => "2017-02-03T08:53:57.579Z",
          "beat" => {
            "name" => "fabien1",
        "hostname" => "fabien1",
         "version" => "5.2.0"
    },
        "source" => "/tmp/sample_json.log",
        "offset" => 170,
        "foobar" => {
                  "someDate" => 2016,
                "someNumber" => 1475026826760,
        "someNumberAsString" => 1475026826760,
                "someString" => "foobar",
               "someString2" => 2017
    },
          "type" => "log",
    "input_type" => "log",
          "host" => "fabien1",
          "tags" => [
        [0] "beats_input_codec_plain_applied"
    ]
}

The problem here is that we lose the fact that someString2 was not a number but a full string:

  • if for other records the field was "What an awesome year!", then we would have a type conflict: sometimes an int, sometimes a string
  • the same goes for someDate, which becomes 2016, with a lot of data loss

In conclusion, this parsing breaks the schema of the data: only JSON fields that were numbers should be converted to numbers. A likely mechanism for the truncation is sketched below.
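
This truncation pattern matches Go's streaming JSON decoder, which reads the first JSON value from its input and silently ignores whatever follows. A minimal Go sketch (a plausible reconstruction of the conversion step, not necessarily the actual beats code) reproduces the exact values above:

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// tryConvert mimics a prefix-based string-to-number conversion:
// json.Decoder.Decode consumes only the first JSON value in the
// stream and ignores any trailing bytes.
func tryConvert(s string) interface{} {
	var n json.Number
	dec := json.NewDecoder(strings.NewReader(s))
	if err := dec.Decode(&n); err == nil {
		return n // "conversion" succeeded; the rest of the string is lost
	}
	return s // not number-like at all; keep the original string
}

func main() {
	for _, s := range []string{
		"2016-09-28T01:40:26.760+0000", // someDate           -> 2016
		"1475026826760",                // someNumberAsString -> 1475026826760
		"foobar",                       // someString         -> "foobar"
		"2017 is awesome",              // someString2        -> 2017
	} {
		fmt.Printf("%-30q => %v\n", s, tryConvert(s))
	}
}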

I agree that changing this behaviour might break compatibility for people who assume they get integers whenever possible, but the default behaviour is unusable in most cases.

@fld That definitely doesn't look correct. If you have just "2017" it should convert it to an integer, but not if you have "2017 hello world". I didn't test this yet on my side, but could you open a GitHub issue for that?

Done -> https://github.com/elastic/beats/issues/3534

I am actually not sure about this:

If you have just "2017" it should convert it to an integer

Indeed, JSON makes a distinction between strings and numbers, and filebeat should not try to do any further type conversion.

For example, one may have the following records:

{"date": "20170206T141514", "message": "foo"}
{"date": "20170206T141516", "message": "2017"}
{"date": "20170206T141516", "message": "bar"}

and converting the second record's message to an integer will cause type inconsistencies (and, for example, make ES complain when indexing them); see the sketch below.
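
As a concrete illustration: a plain json.Unmarshal in Go keeps "2017" a string, because the JSON grammar itself carries the type. A minimal sketch using the records above:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	records := []string{
		`{"date": "20170206T141514", "message": "foo"}`,
		`{"date": "20170206T141516", "message": "2017"}`,
		`{"date": "20170206T141516", "message": "bar"}`,
	}
	for _, rec := range records {
		var doc map[string]interface{}
		if err := json.Unmarshal([]byte(rec), &doc); err != nil {
			panic(err)
		}
		// "2017" stays a string across all three records, so the
		// schema remains consistent without any extra conversion.
		fmt.Printf("message=%q (%T)\n", doc["message"], doc["message"])
	}
}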

That is an interesting case. We actually implemented some auto detection to improve the way the data is stored, but looking at your example, this could also cause trouble. Perhaps we should have a config option to disable it?

IMHO, filebeat is "just" a tool to pump some files and push them over the network to XXX (in our case kafka, before being consumed by other services, including Logstash to push to ES, but it can also push directly to ES), and that tool should not alter the data beyond the JSON parsing it is configured to do. Its mission is "read one json object per line and push it with some metadata to XXX". With that vision, having it change data types for memory optimizations is unexpected.

And when it actually causes trouble (as in the example above), then yes, there should be a way to disable such optimizations. The problem is making that option intelligible to people who don't have this context; something like "convert strings to numbers when possible" should do the trick.
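
For illustration only, such an opt-out could look like the snippet below; convert_strings_to_numbers is a hypothetical option name, not something filebeat 5.2 actually supports:

processors:
- decode_json_fields:
    fields: ["message"]
    target: "foobar"
    max_depth: 10
    # Hypothetical flag (does not exist in filebeat 5.2): keep JSON
    # string values as strings instead of auto-converting them to numbers.
    convert_strings_to_numbers: false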
