Logstash converts integer value in scientific notation to float, causing mapping error in Elasticsearch

Background

Given the following input (these are snippets from the input JSON Lines):

"MEMLIMIT Size":0
...
"MEMLIMIT Size":0.8590E+10

Logstash (I'm using 7.9.3) outputs:

"MEMLIMIT Size" => 0

"MEMLIMIT Size" => 8590000000.0

I'm fine with the 0 output value.

However, I'm not fine with the trailing .0 on 8590000000.0, because it causes the following mapping error:

Could not index event ... mapper [MEMLIMIT Size] cannot be changed from type [long] to [float]

I do not want to configure Logstash (or Elasticsearch, for that matter) to perform any special processing on the field "MEMLIMIT Size", because this is just one example of such a field. Other fields might also have integer values represented in scientific notation.

I have some inkling why Logstash might do this. This particular input JSON Lines format, which I helped to design, deliberately specifies a trailing .0 on integer values for fields that might contain a float value. The aim is to stop Elasticsearch from trying to index a float value into a field that has been incorrectly mapped as an integer on the basis of the first value indexed.
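For example, a field that can legitimately hold fractional values is emitted with the trailing .0 even when the current value happens to be whole (the field name here is hypothetical):

"CPU Busy Percent":0.0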

However, in this case, the trailing .0 is undesirable.

Questions

  • How do I prevent Logstash from appending that .0? Especially when the "10" in "E+10" shifts the decimal point way past the number of digits specified in the original value? My "inkling" aside, it's presumptuous of Logstash to specify that level of precision.

  • Alternatively, is there a way to prevent Logstash from expanding the scientific notation? Or would that just move the same problem to Elasticsearch? Would Elasticsearch expand that notation and "append" the .0, resulting in the same problem? (I haven't tested sending such scientific notation directly to Elasticsearch.)

Possible answer

Perhaps: iterate over all numeric fields; if a field value is greater than, say, 99999999999, convert (mutate?) it to an integer (i.e. truncate any decimal fraction). If that sounds doable, I'd appreciate help coding an efficient solution (with a Ruby .each? I'm a Ruby newbie).

Config

Here's my Logstash config (with output set to stdout for testing; normally, it outputs to elastic; yes, I understand that line is the default codec for stdin).

input {
  stdin {
    codec => line
  }
}
filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:_time}" }
  }
  date {
    match => [ "_time", "ISO8601" ]
  }
  json {
    source => "message"
    remove_field => [ "_time", "message" ]
  }
  mutate {
    lowercase => [ "code" ]
  }
}
output {
  stdout {
  }
}

You could try

    ruby {
        code => '
            # For each top-level field: if the value is a Float that is
            # numerically a whole number, convert it back to an integer.
            event.to_hash.each { |k, v|
                if v.is_a? Float
                    if v.to_i.to_f == v
                        event.set(k, v.to_i)
                    end
                end
            }
        '
    }

If you need to iterate into nested objects, then something like the following should give you some ideas on how to do it.
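A rough, untested sketch (the recursion into nested hashes and arrays, and the reuse of the same whole-number check, are my own guesses at what you would want):

    ruby {
        code => '
            # Recursively walk hashes and arrays; any Float that is
            # numerically a whole number is converted back to an integer.
            walk = lambda { |obj|
                case obj
                when Hash  then obj.each { |k, v| obj[k] = walk.call(v) }
                when Array then obj.map  { |v| walk.call(v) }
                when Float then obj == obj.to_i ? obj.to_i : obj
                else obj
                end
            }
            walk.call(event.to_hash).each { |k, v| event.set(k, v) }
        '
    }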

Thanks very much for the Ruby.

I've made minor tweaks:

  ruby {
    code => '
      # Any numeric value greater than 9999999 (i.e. 8 or more digits)
      # is truncated to an integer.
      event.to_hash.each { |k, v|
        if v.is_a? Numeric
          if v > 9999999
            event.set(k, v.to_i)
          end
        end
      }
    '
  }

I used Numeric because if v.is_a? Float didn't "work" for the input "MEMLIMIT Size":0.8590E+10; Ruby doesn't recognize that input value as a Float. Which I think is actually a good thing, but then, if v.is_a? Integer doesn't work for that value, either. Too big? I couldn't find, say, a Long class in the Ruby docs. Diving into Ruby is very tempting :grinning:, but at this point I need to focus on ingesting this data in Elastic.

I decided to convert to an integer any numeric value of 8 or more digits (greater than 9999999). That matches the point at which my input JSON Lines data starts using scientific notation.
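If the JSON parser is handing such values to the event as BigDecimal objects rather than Floats, that would explain what I'm seeing; that's a guess on my part, I haven't verified it. In plain Ruby, at least, a BigDecimal behaves the way I observed:

  require "bigdecimal"

  v = BigDecimal("0.8590E+10")
  v.is_a?(Float)    # => false
  v.is_a?(Integer)  # => false
  v.is_a?(Numeric)  # => true
  v > 9999999       # => true
  v.to_i            # => 8590000000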

While I'm very grateful for this code, and it gets me past this problem, I don't like having to do this at all. It feels dirty; a processing-intensive workaround that shouldn't be necessary. I view that trailing .0 as a bug. I'd value your thoughts on this. Also, if you agree with me: I'm unsure which Logstash component to raise an issue against. Without delving into the code (I know, I should), I'm unsure at which point that trailing .0 gets added (or, if it's not quite the same thing, at which point that scientific notation value gets converted to a floating-point value).

May your sett be forever safe and your litters plentiful.

I could work around this issue "upstream" by not using scientific notation in the input data.

However, I want to use scientific notation because it's compact; it avoids a whole lotta zeros.

For data that often includes big numbers, and in contexts where you pay for every byte ingested, those zeros add up. :wink:

Logstash uses a third-party library to parse JSON. Jackson, I believe. So the bug, if it is a bug, would be in the third-party library.


I checked, and you're correct: Logstash uses JrJackson, a JRuby wrapper for Jackson.

GitHub doesn't show an Issues tab item for me for the Jackson project, so I created an issue for the subordinate jackson-databind project: "Large integer in scientific notation converted to float (with trailing .0)".

I might get slapped for creating that issue in the wrong place. We'll see.

I wondered whether the presence of a decimal point in the original scientific notation value had an effect on the expanded output value, so I tested an input value of 859E+7. Nope. Still came out as 8590000000.0, with the trailing .0.
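For what it's worth, as far as I can tell the stock JSON library in plain (non-JRuby) Ruby does the same expansion, so the behavior doesn't look peculiar to JrJackson:

  require "json"

  JSON.parse('{"MEMLIMIT Size":0.8590E+10}')
  # => {"MEMLIMIT Size"=>8590000000.0}
  JSON.parse('{"MEMLIMIT Size":859E+7}')
  # => {"MEMLIMIT Size"=>8590000000.0}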

After a Jackson developer asked me for a Jackson-only test case, I retreated, and instead created an issue for the Logstash project: "Logstash converts integer value in scientific notation to float, causing mapping error in Elasticsearch".

I'm not comfortable creating essentially the same issue against two different GitHub projects. On the other hand, I think I'm doing the right thing (a) reporting to the Logstash developers the effect that this behavior has on Elastic Stack and (b) reporting to the Jackson developers that, independent of Logstash, I think it's wrong to serialize a trailing .0 on an input value that does not indicate that level of precision.
