Failed to publish events: unsupported float value: NaN


(Sebastian) #1

Hello,

We have some issues with metricbeat to logstash transport. After some time (can be days after startup), metricbeat starts logging "Failed to publish events: unsupported float value: NaN" every report cycle and repeats the same event forever. That leads to billions of copies of one event, that ist delivered to different logstash instances.

Activated metricbesat modules:

  • system (cpu, load, core, filesystem, fsstat, memory, network, process, socket)
  • docker (container, cpu, diskio, healthcheck, image, info, memory, network)
  • elasticsearch (node, node_stats)
  • logstash (node, node_stats)
  • http (json)
  • redis (info, keyspace)

Our logstash transport is encrypted whith TLS.

Versions:
Oracle Linux Server release 7.6
metricbeat 6.5.4 (running in a docker container)
logstash 6.5.4 (running in a docker container)

If we can provide more information to fix this issue let us know.

Thanks and best regards
Sebastian


(Jaime Soriano) #2

Hi @sgerlach,

We had reports of NaNs with the system module but these issues were solved, there was also some NaN appearing with the docker module that I think was not solved.

Could you try to disable modules one by one to try to catch the guilty metricset?


(Sebastian) #3

Hi @jsoriano,

did we need to disable metrics or would it be enough to enable debug mode to catch the metrics? We can try to find it. But our Problem is that it can take up to 5 days to occur.

Beside the broken metric, can you have a look at the handing of it. One of our main problems is that not only the broken metric gets retransmitted. We receive also valid metrics over and over again if the NaN error appears. That fills up our Elasticsearch with billions copy of one event.

Thanks and best regards
Sebastian


(Sebastian) #4


This Grafik illustrate a peak of of events when one of our server hit the 'NaN' issue

If we dig in this peak we find a small set of events that get retransmitted over and over again.


(Sebastian) #5

@jsoriano it looks like the faulty metric is produced by the docker module. At this time i'm not able to tell you the exact metricset. we try to narrow it down. But with the lag in occurrence this debug isn't that easy.


(ruflin) #6

It seems in the screenshot above, the http.json events get retransmitted again and again as you mentioned. If something gets retransmitted because of an error, we should see it in the metricbeat logs. Any chance you could share the metricbeat logs of this instance around the time it happens?


(Sebastian) #7

i'm afraid there are gone. If you didn't need the exact logs it try to produce some fresh. It takes a while to trigger the effect.
The behavior is the same and the metrics that spikes my change depending on the dataset that is in the failed bulk. But i hope that isn't a problem.


(ruflin) #8

I don't need the exact logs. What I hope is that we see some indication which metricset and potentially which field causes the issue.


(Sebastian) #9

I can try to narrow it down by activate different metricsets on my servers for the docker module. I have to make some config changes and tests. So i hope i can deploy the changes tomorrow.

It would be great if you can have a look at the behavior in generell. We seen the NaN Value more than one time. And each time it makes a lot of trouble while retransmitting "healthy" Events. The NaN Value itself isn't our biggest Problem. Our big problem is the huge amount of duplicates we receive when this occurs.


(Sebastian) #10
{"level":"error","timestamp":"2019-03-13T12:33:07.498+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:07.499+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.499+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.24:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:07.505+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.20:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.506+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.540+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.24:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.545+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:07.586+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.586+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.27:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:07.641+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.27:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.642+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:07.719+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.720+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.23:20229))"}
{"level":"error","timestamp":"2019-03-13T12:33:07.756+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.756+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.19:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:07.756+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.23:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.757+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.789+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.19:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.790+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:07.795+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.795+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.26:20229))"}
{"level":"error","timestamp":"2019-03-13T12:33:07.839+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.839+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.21:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:07.862+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.26:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.872+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.880+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.21:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.881+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:08.125+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:08.125+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.22:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:08.158+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.22:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:08.159+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:08.451+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:08.451+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.28:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:08.476+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.28:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:08.477+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:08.498+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:08.498+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.25:20229))"}
{"level":"error","timestamp":"2019-03-13T12:33:08.506+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:08.506+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.20:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:08.513+0100","caller":"cfgfile/reload.go:208","message":"Dynamic config reloader stopped"}
{"level":"info","timestamp":"2019-03-13T12:33:08.513+0100","logger":"reload","caller":"cfgfile/list.go:118","message":"Stopping 8 runners ..."}

(Sebastian) #11

OK, i think i found the module: docker -> memory <-

I found a way to "reproduce" the behavior. If you restart some docker container over and over again than the NaN Values will occur. But it can take a while. I suspect that this only happens when the module try to fetch the metric while the container goes up or down.

So i have or can produce a debug log. But maybe you can try to reproduce this behavior. That should make debugging a lot easier.

Let me know if i can provide more support.


(ruflin) #12

Thank you for all the details and the investigation. All your testing happened on 6.5.4 or 6.6? I think we are at the point where we can call this definitively a bug. Could you open an issue here with it: https://github.com/elastic/beats/issues

I didn't reproduce it on my end yet but if you run it again and have a debug log, please link it in the Github issue that would be of great help.


(Sebastian) #13

All testing and debugging is done with 6.5.4. We haven't upgraded to 6.6 at this time.
Sure, i can open an issue.
Should i open 2 issues one for the "NaN Value" in docker.memory and an other for tho broken handling? or should i file a combined issue?


(ruflin) #14

Let's start with 1 issue. I'm suspecting that the two might be related.


(Sebastian) #15

Bug Report is finely there: https://github.com/elastic/beats/issues/11283


(ruflin) #16

Thanks