Failed to publish events: unsupported float value: NaN

sgerlach · February 25, 2019, 8:37am

Hello,

We have some issues with metricbeat to logstash transport. After some time (can be days after startup), metricbeat starts logging "Failed to publish events: unsupported float value: NaN" every report cycle and repeats the same event forever. That leads to billions of copies of one event, that ist delivered to different logstash instances.

Activated metricbesat modules:

system (cpu, load, core, filesystem, fsstat, memory, network, process, socket)
docker (container, cpu, diskio, healthcheck, image, info, memory, network)
elasticsearch (node, node_stats)
logstash (node, node_stats)
http (json)
redis (info, keyspace)

Our logstash transport is encrypted whith TLS.

Versions:
Oracle Linux Server release 7.6
metricbeat 6.5.4 (running in a docker container)
logstash 6.5.4 (running in a docker container)

If we can provide more information to fix this issue let us know.

Thanks and best regards
Sebastian

jsoriano · February 25, 2019, 3:24pm

Hi @sgerlach,

We had reports of NaNs with the system module but these issues were solved, there was also some NaN appearing with the docker module that I think was not solved.

Could you try to disable modules one by one to try to catch the guilty metricset?

sgerlach · February 26, 2019, 6:08am

Hi @jsoriano,

did we need to disable metrics or would it be enough to enable debug mode to catch the metrics? We can try to find it. But our Problem is that it can take up to 5 days to occur.

Beside the broken metric, can you have a look at the handing of it. One of our main problems is that not only the broken metric gets retransmitted. We receive also valid metrics over and over again if the NaN error appears. That fills up our Elasticsearch with billions copy of one event.

Thanks and best regards
Sebastian

sgerlach · March 5, 2019, 9:42am

This Grafik illustrate a peak of of events when one of our server hit the 'NaN' issue

If we dig in this peak we find a small set of events that get retransmitted over and over again.

sgerlach · March 8, 2019, 6:31am

@jsoriano it looks like the faulty metric is produced by the docker module. At this time i'm not able to tell you the exact metricset. we try to narrow it down. But with the lag in occurrence this debug isn't that easy.

ruflin · March 12, 2019, 8:29am

It seems in the screenshot above, the http.json events get retransmitted again and again as you mentioned. If something gets retransmitted because of an error, we should see it in the metricbeat logs. Any chance you could share the metricbeat logs of this instance around the time it happens?

sgerlach · March 12, 2019, 8:50am

i'm afraid there are gone. If you didn't need the exact logs it try to produce some fresh. It takes a while to trigger the effect.
The behavior is the same and the metrics that spikes my change depending on the dataset that is in the failed bulk. But i hope that isn't a problem.

ruflin · March 12, 2019, 9:04am

I don't need the exact logs. What I hope is that we see some indication which metricset and potentially which field causes the issue.

sgerlach · March 12, 2019, 9:29am

I can try to narrow it down by activate different metricsets on my servers for the docker module. I have to make some config changes and tests. So i hope i can deploy the changes tomorrow.

It would be great if you can have a look at the behavior in generell. We seen the NaN Value more than one time. And each time it makes a lot of trouble while retransmitting "healthy" Events. The NaN Value itself isn't our biggest Problem. Our big problem is the huge amount of duplicates we receive when this occurs.

sgerlach · March 13, 2019, 11:37am

{"level":"error","timestamp":"2019-03-13T12:33:07.498+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:07.499+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.499+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.24:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:07.505+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.20:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.506+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.540+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.24:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.545+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:07.586+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.586+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.27:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:07.641+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.27:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.642+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:07.719+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.720+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.23:20229))"}
{"level":"error","timestamp":"2019-03-13T12:33:07.756+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.756+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.19:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:07.756+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.23:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.757+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.789+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.19:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.790+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:07.795+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.795+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.26:20229))"}
{"level":"error","timestamp":"2019-03-13T12:33:07.839+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.839+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.21:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:07.862+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.26:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.872+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:07.880+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.21:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:07.881+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:08.125+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:08.125+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.22:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:08.158+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.22:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:08.159+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:08.451+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:08.451+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.28:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:08.476+0100","caller":"pipeline/output.go:105","message":"Connection to backoff(async(tcp://X.X.X.28:20229)) established"}
{"level":"error","timestamp":"2019-03-13T12:33:08.477+0100","caller":"logstash/async.go:256","message":"Failed to publish events caused by: unsupported float value: NaN"}
{"level":"error","timestamp":"2019-03-13T12:33:08.498+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:08.498+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.25:20229))"}
{"level":"error","timestamp":"2019-03-13T12:33:08.506+0100","caller":"pipeline/output.go:121","message":"Failed to publish events: unsupported float value: NaN"}
{"level":"info","timestamp":"2019-03-13T12:33:08.506+0100","caller":"pipeline/output.go:95","message":"Connecting to backoff(async(tcp://X.X.X.20:20229))"}
{"level":"info","timestamp":"2019-03-13T12:33:08.513+0100","caller":"cfgfile/reload.go:208","message":"Dynamic config reloader stopped"}
{"level":"info","timestamp":"2019-03-13T12:33:08.513+0100","logger":"reload","caller":"cfgfile/list.go:118","message":"Stopping 8 runners ..."}

sgerlach · March 13, 2019, 1:22pm

OK, i think i found the module: docker -> memory <-

I found a way to "reproduce" the behavior. If you restart some docker container over and over again than the NaN Values will occur. But it can take a while. I suspect that this only happens when the module try to fetch the metric while the container goes up or down.

So i have or can produce a debug log. But maybe you can try to reproduce this behavior. That should make debugging a lot easier.

Let me know if i can provide more support.

ruflin · March 14, 2019, 7:33am

Thank you for all the details and the investigation. All your testing happened on 6.5.4 or 6.6? I think we are at the point where we can call this definitively a bug. Could you open an issue here with it: https://github.com/elastic/beats/issues

I didn't reproduce it on my end yet but if you run it again and have a debug log, please link it in the Github issue that would be of great help.

sgerlach · March 14, 2019, 8:03am

All testing and debugging is done with 6.5.4. We haven't upgraded to 6.6 at this time.
Sure, i can open an issue.
Should i open 2 issues one for the "NaN Value" in docker.memory and an other for tho broken handling? or should i file a combined issue?

ruflin · March 15, 2019, 11:44am

Let's start with 1 issue. I'm suspecting that the two might be related.

sgerlach · March 18, 2019, 6:20am

Bug Report is finely there: https://github.com/elastic/beats/issues/11283

ruflin · March 18, 2019, 10:22am

Thanks

system · April 15, 2019, 10:22am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Failed to publish events caused by: unsupported float value: NaN Beats metricbeat	5	1465	August 23, 2018
Failed to encode event: unsupported float value: NaN Beats metricbeat	8	1523	March 21, 2019
System.diskio obtaining NaN (Failed to serialize the event: unsupported float value: NaN) Beats metricbeat	10	1623	December 19, 2018
Metricbeat 7.17.20 cannot index logstash metrics into Elasticsearch Beats docker , metricbeat	1	147	April 26, 2024
Metricbeat: Failed to encode event: unsupported float value: NaN Beats metricbeat	2	1481	August 6, 2018

Failed to publish events: unsupported float value: NaN

Related topics