System.diskio obtaining NaN (Failed to serialize the event: unsupported float value: NaN)


(Pedro González Serrano) #1

We're just setting up metricbeat and getting in a bunch of machines communication errors with logstash:

|2018-11-15T12:01:59.051+0100|INFO|pipeline/output.go:95|Connecting to backoff(async(tcp://anonDEST_HOST:anonDEST_PORT))|
|---|---|---|---|
|2018-11-15T12:01:59.088+0100|INFO|pipeline/output.go:105|Connection to backoff(async(tcp://anonDEST_HOST:anonDEST_PORT)) established|
|2018-11-15T12:01:59.130+0100|ERROR|logstash/async.go:256|Failed to publish events caused by: read tcp anonSRC_HOST:anonSRC_PORT->anonDEST_HOST:anonDEST_PORT: use of closed network connection|
|2018-11-15T12:01:59.130+0100|ERROR|logstash/async.go:256|Failed to publish events caused by: unsupported float value: NaN|

We got that with versions 6.4.2, 6.4.3 and 6.5.0

Tracing the communication errors we've found they are caused by system.diskio (disabling it makes everything work smooth)

Drilling into the issue, when using file output and only system.diskio with logging.level debug we only get in the logs:

2018-11-15T14:32:54.612+0100    WARN    fileout/file.go:127     Failed to serialize the event: unsupported float value: NaN
2018-11-15T14:32:54.612+0100    WARN    fileout/file.go:127     Failed to serialize the event: unsupported float value: NaN

Despite of the error some events get logged at the file


(Jaime Soriano) #2

Hi @niteman and welcome :slight_smile:

This looks like a bug, could you provide more details about your scenario so we can try to reproduce it? It could be specially useful to know the operating system and the configuration for diskio metricset. For this case it can be also relevant to know if you use some special storage system in this host.

Thanks for reporting!


(José Maia) #3

Reporting the same issue.

My configuration for diskio is as vanilla as it gets:

- module: system
  metricsets:
    # CPU stats
    - cpu
    (...)
    # IO stats
    - diskio
   (...)
  enabled: true
  period: 10s
  processes: ['.*']
  cpu.metrics: ["percentages", "normalized_percentages"]

running processors:
processors:
- add_docker_metadata: ~
- add_cloud_metadata: ~
- add_locale: ~

I observed the issue on 7 different machines I created over the last couple of days and added monitoring to today. They are all Ubuntu 16.04 LTS Azure VMs, created with Canonical's images.

Observed on Metricbeat 6.5.0, with Logstash at 6.4.2 and 6.5.0 (i upgraded to see if it'd fix the bug).


#4

If it helps, I am seeing the same issue on fresh install of metricbeat with Elasticsearch as output with both products on 6.5 which was working on 6.4 with same metricbeat.yml on client.

Commenting out diskio from metricbeat.yml removes the error from log.

ERROR elasticsearch/client.go:374 Failed to encode event: unsupported float value: NaN

OS on Client and Server is Debian 9.5 Stretch fully patched running on Cloudstack VM's (KVM).


(Pedro González Serrano) #5

@jsoriano details:

  • New package instal
  • Server details: Bate metal with hardware RAID controller
  • Versions:
# cat /etc/debian_version 
8.10
# metricbeat version
metricbeat version 6.5.0 (amd64), libbeat 6.5.0 [ff5b9b3db49856a25b5eda133b6997f2157a4910 built 2018-11-09 18:03:04 +0000 UTC]
  • Config:
---
name: anon_host
fields_under_root: false
logging:
  level: debug
  files:
    keepfiles: 7
    name: metricbeat
    path: "/var/log/metricbeat"
    rotateeverybytes: '10485760'
  metrics:
    enabled: false
    period: 30s 
  selectors: :undef
  to_files: true
  to_syslog: false
queue:
  mem:
    events: 4096
    flush:
      min_events: 0
      timeout: 0s
metricbeat:
  modules:
  - module: system
    metricsets:
    - diskio
    enabled: true
    period: 10s 
output:
  file:
    enabled: true
    path: "/tmp/metricbeat"

(Jaime Soriano) #6

Thanks all for the details, I have opened a new issue https://github.com/elastic/beats/issues/9124


(Steve) #7

I'm seeing the same thing on RHEL 7.3 after installing 6.5...


#8

As a workaround to get metricbeat back working I cloned the 6.5 branch and built with the changes in Cherry-pick #9125 to 6.5: Fix division by zero on diskio metricset #9137 to diskstat_linux.go and I can confirm that the issue is gone.

metricbeat version 6.5.1 (amd64), libbeat 6.5.1 [1e936aef2a0bdca6e8709b3624aa1c9dbac102b2 built 2018-11-17 03:13:22 +0000 UTC]


(John) #9

diskio is not reporting well due to this, and still having same log error with metricbeat-6.5.1-1 :frowning:


#10

From the release notes, it looks like issue #9124 was not included in the 6.5.1 release that is now in the deb repository. My recompile above that got it back working was by editing the source myself before building with the changes to diskstat_linux.go on github.