Metricbeat module docker vs memory dataset

When I try to run metricbeat 7.6.1 on a CentOS 7 box with docker-engine-17.05.0.ce-1.el7.centos.x86_64, I see these warnings logged every time Metricbeat attempts to ship the memory metricset:

2020-03-24T11:14:00.316+0100<tab>ERROR<tab>elasticsearch/client.go:410<tab>Failed to encode event: unsupported float value: NaN
2020-03-24T11:14:00.316+0100<tab>ERROR<tab>elasticsearch/client.go:410<tab>Failed to encode event: unsupported float value: NaN 

<tab> = Ctrl-I chars

And thus I see no docs with event.dataset: docker.memory. Wondering why?

@stefws

What modules and metricsets are enabled? What processors are enabled?

Also, how often are you seeing these errors? Is it intermittent?

The default metricsets from the default docker module. The errors disappear if I comment out the memory metricset, and the frequency is twice per docker module cycle. Also, event.dataset: docker.memory is the only dataset missing from my metricbeat index. What docker API calls are made to fetch the memory datasets?

More specifically, these are the metricsets, and the error frequency matches the period:

- module: docker
  metricsets:
    - container
    - cpu
    - diskio
    - event
    - healthcheck
    - image
    - info
    #- memory
    - network
  period: 1m

and only this processor is in metricbeat.yml; again, the warnings stop if I disable the memory metricset above in modules.d/docker.yml:

processors:
  - add_host_metadata: ~

@stefws We had some (now patched) issues in the past, due to environments where docker was being restarted, or containers were frequently being stopped and started. Does that apply to you?

Can you enable debug-level logging and give us some log lines surrounding the "Failed to encode event" error you posted in your original message?

If possible, can you run a few curl commands against the docker engine for me?

curl -H "Content-Type: application/json" --head --unix-socket /var/run/docker.sock 'http://v1.24/_ping'

Should return the API version as something like Api-Version: 1.40

You can use this to get the raw stats object for a running container (substitute one of your own container IDs), which might help with debugging:

curl -H "Content-Type: application/json" -XGET --unix-socket /var/run/docker.sock 'http://v1.40/containers/f7c3c92bf84fec74ce50645328719b5424ac16aeef3c473a5698708fe029b328/stats?stream=false'

# curl -H "Content-Type: application/json" --head --unix-socket /var/run/docker.sock 'http://v1.24/_ping'
HTTP/1.1 404 Not Found
Content-Type: application/json
Date: Thu, 26 Mar 2020 07:38:37 GMT
Content-Length: 29

Not on this specific test box...

2020-03-26T08:42:53.274+0100    DEBUG   [elasticsearch] elasticsearch/client.go:354     PublishEvents: 13 events have been published to elasticsearch in 15.6892ms.
2020-03-26T08:42:53.302+0100    DEBUG   [elasticsearch] elasticsearch/client.go:354     PublishEvents: 50 events have been published to elasticsearch in 28.216343ms.
2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/ackloop.go:160 ackloop: receive ack [0: 0, 50]
2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/ackloop.go:160 ackloop: receive ack [1: 0, 13]
2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=6, start-seq=1, end-seq=6

2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=26, start-seq=1, end-seq=26

2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=1, start-seq=1, end-seq=1

2020-03-26T08:42:53.303+0100    ERROR   elasticsearch/client.go:410     Failed to encode event: unsupported float value: NaN
2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=1, start-seq=1, end-seq=1

2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=1, start-seq=1, end-seq=1

2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=1, start-seq=1, end-seq=1

2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=6, start-seq=1, end-seq=6

2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=8, start-seq=1, end-seq=8

2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=3, start-seq=1, end-seq=3

2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=2, start-seq=1, end-seq=2

2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=8, start-seq=1, end-seq=8

2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/ackloop.go:128 ackloop: return ack to broker loop:63
2020-03-26T08:42:53.303+0100    DEBUG   [publisher]     memqueue/ackloop.go:131 ackloop:  done send ack
2020-03-26T08:42:53.303+0100    DEBUG   [elasticsearch] elasticsearch/client.go:411     Failed event: &{2020-03-26 08:42:38.620610497 +0100 CET m=+0.321576752 {} Not valid json: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value <nil> true}
2020-03-26T08:42:53.304+0100    ERROR   elasticsearch/client.go:410     Failed to encode event: unsupported float value: NaN
2020-03-26T08:42:53.304+0100    DEBUG   [elasticsearch] elasticsearch/client.go:411     Failed event: &{2020-03-26 08:42:38.620610497 +0100 CET m=+0.321576752 {} Not valid json: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value <nil> true}
2020-03-26T08:42:53.317+0100    DEBUG   [elasticsearch] elasticsearch/client.go:354     PublishEvents: 11 events have been published to elasticsearch in 14.59026ms.
2020-03-26T08:42:53.317+0100    DEBUG   [publisher]     memqueue/ackloop.go:160 ackloop: receive ack [2: 0, 13]
2020-03-26T08:42:53.317+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=5, start-seq=27, end-seq=31

2020-03-26T08:42:53.317+0100    DEBUG   [publisher]     memqueue/eventloop.go:535       broker ACK events: count=8, start-seq=7, end-seq=14

2020-03-26T08:42:53.317+0100    DEBUG   [publisher]     memqueue/ackloop.go:128 ackloop: return ack to broker loop:13
2020-03-26T08:42:53.317+0100    DEBUG   [publisher]     memqueue/ackloop.go:131 ackloop:  done send ack

So, that version of docker is old enough that it's not behaving the way I'm expecting. We can get the API version via the CLI:

docker version | grep API
 API version:     1.26
 API version:     1.26 (minimum version 1.12)

And then use this to get stats from a running container:

curl -H "Content-Type: application/json" --unix-socket /var/run/docker.sock 'http://v1.26/containers/json'

Hopefully this will help us see what's going on with the bad data. We might also be able to get the raw reported metrics via the console output: if you configure the console output and disable the elasticsearch output, the "guilty" event may show up in the journal/syslog output, depending on how you're running Metricbeat.
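A minimal sketch of that debugging configuration, assuming a standard metricbeat.yml (only one output may be enabled at a time, so the elasticsearch output is commented out here):

```yaml
# metricbeat.yml -- temporary, for debugging only
#output.elasticsearch:
#  hosts: ["localhost:9200"]
output.console:
  pretty: true
```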

# docker version | grep API
 API version:  1.29
 API version:  1.29 (minimum version 1.12)
[
  {
    "Id": "88e4ce3a1345bd37ed5f2ec879d1c83742e74b9a24167322527bd976d9dbb61f",
    "Names": [
      "/regitry"
    ],
    "Image": "registry:2",
    "ImageID": "sha256:d1e32b95d8e811a7c5ecddd090833dacd5f61f4f8d69eff5b37b482164fd39e8",
    "Command": "/entrypoint.sh /etc/docker/registry/config.yml",
    "Created": 1584794163,
    "Ports": [
      {
        "IP": "0.0.0.0",
        "PrivatePort": 5000,
        "PublicPort": 5000,
        "Type": "tcp"
      }
    ],
    "Labels": {},
    "State": "running",
    "Status": "Up 5 days",
    "HostConfig": {
      "NetworkMode": "bridge"
    },
    "NetworkSettings": {
      "Networks": {
        "bridge": {
          "IPAMConfig": null,
          "Links": null,
          "Aliases": null,
          "NetworkID": "4aa51db304b3cdecb70cfc10f7487466a188a7b97a3d1a88055b90584f715457",
          "EndpointID": "f40a017e412398f8c0df6aba95ac0834f779627f5bf492cf52481f5894854fa7",
          "Gateway": "172.17.0.1",
          "IPAddress": "172.17.0.3",
          "IPPrefixLen": 16,
          "IPv6Gateway": "",
          "GlobalIPv6Address": "",
          "GlobalIPv6PrefixLen": 0,
          "MacAddress": "02:42:ac:11:00:03"
        }
      }
    },
    "Mounts": [
      {
        "Type": "bind",
        "Source": "/data/docker_registry",
        "Destination": "/var/lib/registry",
        "Mode": "",
        "RW": true,
        "Propagation": ""
      }
    ]
  },
  {
    "Id": "30bfd55813d0e242089a322093051f68599b561f386d5f5e913fd73c1d806469",
    "Names": [
      "/helloworld"
    ],
    "Image": "ubuntu:precise",
    "ImageID": "sha256:f0d07a756afde882429cfc57c83371f145ff96a5933d5b85b2a525e321e2ac61",
    "Command": "/bin/sh -c 'while true; do echo hello world; sleep 1; done'",
    "Created": 1584794163,
    "Ports": [],
    "Labels": {},
    "State": "running",
    "Status": "Up 5 days",
    "HostConfig": {
      "NetworkMode": "bridge"
    },
    "NetworkSettings": {
      "Networks": {
        "bridge": {
          "IPAMConfig": null,
          "Links": null,
          "Aliases": null,
          "NetworkID": "4aa51db304b3cdecb70cfc10f7487466a188a7b97a3d1a88055b90584f715457",
          "EndpointID": "40a4d57f1e7dc947a30484ac626260fdbca01275df8ce0448f8e78d5e0c1ebce",
          "Gateway": "172.17.0.1",
          "IPAddress": "172.17.0.2",
          "IPPrefixLen": 16,
          "IPv6Gateway": "",
          "GlobalIPv6Address": "",
          "GlobalIPv6PrefixLen": 0,
          "MacAddress": "02:42:ac:11:00:02"
        }
      }
    },
    "Mounts": []
  }
]

The culprit is probably the memory stats call returning empty/invalid JSON, thus causing the NaN error, as the debug logging showed:

Failed event: &{2020-03-26 08:42:38.620610497 +0100 CET m=+0.321576752 {} Not valid json: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value <nil> true}

Docker API v1.29 shouldn't be too old, should it? The docker module documentation says it has been tested with version 1.11, though it also mentions 17.09.0-ce, and I've got:

# docker version
Client:
 Version:      17.05.0-ce
 API version:  1.29

Ah, sorry, I need the stats object, not the container info.

curl -H "Content-Type: application/json" -XGET --unix-socket /var/run/docker.sock 'http://v1.29/containers/f7c3c92bf84fec74ce50645328719b5424ac16aeef3c473a5698708fe029b328/stats?stream=false'

Right now, the most likely explanation is that something in the raw metrics already contains a NaN, and this causes failures when the metrics are serialized to JSON. I'm just not sure where that NaN is actually coming from. Have you tried disabling add_host_metadata?
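For context on the encoder error itself: JSON has no literal for NaN, so Go's encoding/json (which Beats uses) refuses to serialize it, which is exactly the `unsupported float value: NaN` error above. A minimal illustration in Python, where `allow_nan=False` mirrors Go's strict behavior (the event shape here is just a stand-in, not the real Beats event):

```python
import json
import math

# A stand-in for a docker.memory event whose percentage field came out as NaN.
event = {"docker": {"memory": {"usage": {"pct": math.nan}}}}

try:
    # JSON has no NaN literal; a strict encoder must reject the value.
    # Go's encoding/json always does; Python only when allow_nan=False.
    json.dumps(event, allow_nan=False)
except ValueError as err:
    print(err)  # Out of range float values are not JSON compliant
```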

No NaN, but it seems the memory stats are empty:

{
  "read": "2020-03-27T15:07:47.887731931Z",
  "preread": "2020-03-27T15:07:46.886601035Z",
  "pids_stats": {
    "current": 2
  },
  "blkio_stats": {
    "io_service_bytes_recursive": [
      {
        "major": 253,
        "minor": 2,
        "op": "Read",
        "value": 4096
      },
      {
        "major": 253,
        "minor": 2,
        "op": "Write",
        "value": 0
      },
      {
        "major": 253,
        "minor": 2,
        "op": "Sync",
        "value": 0
      },
      {
        "major": 253,
        "minor": 2,
        "op": "Async",
        "value": 4096
      },
      {
        "major": 253,
        "minor": 2,
        "op": "Total",
        "value": 4096
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Read",
        "value": 13553664
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Write",
        "value": 0
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Sync",
        "value": 0
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Async",
        "value": 13553664
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Total",
        "value": 13553664
      },
      {
        "major": 253,
        "minor": 5,
        "op": "Read",
        "value": 13541376
      },
      {
        "major": 253,
        "minor": 5,
        "op": "Write",
        "value": 0
      },
      {
        "major": 253,
        "minor": 5,
        "op": "Sync",
        "value": 0
      },
      {
        "major": 253,
        "minor": 5,
        "op": "Async",
        "value": 13541376
      },
      {
        "major": 253,
        "minor": 5,
        "op": "Total",
        "value": 13541376
      }
    ],
    "io_serviced_recursive": [
      {
        "major": 253,
        "minor": 2,
        "op": "Read",
        "value": 1
      },
      {
        "major": 253,
        "minor": 2,
        "op": "Write",
        "value": 0
      },
      {
        "major": 253,
        "minor": 2,
        "op": "Sync",
        "value": 0
      },
      {
        "major": 253,
        "minor": 2,
        "op": "Async",
        "value": 1
      },
      {
        "major": 253,
        "minor": 2,
        "op": "Total",
        "value": 1
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Read",
        "value": 388
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Write",
        "value": 0
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Sync",
        "value": 0
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Async",
        "value": 388
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Total",
        "value": 388
      },
      {
        "major": 253,
        "minor": 5,
        "op": "Read",
        "value": 385
      },
      {
        "major": 253,
        "minor": 5,
        "op": "Write",
        "value": 0
      },
      {
        "major": 253,
        "minor": 5,
        "op": "Sync",
        "value": 0
      },
      {
        "major": 253,
        "minor": 5,
        "op": "Async",
        "value": 385
      },
      {
        "major": 253,
        "minor": 5,
        "op": "Total",
        "value": 385
      }
    ],
    "io_queue_recursive": [],
    "io_service_time_recursive": [],
    "io_wait_time_recursive": [],
    "io_merged_recursive": [],
    "io_time_recursive": [],
    "sectors_recursive": []
  },
  "num_procs": 0,
  "storage_stats": {},
  "cpu_stats": {
    "cpu_usage": {
      "total_usage": 1058704490044,
      "percpu_usage": [
        530164494788,
        528539995256
      ],
      "usage_in_kernelmode": 732500000000,
      "usage_in_usermode": 223810000000
    },
    "system_cpu_usage": 1048970030000000,
    "online_cpus": 2,
    "throttling_data": {
      "periods": 0,
      "throttled_periods": 0,
      "throttled_time": 0
    }
  },
  "precpu_stats": {
    "cpu_usage": {
      "total_usage": 1058702823121,
      "percpu_usage": [
        530164494788,
        528538328333
      ],
      "usage_in_kernelmode": 732500000000,
      "usage_in_usermode": 223810000000
    },
    "system_cpu_usage": 1048968030000000,
    "online_cpus": 2,
    "throttling_data": {
      "periods": 0,
      "throttled_periods": 0,
      "throttled_time": 0
    }
  },
  "memory_stats": {},
  "name": "/helloworld",
  "id": "30bfd55813d0e242089a322093051f68599b561f386d5f5e913fd73c1d806469",
  "networks": {
    "eth0": {
      "rx_bytes": 1312,
      "rx_packets": 16,
      "rx_errors": 0,
      "rx_dropped": 0,
      "tx_bytes": 0,
      "tx_packets": 0,
      "tx_errors": 0,
      "tx_dropped": 0
    }
  }
}

No, I haven't tried without add_host_metadata, as turning off the memory metricset solves the issue :slight_smile:

Well, there's the problem. I was expecting the stats object to be incomplete; instead it's missing entirely. I've never seen that before. Can you describe the install environment in more detail? Is this just CentOS 7 running in a VM? How did you install Docker?

Also, does this command return anything?
find /sys/fs/cgroup/memory/system.slice/docker*.scope -name "memory.stat" | xargs cat

If not, can I get the output from find /sys/fs/cgroup/memory/system.slice/docker* | grep memory
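A plausible mechanism for the NaN, sketched in Python under the assumption (not confirmed in this thread) that the metricset computes a usage percentage as usage/limit: with `"memory_stats": {}`, both fields decode to zero, and 0.0/0.0 is NaN under IEEE 754. Go evaluates that silently; Python raises ZeroDivisionError instead, so the sketch spells the IEEE result out. The `usage_pct` helper is hypothetical:

```python
import math

def usage_pct(stats: dict) -> float:
    """Hypothetical sketch of a usage/limit computation over a decoded
    stats object. An empty "memory_stats" yields zero for both fields."""
    mem = stats.get("memory_stats", {})
    usage = float(mem.get("usage", 0))
    limit = float(mem.get("limit", 0))
    if limit == 0.0:
        # IEEE 754: 0.0/0.0 is NaN, x/0.0 is Inf. Go computes this
        # silently; Python would raise, so make the result explicit.
        return math.nan if usage == 0.0 else math.inf
    return usage / limit

# The stats object pasted above has "memory_stats": {}
print(math.isnan(usage_pct({"memory_stats": {}})))  # True
```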

A VMware VM running:

[root ~]# cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)

I didn't install docker myself, but I assume a pretty regular yum install docker-engine was used...

[root ~]# rpm -qa | grep docker
docker-engine-17.05.0.ce-1.el7.centos.x86_64
docker-engine-selinux-17.05.0.ce-1.el7.centos.noarch

Nothing in the cgroup filesystem that resembles memory*:

[root ~]# find /sys/fs/cgroup/ -name "memory.stat" 
blkio/            cpuacct/          cpuset/           freezer/          net_cls/          net_prio/         pids/             
cpu/              cpu,cpuacct/      devices/          hugetlb/          net_cls,net_prio/ perf_event/       systemd/          
[root ~]# find /sys/fs/cgroup/hugetlb/ -name "memory.stat" 
cgroup.clone_children           cgroup.procs                    docker/                         hugetlb.2MB.limit_in_bytes      hugetlb.2MB.usage_in_bytes      release_agent
cgroup.event_control            cgroup.sane_behavior            hugetlb.2MB.failcnt             hugetlb.2MB.max_usage_in_bytes  notify_on_release               tasks
[root ~]# find /sys/fs/cgroup/hugetlb/docker/ -name "memory.stat" 
30bfd55813d0e242089a322093051f68599b561f386d5f5e913fd73c1d806469/ cgroup.procs                                                      hugetlb.2MB.usage_in_bytes
88e4ce3a1345bd37ed5f2ec879d1c83742e74b9a24167322527bd976d9dbb61f/ hugetlb.2MB.failcnt                                               notify_on_release
cgroup.clone_children                                             hugetlb.2MB.limit_in_bytes                                        tasks
cgroup.event_control                                              hugetlb.2MB.max_usage_in_bytes                                    
[root ~]# find /sys/fs/cgroup/hugetlb/docker/30bfd55813d0e242089a322093051f68599b561f386d5f5e913fd73c1d806469/ -name "memory.stat" 
cgroup.clone_children           cgroup.procs                    hugetlb.2MB.limit_in_bytes      hugetlb.2MB.usage_in_bytes      tasks                           
cgroup.event_control            hugetlb.2MB.failcnt             hugetlb.2MB.max_usage_in_bytes  notify_on_release               
[root ~]# find /sys/fs/cgroup/hugetlb/docker/30bfd55813d0e242089a322093051f68599b561f386d5f5e913fd73c1d806469/hugetlb.2MB. -name "memory.stat" 
hugetlb.2MB.failcnt             hugetlb.2MB.limit_in_bytes      hugetlb.2MB.max_usage_in_bytes  hugetlb.2MB.usage_in_bytes      
[root ~]# find /sys/fs/cgroup/hugetlb/docker/30bfd55813d0e242089a322093051f68599b561f386d5f5e913fd73c1d806469/hugetlb.2MB.usage_in_bytes -name "memory.stat" 
hugetlb.2MB.usage_in_bytes
[root ~]# find /sys/fs/cgroup/hugetlb/docker/30bfd55813d0e242089a322093051f68599b561f386d5f5e913fd73c1d806469/hugetlb.2MB.usage_in_bytes -name "memory.stat" 
[root ~]# find /sys/fs/cgroup/ -name 'memory*'
[root ~]# 

Why not add an extra assertion that the JSON content isn't empty before attempting to parse it?

That's probably what we'll end up having to do. I'd just like to reproduce this and know exactly what kind of environment causes it, so I can test and fix it properly. Thankfully, this isn't too hard to fix, and we should have a fix in time for the next patch release.
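A sketch of the kind of guard being discussed, with a hypothetical `safe_usage_pct` helper: drop the field (or the whole metricset event) when the stats object is empty, rather than emitting a NaN that the encoder will reject:

```python
def safe_usage_pct(mem: dict):
    """Hypothetical guard: return None instead of computing 0/0
    when the decoded memory stats object is empty."""
    if not mem or mem.get("limit", 0) == 0:
        return None  # skip the metric rather than emit NaN
    return mem["usage"] / mem["limit"]

print(safe_usage_pct({}))                             # None
print(safe_usage_pct({"usage": 512, "limit": 2048}))  # 0.25
```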