Metricbeat elasticsearch-xpack not reliably sending data to monitoring cluster

I've installed Metricbeat to ship monitoring metrics from my Elasticsearch cluster to a separate monitoring cluster, but despite the collection period being 10s, from what I can see it's hardly ever sending any actual data.

Does anyone know why this might be? How can I debug this?

Thanks for any help offered!

There are Metricbeat log messages in syslog every ~40s, but I hardly ever get any shard monitoring data.

Hi @ndtreviv 🙂

Please can you share your full configuration and the full log you get from the moment you start Metricbeat?

I'm not getting any errors in the logs at all, but I'll include the syslog messages from startup until the point where the INFO messages start repeating, so you can see:

Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.191Z#011INFO#011instance/beat.go:660#011Home path: [/usr/share/metricbeat] Config path: [/etc/metricbeat] Data path: [/var/lib/metricbeat] Logs path: [/var/log/metricbeat]
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.191Z#011INFO#011instance/beat.go:668#011Beat ID: 4cfa5e4f-00a1-4a8c-a490-4a009a71bc97
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.192Z#011INFO#011[seccomp]#011seccomp/seccomp.go:124#011Syscall filter successfully installed
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.193Z#011INFO#011[beat]#011instance/beat.go:996#011Beat info#011{"system_info": {"beat": {"path": {"config": "/etc/metricbeat", "data": "/var/lib/metricbeat", "home": "/usr/share/metricbeat", "logs": "/var/log/metricbeat"}, "type": "metricbeat", "uuid": "4cfa5e4f-00a1-4a8c-a490-4a009a71bc97"}}}
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.193Z#011INFO#011[beat]#011instance/beat.go:1005#011Build info#011{"system_info": {"build": {"commit": "08e20483a651ea5ad60115f68ff0e53e6360573a", "libbeat": "7.12.0", "time": "2021-03-18T06:28:33.000Z", "version": "7.12.0"}}}
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.193Z#011INFO#011[beat]#011instance/beat.go:1008#011Go runtime info#011{"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":8,"version":"go1.15.8"}}}
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.193Z#011INFO#011[beat]#011instance/beat.go:1012#011Host info#011{"system_info": {"host": {"architecture":"x86_64","boot_time":"2021-06-10T09:34:43Z","containerized":false,"name":"[redacted]","ip":["127.0.0.1/8","::1/128","[redacted]/24","[redacted]/64"],"kernel_version":"4.4.0-1128-aws","mac":["0a:9c:b8:1d:dd:b4"],"os":{"type":"linux","family":"debian","platform":"ubuntu","name":"Ubuntu","version":"16.04.6 LTS (Xenial Xerus)","major":16,"minor":4,"patch":6,"codename":"xenial"},"timezone":"UTC","timezone_offset_sec":0,"id":"ec2ad126334ee7d109c14749d21476ff"}}}
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.193Z#011INFO#011[add_cloud_metadata]#011add_cloud_metadata/add_cloud_metadata.go:105#011add_cloud_metadata: hosting provider type detected as aws, metadata={"account":{"id":"986771151983"},"availability_zone":"us-east-1b","image":{"id":"[redacted]"},"instance":{"id":"[redacted]"},"machine":{"type":"i3en.2xlarge"},"provider":"aws","region":"us-east-1"}
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.193Z#011INFO#011[beat]#011instance/beat.go:1041#011Process info#011{"system_info": {"process": {"capabilities": {"inheritable":null,"permitted":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"effective":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"ambient":null}, "cwd": "/", "exe": "/usr/share/metricbeat/bin/metricbeat", "name": "metricbeat", "pid": 31709, "ppid": 1, "seccomp": {"mode":"filter"}, "start_time": "2021-06-14T13:31:54.380Z"}}}
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.194Z#011INFO#011instance/beat.go:304#011Setup Beat: metricbeat; Version: 7.12.0
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.194Z#011INFO#011[index-management]#011idxmgmt/std.go:184#011Set output.elasticsearch.index to 'metricbeat-7.12.0' as ILM is enabled.
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.194Z#011WARN#011[cfgwarn]#011tlscommon/config.go:101#011DEPRECATED: Treating the CommonName field on X.509 certificates as a host name when no Subject Alternative Names are present is going to be removed. Please update your certificates if needed. Will be removed in version: 8.0.0
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.194Z#011INFO#011eslegclient/connection.go:99#011elasticsearch url: http://[redacted].compute-1.amazonaws.com:9200
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.194Z#011WARN#011[tls]#011tlscommon/tls_config.go:98#011SSL/TLS verifications disabled.
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.194Z#011INFO#011[publisher]#011pipeline/module.go:113#011Beat name: [redacted]
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.206Z#011INFO#011[monitoring]#011log/log.go:117#011Starting metrics logging every 30s
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.206Z#011INFO#011instance/beat.go:468#011metricbeat start running.
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.208Z#011INFO#011filesystem/filesystem.go:57#011Ignoring filesystem types: sysfs, rootfs, ramfs, bdev, proc, cpuset, cgroup, tmpfs, devtmpfs, debugfs, tracefs, securityfs, sockfs, bpf, pipefs, devpts, hugetlbfs, ecryptfs, fuse, fusectl, pstore, mqueue, autofs
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.208Z#011INFO#011[system.fsstat]#011fsstat/fsstat.go:57#011Ignoring filesystem types: %ssysfs, rootfs, ramfs, bdev, proc, cpuset, cgroup, tmpfs, devtmpfs, debugfs, tracefs, securityfs, sockfs, bpf, pipefs, devpts, hugetlbfs, ecryptfs, fuse, fusectl, pstore, mqueue, autofs
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.208Z#011INFO#011cfgfile/reload.go:164#011Config reloader started
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.209Z#011INFO#011filesystem/filesystem.go:57#011Ignoring filesystem types: sysfs, rootfs, ramfs, bdev, proc, cpuset, cgroup, tmpfs, devtmpfs, debugfs, tracefs, securityfs, sockfs, bpf, pipefs, devpts, hugetlbfs, ecryptfs, fuse, fusectl, pstore, mqueue, autofs
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.209Z#011INFO#011[system.fsstat]#011fsstat/fsstat.go:57#011Ignoring filesystem types: %ssysfs, rootfs, ramfs, bdev, proc, cpuset, cgroup, tmpfs, devtmpfs, debugfs, tracefs, securityfs, sockfs, bpf, pipefs, devpts, hugetlbfs, ecryptfs, fuse, fusectl, pstore, mqueue, autofs
Jun 14 13:31:55 [redacted] metricbeat[31709]: 2021-06-14T13:31:55.212Z#011INFO#011cfgfile/reload.go:224#011Loading of config files completed.
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.210Z#011INFO#011[publisher]#011pipeline/retry.go:219#011retryer: send unwait signal to consumer
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.210Z#011INFO#011[publisher]#011pipeline/retry.go:223#011  done
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.210Z#011INFO#011[publisher_pipeline_output]#011pipeline/output.go:143#011Connecting to backoff(elasticsearch(http://[redacted].compute-1.amazonaws.com:9200))
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.212Z#011INFO#011[esclientleg]#011eslegclient/connection.go:314#011Attempting to connect to Elasticsearch version 7.12.0
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.226Z#011INFO#011[license]#011licenser/es_callback.go:51#011Elasticsearch license: Basic
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.227Z#011INFO#011[esclientleg]#011eslegclient/connection.go:314#011Attempting to connect to Elasticsearch version 7.12.0
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.244Z#011INFO#011[index-management]#011idxmgmt/std.go:261#011Auto ILM enable success.
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.244Z#011INFO#011[index-management.ilm]#011ilm/std.go:139#011do not generate ilm policy: exists=true, overwrite=false
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.245Z#011INFO#011[index-management]#011idxmgmt/std.go:274#011ILM policy successfully loaded.
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.245Z#011INFO#011[index-management]#011idxmgmt/std.go:407#011Set setup.template.name to '{metricbeat-7.12.0 {now/d}-000001}' as ILM is enabled.
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.245Z#011INFO#011[index-management]#011idxmgmt/std.go:412#011Set setup.template.pattern to 'metricbeat-7.12.0-*' as ILM is enabled.
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.245Z#011INFO#011[index-management]#011idxmgmt/std.go:446#011Set settings.index.lifecycle.rollover_alias in template to {metricbeat-7.12.0 {now/d}-000001} as ILM is enabled.
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.245Z#011INFO#011[index-management]#011idxmgmt/std.go:450#011Set settings.index.lifecycle.name in template to {metricbeat {"policy":{"phases":{"hot":{"actions":{"rollover":{"max_age":"30d","max_size":"50gb"}}}}}}} as ILM is enabled.
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.245Z#011INFO#011template/load.go:97#011Template metricbeat-7.12.0 already exists and will not be overwritten.
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.245Z#011INFO#011[index-management]#011idxmgmt/std.go:298#011Loaded index template.
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.246Z#011INFO#011[index-management]#011idxmgmt/std.go:309#011Write alias successfully generated.
Jun 14 13:31:56 [redacted] metricbeat[31709]: 2021-06-14T13:31:56.246Z#011INFO#011[publisher_pipeline_output]#011pipeline/output.go:151#011Connection to backoff(elasticsearch(http://[redacted].compute-1.amazonaws.com:9200)) established

syslog then goes on repeating the 30s "Non-zero metrics" message:

Jun 14 13:32:25 [redacted] metricbeat[31709]: 2021-06-14T13:32:25.208Z#011INFO#011[monitoring]#011log/log.go:144#011Non-zero metrics in the last 30s#011{"monitoring": {"metrics": {"beat":{"cgroup":{"cpu":{"cfs":{"period":{"us":100000}},"id":"metricbeat.service"},"cpuacct":{"id":"metricbeat.service","total":{"ns":545299499}},"memory":{"id":"metricbeat.service","mem":{"limit":{"bytes":9223372036854771712},"usage":{"bytes":26697728}}}},"cpu":{"system":{"ticks":210,"time":{"ms":212}},"total":{"ticks":540,"time":{"ms":544},"value":540},"user":{"ticks":330,"time":{"ms":332}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":18},"info":{"ephemeral_id":"067ec6a4-3791-4ba8-a42a-31256c556e13","uptime":{"ms":30066}},"memstats":{"gc_next":17950640,"memory_alloc":10263840,"memory_sys":76104704,"memory_total":91308232,"rss":86630400},"runtime":{"goroutines":115}},"libbeat":{"config":{"module":{"running":4,"starts":4},"reloads":1,"scans":1},"output":{"events":{"acked":56,"active":0,"batches":3,"total":56},"read":{"bytes":4471},"type":"elasticsearch","write":{"bytes":149519}},"pipeline":{"clients":19,"events":{"active":0,"filtered":6,"published":56,"retry":24,"total":62},"queue":{"acked":56}}},"metricbeat":{"elasticsearch":{"node_stats":{"events":3,"success":3}},"system":{"cpu":{"events":3,"success":3},"filesystem":{"events":10,"success":10},"fsstat":{"events":1,"success":1},"load":{"events":3,"success":3},"memory":{"events":3,"success":3},"network":{"events":8,"success":8},"process":{"events":24,"success":24},"process_summary":{"events":3,"success":3},"socket_summary":{"events":3,"success":3},"uptime":{"events":1,"success":1}}},"system":{"cpu":{"cores":8},"load":{"1":0.22,"15":0.66,"5":0.41,"norm":{"1":0.0275,"15":0.0825,"5":0.0513}}}}}}
Jun 14 13:32:55 [redacted] metricbeat[31709]: 2021-06-14T13:32:55.208Z#011INFO#011[monitoring]#011log/log.go:144#011Non-zero metrics in the last 30s#011{"monitoring": {"metrics": {"beat":{"cgroup":{"cpuacct":{"total":{"ns":342747807}},"memory":{"mem":{"usage":{"bytes":2588672}}}},"cpu":{"system":{"ticks":380,"time":{"ms":168}},"total":{"ticks":880,"time":{"ms":344},"value":880},"user":{"ticks":500,"time":{"ms":176}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":19},"info":{"ephemeral_id":"067ec6a4-3791-4ba8-a42a-31256c556e13","uptime":{"ms":60066}},"memstats":{"gc_next":21676000,"memory_alloc":17263272,"memory_sys":262144,"memory_total":152677672,"rss":87904256},"runtime":{"goroutines":115}},"libbeat":{"config":{"module":{"running":4}},"output":{"events":{"acked":48,"active":0,"batches":3,"total":48},"read":{"bytes":1441},"write":{"bytes":139085}},"pipeline":{"clients":19,"events":{"active":0,"published":48,"total":48},"queue":{"acked":48}}},"metricbeat":{"elasticsearch":{"node_stats":{"events":3,"success":3}},"system":{"cpu":{"events":3,"success":3},"load":{"events":3,"success":3},"memory":{"events":3,"success":3},"network":{"events":9,"success":9},"process":{"events":21,"success":21},"process_summary":{"events":3,"success":3},"socket_summary":{"events":3,"success":3}}},"system":{"load":{"1":0.13,"15":0.63,"5":0.37,"norm":{"1":0.0163,"15":0.0788,"5":0.0463}}}}}}
Jun 14 13:33:25 [redacted] metricbeat[31709]: 2021-06-14T13:33:25.208Z#011INFO#011[monitoring]#011log/log.go:144#011Non-zero metrics in the last 30s#011{"monitoring": {"metrics": {"beat":{"cgroup":{"cpuacct":{"total":{"ns":369604905}},"memory":{"mem":{"usage":{"bytes":-393216}}}},"cpu":{"system":{"ticks":560,"time":{"ms":188}},"total":{"ticks":1240,"time":{"ms":368},"value":1240},"user":{"ticks":680,"time":{"ms":180}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":18},"info":{"ephemeral_id":"067ec6a4-3791-4ba8-a42a-31256c556e13","uptime":{"ms":90067}},"memstats":{"gc_next":18318544,"memory_alloc":11874128,"memory_total":213928088,"rss":87187456},"runtime":{"goroutines":115}},"libbeat":{"config":{"module":{"running":4}},"output":{"events":{"acked":52,"active":0,"batches":3,"total":52},"read":{"bytes":1463},"write":{"bytes":142314}},"pipeline":{"clients":19,"events":{"active":0,"filtered":6,"published":52,"total":58},"queue":{"acked":52}}},"metricbeat":{"elasticsearch":{"node_stats":{"events":3,"success":3}},"system":{"cpu":{"events":3,"success":3},"filesystem":{"events":10,"success":10},"fsstat":{"events":1,"success":1},"load":{"events":3,"success":3},"memory":{"events":3,"success":3},"network":{"events":9,"success":9},"process":{"events":20,"success":20},"process_summary":{"events":3,"success":3},"socket_summary":{"events":3,"success":3}}},"system":{"load":{"1":0.08,"15":0.61,"5":0.34,"norm":{"1":0.01,"15":0.0763,"5":0.0425}}}}}}

The elasticsearch-xpack.yml config is vanilla:

# Module: elasticsearch
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/7.9/metricbeat-module-elasticsearch.html

- module: elasticsearch
  xpack.enabled: true
  period: 10s
  hosts: ["http://localhost:9200"]
  #username: "user"
  #password: "secret"

The main metricbeat.yml file looks like this:


metricbeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

setup.template.settings:
  index.number_of_shards: 1
  index.number_of_replicas: 0
  index.codec: best_compression

setup.kibana:
  host: "[redacted]:5601"
  ssl.verification_mode: none
  protocol: "http"

output.elasticsearch:
  hosts: ["[redacted]:9200"]
  ssl.verification_mode: none
  protocol: "http"

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~
  - add_fields:
      target: node
      fields:
        name: "es-live-1"
        display_name: "[redacted]:es-live-1"
        public_hostname: "[redacted]"
        cluster_name: "cluster-live"

With this setup, I've had no documents with type: shards in any .monitoring-es-7-mb-* index for hours.
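
A count query along these lines can be used to double-check that (a rough sketch; the type and timestamp field names are assumed from the monitoring document mapping, and <monitoring-host> is a placeholder for the monitoring cluster endpoint):

# count shard documents that arrived on the monitoring cluster in the last hour
curl -XGET 'http://<monitoring-host>:9200/.monitoring-es-7-mb-*/_count?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "shards" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}'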

8 hours later. Still no shard data being shipped.

I'm going to raise it as a bug. I've followed all the documentation, and there's no documented reason why it shouldn't work.

Everything looks OK. Let's go through some troubleshooting steps (there's also a debug-logging sketch after the list):

  • If you are not using SSL, I think you can remove the mentions of the verification mode and the protocol. The TLS WARN messages you have are unexpected.
  • What are your Metricbeat and Elasticsearch versions? In the logs I can see 7.12, but in the config YAML I see 7.9.
  • Check the enabled modules with metricbeat modules list. I'm almost sure it's not this, but check just in case.
  • Check connectivity to your Elasticsearch machine by querying the same endpoint the shard metricset fetches: curl -XGET localhost:9200/_cluster/state/version,nodes,master_node,routing_table?pretty=true
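
If none of that turns anything up, running Metricbeat in the foreground with debug logging will show what each metricset fetches and publishes. A minimal sketch, assuming a systemd install; the -d "*" selector is very verbose, so only run it briefly:

sudo systemctl stop metricbeat                               # avoid two instances sharing the data path
sudo metricbeat -e -d "*" -c /etc/metricbeat/metricbeat.yml  # foreground, log to stderr, all debug selectors
# watch the output for the elasticsearch module and the shard metricset, then Ctrl-C and restart the service:
sudo systemctl start metricbeat
# (to keep debug logging on the service instead, set logging.level: debug in metricbeat.yml and restart)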

Regards!

Thanks for your offer of help! Answers below:

  • If you are not using SSL, I think you can remove the mentions of the verification mode and the protocol. The TLS WARN messages you have are unexpected.
    I've removed that now; after a restart the warning has gone.
  • What are your Metricbeat and Elasticsearch versions? In the logs I can see 7.12, but in the config YAML I see 7.9.
    It's 7.12.0, but that config file references 7.9 in the docs URL. I don't know why; possibly I had an older version installed on the machine I took it from and then upgraded. That's not the case for all machines in my cluster, though. On a new node rolled in yesterday the config is:
# Module: elasticsearch
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/7.x/metricbeat-module-elasticsearch.html

- module: elasticsearch
  xpack.enabled: true
  period: 10s
  hosts: ["http://localhost:9200"]
  #username: "user"
  #password: "secret"
  metricsets:
    - node
    - node_stats
    - cluster_stats
    - index
    - index_recovery
    - shard
    - index_summary
    - pending_tasks
  • Check the enabled modules with metricbeat modules list. I'm almost sure it's not this, but check just in case.
Enabled:
elasticsearch-xpack
system

Disabled:
...everything else!
  • Check connectivity to your Elasticsearch machine by querying the same endpoint the shard metricset fetches: curl -XGET localhost:9200/_cluster/state/version,nodes,master_node,routing_table?pretty=true
    The response comes back no problem, as expected.

This is so annoying. I can't switch over to using this until I know that metrics shipping is reliable.

I had some sporadic shard monitoring data come into my monitoring cluster over the past couple of nights, and it correlates with status changes.
In other words: I only got updated shard data when a shard changed status.
So now I'm wondering: is this a feature of the shard metricset?
It's not clear in the documentation at all, and unfortunately it's incompatible with my Grafana dashboards. But it seems to be the most likely explanation so far...

Can anyone confirm this?

Yeah, I can confirm that's a feature. The reason is this: imagine a big cluster, on the order of 4,000 shards. Those shards are mostly static once they reach their "destination" node. So if your period were 30s, Metricbeat would write 4,000 new documents to Elasticsearch every 30 seconds with no new data in them.

So the behaviour for shards in particular is that if the data hasn't changed, the existing 4,000 documents just stay as they are, saving you storage space.
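
To put rough numbers on it: at a 30s period that is 86,400 / 30 = 2,880 collection cycles per day, so 2,880 × 4,000 shards ≈ 11.5 million near-identical shard documents per day if every cycle re-wrote them.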

I forgot about this completely. Sorry for not mentioning it before.


Thanks for confirming! That makes total sense.
My problem now is that I'm trying to get a current shard count per node, like Stack Monitoring shows, but I'm unsure what queries to run against my monitoring data to do this.

If you can help with that, I'd really appreciate it!
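
One possible approach (an untested sketch: the field names type, state_uuid, and shard.node are assumptions based on the monitoring document format, so check them against your actual documents, and <monitoring-host> / <latest-state-uuid> are placeholders): take the shards documents belonging to the most recent cluster state and aggregate them by node.

# 1) find the most recent state_uuid seen in the shards documents
curl -XGET 'http://<monitoring-host>:9200/.monitoring-es-7-mb-*/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 1,
  "query": { "term": { "type": "shards" } },
  "sort": [ { "timestamp": "desc" } ],
  "_source": [ "state_uuid" ]
}'

# 2) count shards per node for that cluster state
curl -XGET 'http://<monitoring-host>:9200/.monitoring-es-7-mb-*/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "shards" } },
        { "term": { "state_uuid": "<latest-state-uuid>" } }
      ]
    }
  },
  "aggs": {
    "shards_per_node": {
      "terms": { "field": "shard.node", "size": 100 }
    }
  }
}'

The bucket keys will most likely be node UUIDs rather than names, so they may need to be mapped back to node names via the node_stats documents (again, an assumed field layout).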

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.