Hi.
I have set up the ELK stack on one server, and Metricbeat is installed on the same server. Both ELK and Metricbeat are version 8.6.1.
When we send metrics to Metricbeat, they show up in Kibana for only about a minute and then stop altogether. We would like to check with you why it is behaving this way.
We are using Prometheus remote_write on a node to push the metrics to Metricbeat on the server, defined as follows:
remote_write:
  - url: http://10.127.3.41:9201/write
    remote_timeout: 30s
    queue_config:
      capacity: 20000
      max_shards: 30
      max_samples_per_send: 10000
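As a basic sanity check of this endpoint from the node, something like the command below can be used (a hypothetical check; an empty POST is not a valid remote_write payload, since remote_write sends snappy-compressed protobuf, so only whether the listener answers at all is meaningful):
# connectivity check only - the response to an empty POST is not meaningful
curl -sv -X POST http://10.127.3.41:9201/write -o /dev/null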
This is the metricbeat.yml
[root@r620-HMM48Y1 metricbeat-8.6.1-linux-x86_64]# cat metricbeat.yml
###################### Metricbeat Configuration Example #######################
# This file is an example configuration file highlighting only the most common
# options. The metricbeat.reference.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/metricbeat/index.html
# =========================== Modules configuration ============================
metricbeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml
  # Set to true to enable config reloading
  reload.enabled: false
  # Period on which files under path should be checked for changes
  #reload.period: 10s
# ======================= Elasticsearch template setting =======================
setup.template.settings:
  index.number_of_shards: 100
  index.codec: best_compression
  index.mapping.total_fields.limit: 50000
  #_source.enabled: false
# ====================== Index Lifecycle Management (ILM) ======================
# Configure index lifecycle management (ILM) to manage the backing indices
# of your data streams.
# Enable ILM support. Valid values are true, false.
setup.ilm.enabled: true
# Set the lifecycle policy name. The default policy name is
# 'beatname'.
#setup.ilm.policy_name: "mypolicy"
# The path to a JSON file that contains a lifecycle policy configuration. Used
# to load your own lifecycle policy.
#setup.ilm.policy_file:
# Disable the check for an existing lifecycle policy. The default is true. If
# you disable this check, set setup.ilm.overwrite: true so the lifecycle policy
# can be installed.
#setup.ilm.check_exists: true
# Overwrite the lifecycle policy at startup. The default is false.
setup.ilm.overwrite: true
# ================================== General ===================================
# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
#name:
# The tags of the shipper are included in their own field with each
# transaction published.
#tags: ["service-X", "web-tier"]
# Optional fields that you can specify to add additional information to the
# output.
#fields:
# env: staging
# ================================= Dashboards =================================
# These settings control loading the sample dashboards to the Kibana index. Loading
# the dashboards is disabled by default and can be enabled either by setting the
# options here or by using the `setup` command.
#setup.dashboards.enabled: false
# The URL from where to download the dashboards archive. By default this URL
# has a value which is computed based on the Beat name and version. For released
# versions, this URL points to the dashboard archive on the artifacts.elastic.co
# website.
#setup.dashboards.url:
# =================================== Kibana ===================================
# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.
setup.kibana:
  # Kibana Host
  # Scheme and port can be left out and will be set to the default (http and 5601)
  # In case you specify an additional path, the scheme is required: http://localhost:5601/path
  # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601
  host: "localhost:5601"
  #protocol: http
  #username: elastic
  #password: changeme
  #protocol: "http"
  #username: "elastic"
  #password: "changeme"
  # Kibana Space ID
  # ID of the Kibana Space into which the dashboards should be loaded. By default,
  # the Default Space will be used.
  #space.id:
# =============================== Elastic Cloud ================================
# These settings simplify using Metricbeat with the Elastic Cloud (https://cloud.elastic.co/).
# The cloud.id setting overwrites the `output.elasticsearch.hosts` and
# `setup.kibana.host` options.
# You can find the `cloud.id` in the Elastic Cloud web UI.
#cloud.id:
# The cloud.auth setting overwrites the `output.elasticsearch.username` and
# `output.elasticsearch.password` settings. The format is `<user>:<pass>`.
#cloud.auth:
# ================================== Outputs ===================================
# Configure what output to use when sending the data collected by the beat.
# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  #hosts: ["localhost:9200"]
  hosts: ["0.0.0.0:9200"]
  # Protocol - either `http` (default) or `https`.
  #protocol: "https"
  # Authentication credentials - either API key or username/password.
  #api_key: "id:api_key"
  username: "elastic"
  password: "changeme"
# ------------------------------ Logstash Output -------------------------------
#output.logstash:
# The Logstash hosts
#hosts: ["localhost:5044"]
# Optional SSL. By default is off.
# List of root certificates for HTTPS server verifications
#ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]
# Certificate for SSL client authentication
#ssl.certificate: "/etc/pki/client/cert.pem"
# Client Certificate Key
#ssl.key: "/etc/pki/client/cert.key"
# ================================= Processors =================================
# Configure processors to enhance or manipulate events generated by the beat.
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~
# ================================== Logging ===================================
# Sets log level. The default log level is info.
# Available log levels are: error, warning, info, debug
#logging.level: debug
# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publisher", "service".
#logging.selectors: ["*"]
# ============================= X-Pack Monitoring ==============================
# Metricbeat can export internal metrics to a central Elasticsearch monitoring
# cluster. This requires xpack monitoring to be enabled in Elasticsearch. The
# reporting is disabled by default.
# Set to true to enable the monitoring reporter.
#monitoring.enabled: false
# Sets the UUID of the Elasticsearch cluster under which monitoring data for this
# Metricbeat instance will appear in the Stack Monitoring UI. If output.elasticsearch
# is enabled, the UUID is derived from the Elasticsearch cluster referenced by output.elasticsearch.
#monitoring.cluster_uuid:
# Uncomment to send the metrics to Elasticsearch. Most settings from the
# Elasticsearch output are accepted here as well.
# Note that the settings should point to your Elasticsearch *monitoring* cluster.
# Any setting that is not set is automatically inherited from the Elasticsearch
# output configuration, so if you have the Elasticsearch output configured such
# that it is pointing to your Elasticsearch monitoring cluster, you can simply
# uncomment the following line.
monitoring.elasticsearch:
# ============================== Instrumentation ===============================
# Instrumentation support for the metricbeat.
#instrumentation:
# Set to true to enable instrumentation of metricbeat.
#enabled: false
# Environment in which metricbeat is running on (eg: staging, production, etc.)
#environment: ""
# APM Server hosts to report instrumentation results to.
#hosts:
# - http://localhost:8200
# API Key for the APM Server(s).
# If api_key is set then secret_token will be ignored.
#api_key:
# Secret token for the APM Server(s).
#secret_token:
# ================================= Migration ==================================
# This allows to enable 6.7 migration aliases
#migration.6_to_7.enabled: true
http:
  enabled: true
  host: 0.0.0.0
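Since http.enabled is set above, Metricbeat's own stats endpoint can be queried to see whether events are making it to the output (assuming the default http.port of 5066, which we have not changed; the libbeat.output.events counters in this output show whether events reach Elasticsearch):
curl -s 'http://localhost:5066/stats?pretty'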
Metricbeat modules directory:
[root@r620-HMM48Y1 metricbeat-8.6.1-linux-x86_64]# ll modules.d/ | grep -v disabled
total 276
-rw-r--r--. 1 root root 2248 Feb 9 12:15 prometheus.yml
[root@r620-HMM48Y1 metricbeat-8.6.1-linux-x86_64]#
modules.d/prometheus.yml
[root@r620-HMM48Y1 modules.d]# cat prometheus.yml
# Module: prometheus
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/main/metricbeat-module-prometheus.html
#- module: prometheus
# period: 10s
# hosts: ["localhost:9090"]
# metrics_path: /metrics
#metrics_filters:
# include: []
# exclude: []
#username: "user"
#password: "secret"
# This can be used for service account based authorization:
#bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
#ssl.certificate_authorities:
# - /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
# Use Elasticsearch histogram type to store histograms (beta, default: false)
# This will change the default layout and put metric type in the field name
#use_types: true
# Store counter rates instead of original cumulative counters (experimental, default: false)
#rate_counters: true
# Metrics sent by a Prometheus server using remote_write option
- module: prometheus
  metricsets: ["remote_write"]
  host: "0.0.0.0"
  port: "9201"
# Secure settings for the server using TLS/SSL:
#ssl.certificate: "/etc/pki/server/cert.pem"
#ssl.key: "/etc/pki/server/cert.key"
# Use Elasticsearch histogram type to store histograms (beta, default: false)
# This will change the default layout and put metric type in the field name
#use_types: true
# Store counter rates instead of original cumulative counters (experimental, default: false)
#rate_counters: true
# Define patterns for counter and histogram types so as to identify metrics' types according to these patterns
#types_patterns:
# counter_patterns: []
# histogram_patterns: []
# Metrics that will be collected using a PromQL
#- module: prometheus
# metricsets: ["query"]
# hosts: ["localhost:9090"]
# period: 10s
# queries:
# - name: "instant_vector"
# path: "/api/v1/query"
# params:
# query: "sum(rate(prometheus_http_requests_total[1m]))"
# - name: "range_vector"
# path: "/api/v1/query_range"
# params:
# query: "up"
# start: "2019-12-20T00:00:00.000Z"
# end: "2019-12-21T00:00:00.000Z"
# step: 1h
# - name: "scalar"
# path: "/api/v1/query"
# params:
# query: "100"
# - name: "string"
# path: "/api/v1/query"
# params:
# query: "some_value"
[root@r620-HMM48Y1 modules.d]#
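For reference, the remote_write listener can be confirmed on the server with a quick socket check (a sketch, assuming ss from iproute2 is available):
# metricbeat should show up as the process listening on 9201
ss -ltnp | grep 9201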
Errors seen on the Prometheus server:
2023-02-15T13:46:05.488571695Z ts=2023-02-15T13:46:05.488Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Starting WAL watcher" queue=c6a288
2023-02-15T13:46:05.488571695Z ts=2023-02-15T13:46:05.488Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Starting scraped metadata watcher"
2023-02-15T13:46:05.488859474Z ts=2023-02-15T13:46:05.488Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Replaying WAL" queue=c6a288
2023-02-15T13:46:13.110398450Z ts=2023-02-15T13:46:13.110Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Done replaying WAL" duration=7.621550507s
2023-02-15T13:46:25.489716742Z ts=2023-02-15T13:46:25.489Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Remote storage resharding" from=1 to=8
2023-02-15T13:46:55.489263739Z ts=2023-02-15T13:46:55.489Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Remote storage resharding" from=8 to=30
2023-02-15T13:47:55.490105217Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9944 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9853 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9904 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=10000 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9979 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9970 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9980 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9960 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490399665Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to flush all samples on shutdown" count=369590
2023-02-15T13:48:26.252035320Z ts=2023-02-15T13:48:26.251Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
2023-02-15T13:48:26.252219664Z ts=2023-02-15T13:48:26.251Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
2023-02-15T13:49:26.435092096Z ts=2023-02-15T13:49:26.434Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
2023-02-15T13:49:26.435355966Z ts=2023-02-15T13:49:26.434Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
2023-02-15T13:50:27.158318266Z ts=2023-02-15T13:50:27.158Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
2023-02-15T13:50:27.158535798Z ts=2023-02-15T13:50:27.158Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
Basically this error repeats indefinitely:
msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
From the Metricbeat logs:
{"log.level":"info","@timestamp":"2023-02-14T10:09:59.177-0500","log.logger":"monitoring","log.origin":{"file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpuacct":{"total":{"ns":29890531}},"memory":{"mem":{"usage":{"bytes":335745024}}}},"cpu":{"system":{"ticks":180,"time":{"ms":20}},"total":{"ticks":450,"time":{"ms":30},"value":450},"user":{"ticks":270,"time":{"ms":10}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":13},"info":{"ephemeral_id":"e4e3ddca-2c84-47d0-959b-e6cc8cdd8ad2","uptime":{"ms":63293},"version":"8.6.1"},"memstats":{"gc_next":25258616,"memory_alloc":20286384,"memory_sys":4194304,"memory_total":42085160,"rss":104591360},"runtime":{"goroutines":35}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0.12,"15":0.05,"5":0.05,"norm":{"1":0.0038,"15":0.0016,"5":0.0016}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2023-02-14T10:10:29.178-0500","log.logger":"monitoring","log.origin":{"file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpuacct":{"total":{"ns":268288423}},"memory":{"mem":{"usage":{"bytes":337440768}}}},"cpu":{"system":{"ticks":190,"time":{"ms":10}},"total":{"ticks":470,"time":{"ms":20},"value":470},"user":{"ticks":280,"time":{"ms":10}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":13},"info":{"ephemeral_id":"e4e3ddca-2c84-47d0-959b-e6cc8cdd8ad2","uptime":{"ms":93295},"version":"8.6.1"},"memstats":{"gc_next":25258616,"memory_alloc":22072976,"memory_total":43871752,"rss":106127360},"runtime":{"goroutines":35}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0.07,"15":0.05,"5":0.05,"norm":{"1":0.0022,"15":0.0016,"5":0.0016}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2023-02-14T10:10:59.179-0500","log.logger":"monitoring","log.origin":{"file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpuacct":{"total":{"ns":44731643}},"memory":{"mem":{"usage":{"bytes":339222528}}}},"cpu":{"system":{"ticks":210,"time":{"ms":20}},"total":{"ticks":510,"time":{"ms":40},"value":510},"user":{"ticks":300,"time":{"ms":20}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":13},"info":{"ephemeral_id":"e4e3ddca-2c84-47d0-959b-e6cc8cdd8ad2","uptime":{"ms":123296},"version":"8.6.1"},"memstats":{"gc_next":25258616,"memory_alloc":23070192,"memory_total":44868968,"rss":107188224},"runtime":{"goroutines":35}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0.04,"15":0.05,"5":0.04,"norm":{"1":0.0013,"15":0.0016,"5":0.0013}}}},"ecs.version":"1.6.0"}}
As a note, we replaced the ELK stack with VictoriaMetrics on the same server that hosts the ELK stack, and the issue described here does not occur: all metrics are stored in VictoriaMetrics. This was only a test, though; we don't want to use VictoriaMetrics, and we would like to fix the situation with ELK and Metricbeat.
Versions:
ELK = 8.6.1
Metricbeat = 8.6.1
Prometheus = 2.32.1
Any advice or hints you may have would be greatly appreciated.