Prometheus remote_write to metricbeat shows up in Kibana for one minute, then stops working

Hi.
I have set up the ELK stack on one server, and Metricbeat is installed on the same server. Both the ELK stack and Metricbeat are version 8.6.1.
When we send metrics to Metricbeat, they show up in Kibana for only about a minute and then stop altogether. We would like to understand why it behaves this way.
We are using Prometheus remote_write on a node to push the metrics to Metricbeat on the server; it is defined as follows:

remote_write:
- url: http://10.127.3.41:9201/write
  remote_timeout: 30s
  queue_config:
    capacity: 20000
    max_shards: 30
    max_samples_per_send: 10000

This is the metricbeat.yml

[root@r620-HMM48Y1 metricbeat-8.6.1-linux-x86_64]# cat metricbeat.yml
###################### Metricbeat Configuration Example #######################

# This file is an example configuration file highlighting only the most common
# options. The metricbeat.reference.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/metricbeat/index.html

# =========================== Modules configuration ============================

metricbeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml

  # Set to true to enable config reloading
  reload.enabled: false

  # Period on which files under path should be checked for changes
  #reload.period: 10s

# ======================= Elasticsearch template setting =======================

setup.template.settings:
  index.number_of_shards: 100
  index.codec: best_compression
  index.mapping.total_fields.limit: 50000
  #_source.enabled: false

# ====================== Index Lifecycle Management (ILM) ======================

# Configure index lifecycle management (ILM) to manage the backing indices
# of your data streams.

# Enable ILM support. Valid values are true, false.
setup.ilm.enabled: true

# Set the lifecycle policy name. The default policy name is
# 'beatname'.
#setup.ilm.policy_name: "mypolicy"

# The path to a JSON file that contains a lifecycle policy configuration. Used
# to load your own lifecycle policy.
#setup.ilm.policy_file:

# Disable the check for an existing lifecycle policy. The default is true. If
# you disable this check, set setup.ilm.overwrite: true so the lifecycle policy
# can be installed.
#setup.ilm.check_exists: true

# Overwrite the lifecycle policy at startup. The default is false.
setup.ilm.overwrite: true
# ================================== General ===================================

# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
#name:

# The tags of the shipper are included in their own field with each
# transaction published.
#tags: ["service-X", "web-tier"]

# Optional fields that you can specify to add additional information to the
# output.
#fields:
#  env: staging

# ================================= Dashboards =================================
# These settings control loading the sample dashboards to the Kibana index. Loading
# the dashboards is disabled by default and can be enabled either by setting the
# options here or by using the `setup` command.
#setup.dashboards.enabled: false

# The URL from where to download the dashboards archive. By default this URL
# has a value which is computed based on the Beat name and version. For released
# versions, this URL points to the dashboard archive on the artifacts.elastic.co
# website.
#setup.dashboards.url:

# =================================== Kibana ===================================

# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.
setup.kibana:

  # Kibana Host
  # Scheme and port can be left out and will be set to the default (http and 5601)
  # In case you specify an additional path, the scheme is required: http://localhost:5601/path
  # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601
  host: "localhost:5601"

  #protocol: http
  #username: elastic
  #password: changeme
  #protocol: "http"
  #username: "elastic"
  #password: "changeme"

  # Kibana Space ID
  # ID of the Kibana Space into which the dashboards should be loaded. By default,
  # the Default Space will be used.
  #space.id:

# =============================== Elastic Cloud ================================

# These settings simplify using Metricbeat with the Elastic Cloud (https://cloud.elastic.co/).

# The cloud.id setting overwrites the `output.elasticsearch.hosts` and
# `setup.kibana.host` options.
# You can find the `cloud.id` in the Elastic Cloud web UI.
#cloud.id:

# The cloud.auth setting overwrites the `output.elasticsearch.username` and
# `output.elasticsearch.password` settings. The format is `<user>:<pass>`.
#cloud.auth:

# ================================== Outputs ===================================

# Configure what output to use when sending the data collected by the beat.

# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  #hosts: ["localhost:9200"]
  hosts: ["0.0.0.0:9200"]

  # Protocol - either `http` (default) or `https`.
  #protocol: "https"

  # Authentication credentials - either API key or username/password.
  #api_key: "id:api_key"
  username: "elastic"
  password: "changeme"

# ------------------------------ Logstash Output -------------------------------
#output.logstash:
  # The Logstash hosts
  #hosts: ["localhost:5044"]

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"

# ================================= Processors =================================

# Configure processors to enhance or manipulate events generated by the beat.

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~


# ================================== Logging ===================================

# Sets log level. The default log level is info.
# Available log levels are: error, warning, info, debug
#logging.level: debug

# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publisher", "service".
#logging.selectors: ["*"]

# ============================= X-Pack Monitoring ==============================
# Metricbeat can export internal metrics to a central Elasticsearch monitoring
# cluster.  This requires xpack monitoring to be enabled in Elasticsearch.  The
# reporting is disabled by default.

# Set to true to enable the monitoring reporter.
#monitoring.enabled: false

# Sets the UUID of the Elasticsearch cluster under which monitoring data for this
# Metricbeat instance will appear in the Stack Monitoring UI. If output.elasticsearch
# is enabled, the UUID is derived from the Elasticsearch cluster referenced by output.elasticsearch.
#monitoring.cluster_uuid:

# Uncomment to send the metrics to Elasticsearch. Most settings from the
# Elasticsearch output are accepted here as well.
# Note that the settings should point to your Elasticsearch *monitoring* cluster.
# Any setting that is not set is automatically inherited from the Elasticsearch
# output configuration, so if you have the Elasticsearch output configured such
# that it is pointing to your Elasticsearch monitoring cluster, you can simply
# uncomment the following line.
monitoring.elasticsearch:

# ============================== Instrumentation ===============================

# Instrumentation support for the metricbeat.
#instrumentation:
    # Set to true to enable instrumentation of metricbeat.
    #enabled: false

    # Environment in which metricbeat is running on (eg: staging, production, etc.)
    #environment: ""

    # APM Server hosts to report instrumentation results to.
    #hosts:
    #  - http://localhost:8200

    # API Key for the APM Server(s).
    # If api_key is set then secret_token will be ignored.
    #api_key:

    # Secret token for the APM Server(s).
    #secret_token:


# ================================= Migration ==================================

# This allows to enable 6.7 migration aliases
#migration.6_to_7.enabled: true

http:
  enabled: true
  host: 0.0.0.0

The Metricbeat modules directory:

[root@r620-HMM48Y1 metricbeat-8.6.1-linux-x86_64]# ll modules.d/ | grep -v disabled
total 276
-rw-r--r--. 1 root root 2248 Feb  9 12:15 prometheus.yml
[root@r620-HMM48Y1 metricbeat-8.6.1-linux-x86_64]#

modules.d/prometheus.yml

[root@r620-HMM48Y1 modules.d]# cat prometheus.yml
# Module: prometheus
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/main/metricbeat-module-prometheus.html

#- module: prometheus
#  period: 10s
#  hosts: ["localhost:9090"]
#  metrics_path: /metrics
  #metrics_filters:
  #  include: []
  #  exclude: []
  #username: "user"
  #password: "secret"

  # This can be used for service account based authorization:
  #bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  #ssl.certificate_authorities:
  #  - /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt

  # Use Elasticsearch histogram type to store histograms (beta, default: false)
  # This will change the default layout and put metric type in the field name
  #use_types: true

  # Store counter rates instead of original cumulative counters (experimental, default: false)
  #rate_counters: true

# Metrics sent by a Prometheus server using remote_write option
- module: prometheus
  metricsets: ["remote_write"]
  host: "0.0.0.0"
  port: "9201"

  # Secure settings for the server using TLS/SSL:
  #ssl.certificate: "/etc/pki/server/cert.pem"
  #ssl.key: "/etc/pki/server/cert.key"

  # Use Elasticsearch histogram type to store histograms (beta, default: false)
  # This will change the default layout and put metric type in the field name
  #use_types: true

  # Store counter rates instead of original cumulative counters (experimental, default: false)
  #rate_counters: true

  # Define patterns for counter and histogram types so as to identify metrics' types according to these patterns
  #types_patterns:
  #  counter_patterns: []
  #  histogram_patterns: []

# Metrics that will be collected using a PromQL
#- module: prometheus
#  metricsets: ["query"]
#  hosts: ["localhost:9090"]
#  period: 10s
#  queries:
#  - name: "instant_vector"
#    path: "/api/v1/query"
#    params:
#      query: "sum(rate(prometheus_http_requests_total[1m]))"
#  - name: "range_vector"
#    path: "/api/v1/query_range"
#    params:
#      query: "up"
#      start: "2019-12-20T00:00:00.000Z"
#      end:  "2019-12-21T00:00:00.000Z"
#      step: 1h
#  - name: "scalar"
#    path: "/api/v1/query"
#    params:
#      query: "100"
#  - name: "string"
#    path: "/api/v1/query"
#    params:
#      query: "some_value"
[root@r620-HMM48Y1 modules.d]#

Errors seen from the Prometheus server:

2023-02-15T13:46:05.488571695Z ts=2023-02-15T13:46:05.488Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Starting WAL watcher" queue=c6a288
2023-02-15T13:46:05.488571695Z ts=2023-02-15T13:46:05.488Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Starting scraped metadata watcher"
2023-02-15T13:46:05.488859474Z ts=2023-02-15T13:46:05.488Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Replaying WAL" queue=c6a288
2023-02-15T13:46:13.110398450Z ts=2023-02-15T13:46:13.110Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Done replaying WAL" duration=7.621550507s
2023-02-15T13:46:25.489716742Z ts=2023-02-15T13:46:25.489Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Remote storage resharding" from=1 to=8
2023-02-15T13:46:55.489263739Z ts=2023-02-15T13:46:55.489Z caller=dedupe.go:112 component=remote level=info remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Remote storage resharding" from=8 to=30
2023-02-15T13:47:55.490105217Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9944 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9853 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9904 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=10000 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9979 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9970 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9980 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490259933Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="non-recoverable error" count=9960 exemplarCount=0 err="context canceled"
2023-02-15T13:47:55.490399665Z ts=2023-02-15T13:47:55.490Z caller=dedupe.go:112 component=remote level=error remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to flush all samples on shutdown" count=369590
2023-02-15T13:48:26.252035320Z ts=2023-02-15T13:48:26.251Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
2023-02-15T13:48:26.252219664Z ts=2023-02-15T13:48:26.251Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
2023-02-15T13:49:26.435092096Z ts=2023-02-15T13:49:26.434Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
2023-02-15T13:49:26.435355966Z ts=2023-02-15T13:49:26.434Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
2023-02-15T13:50:27.158318266Z ts=2023-02-15T13:50:27.158Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"
2023-02-15T13:50:27.158535798Z ts=2023-02-15T13:50:27.158Z caller=dedupe.go:112 component=remote level=warn remote_name=c6a288 url=http://10.127.3.41:9201/write msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"

Basically this error goes on indefinitely:

msg="Failed to send batch, retrying" err="Post \"http://10.127.3.41:9201/write\": context deadline exceeded"

From the Metricbeat logs:

{"log.level":"info","@timestamp":"2023-02-14T10:09:59.177-0500","log.logger":"monitoring","log.origin":{"file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpuacct":{"total":{"ns":29890531}},"memory":{"mem":{"usage":{"bytes":335745024}}}},"cpu":{"system":{"ticks":180,"time":{"ms":20}},"total":{"ticks":450,"time":{"ms":30},"value":450},"user":{"ticks":270,"time":{"ms":10}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":13},"info":{"ephemeral_id":"e4e3ddca-2c84-47d0-959b-e6cc8cdd8ad2","uptime":{"ms":63293},"version":"8.6.1"},"memstats":{"gc_next":25258616,"memory_alloc":20286384,"memory_sys":4194304,"memory_total":42085160,"rss":104591360},"runtime":{"goroutines":35}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0.12,"15":0.05,"5":0.05,"norm":{"1":0.0038,"15":0.0016,"5":0.0016}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2023-02-14T10:10:29.178-0500","log.logger":"monitoring","log.origin":{"file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpuacct":{"total":{"ns":268288423}},"memory":{"mem":{"usage":{"bytes":337440768}}}},"cpu":{"system":{"ticks":190,"time":{"ms":10}},"total":{"ticks":470,"time":{"ms":20},"value":470},"user":{"ticks":280,"time":{"ms":10}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":13},"info":{"ephemeral_id":"e4e3ddca-2c84-47d0-959b-e6cc8cdd8ad2","uptime":{"ms":93295},"version":"8.6.1"},"memstats":{"gc_next":25258616,"memory_alloc":22072976,"memory_total":43871752,"rss":106127360},"runtime":{"goroutines":35}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0.07,"15":0.05,"5":0.05,"norm":{"1":0.0022,"15":0.0016,"5":0.0016}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2023-02-14T10:10:59.179-0500","log.logger":"monitoring","log.origin":{"file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpuacct":{"total":{"ns":44731643}},"memory":{"mem":{"usage":{"bytes":339222528}}}},"cpu":{"system":{"ticks":210,"time":{"ms":20}},"total":{"ticks":510,"time":{"ms":40},"value":510},"user":{"ticks":300,"time":{"ms":20}}},"handles":{"limit":{"hard":4096,"soft":1024},"open":13},"info":{"ephemeral_id":"e4e3ddca-2c84-47d0-959b-e6cc8cdd8ad2","uptime":{"ms":123296},"version":"8.6.1"},"memstats":{"gc_next":25258616,"memory_alloc":23070192,"memory_total":44868968,"rss":107188224},"runtime":{"goroutines":35}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0}},"pipeline":{"clients":1,"events":{"active":0}}},"system":{"load":{"1":0.04,"15":0.05,"5":0.04,"norm":{"1":0.0013,"15":0.0016,"5":0.0013}}}},"ecs.version":"1.6.0"}}

As a note, we replaced the ELK stack with VictoriaMetrics on the same server and the issue described here does not occur; all metrics are stored in VictoriaMetrics. However, that was just a test and we do not want to keep VictoriaMetrics; we would like to fix the situation with ELK and Metricbeat.

Versions:
elk = 8.6.1
metricbeat = 8.6.1
prometheus = 2.32.1

Any advice or hint that you may suggest is greatly appreciated.

Hi @megaxm and welcome to the community!

What do the elasticsearch logs show during this situation?
While you have everything running, you may also want to check GET _cat/thread_pool?v to see if your threads are busy.

If you enable the HTTP port on Metricbeat, this may also give you some insight into what Metricbeat is doing while the issue is occurring:
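For reference, a minimal sketch of that setup (5066 is the default stats port; the host and port here are assumptions to adjust for your environment). In metricbeat.yml:

http:
  enabled: true
  host: localhost
  port: 5066

Then, while the problem is happening, the stats endpoint can be queried:

curl -s "http://localhost:5066/stats?pretty"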

Hi Eddie.
Thanks so much for the welcome and your help.
These are the Elasticsearch logs when the situation is happening:

2023-02-15T20:47:52.400567501Z {"@timestamp":"2023-02-15T20:47:52.399Z", "log.level": "INFO", "message":"[.ds-metricbeat-8.6.1-2023.02.15-000001/eIR_lE0zSA2ZlvrW6JnZdQ] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"-bkC2to_TeKOi2aRNKE5Lw","elasticsearch.node.id":"J4YJCN1NQhSeMnUqjzwzyA","elasticsearch.node.name":"elasticsearch","elasticsearch.cluster.name":"docker-cluster"}
2023-02-15T20:48:12.233683569Z {"@timestamp":"2023-02-15T20:48:12.233Z", "log.level": "INFO", "message":"[.ds-metricbeat-8.6.1-2023.02.15-000001/eIR_lE0zSA2ZlvrW6JnZdQ] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"-bkC2to_TeKOi2aRNKE5Lw","elasticsearch.node.id":"J4YJCN1NQhSeMnUqjzwzyA","elasticsearch.node.name":"elasticsearch","elasticsearch.cluster.name":"docker-cluster"}
2023-02-15T20:48:31.278915540Z {"@timestamp":"2023-02-15T20:48:31.278Z", "log.level": "INFO", "message":"[.ds-metricbeat-8.6.1-2023.02.15-000001/eIR_lE0zSA2ZlvrW6JnZdQ] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"-bkC2to_TeKOi2aRNKE5Lw","elasticsearch.node.id":"J4YJCN1NQhSeMnUqjzwzyA","elasticsearch.node.name":"elasticsearch","elasticsearch.cluster.name":"docker-cluster"}
2023-02-15T20:48:40.006712262Z {"@timestamp":"2023-02-15T20:48:40.006Z", "log.level": "INFO", "message":"[.ds-metricbeat-8.6.1-2023.02.15-000001/eIR_lE0zSA2ZlvrW6JnZdQ] update_mapping [_doc]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataMappingService","elasticsearch.cluster.uuid":"-bkC2to_TeKOi2aRNKE5Lw","elasticsearch.node.id":"J4YJCN1NQhSeMnUqjzwzyA","elasticsearch.node.name":"elasticsearch","elasticsearch.cluster.name":"docker-cluster"}
[root@r620-HMM48Y1 docker-elk]#

This is the thread_pool output as well, when the issue is observed:

[root@r620-HMM48Y1 logs]# curl -X GET "http://localhost:9200/_cat/thread_pool?v" -u elastic:changeme
node_name     name                                   active queue rejected
elasticsearch analyze                                     0     0        0
elasticsearch auto_complete                               0     0        0
elasticsearch azure_event_loop                            0     0        0
elasticsearch ccr                                         0     0        0
elasticsearch cluster_coordination                        0     0        0
elasticsearch fetch_shard_started                         0     0        0
elasticsearch fetch_shard_store                           0     0        0
elasticsearch flush                                       0     0        0
elasticsearch force_merge                                 0     0        0
elasticsearch generic                                     0     0        0
elasticsearch get                                         0     0        0
elasticsearch management                                  2     0        0
elasticsearch ml_datafeed                                 0     0        0
elasticsearch ml_job_comms                                0     0        0
elasticsearch ml_native_inference_comms                   0     0        0
elasticsearch ml_utility                                  0     0        0
elasticsearch refresh                                     1     0        0
elasticsearch repository_azure                            0     0        0
elasticsearch rollup_indexing                             0     0        0
elasticsearch search                                      0     0        0
elasticsearch search_coordination                         0     0        0
elasticsearch search_throttled                            0     0        0
elasticsearch searchable_snapshots_cache_fetch_async      0     0        0
elasticsearch searchable_snapshots_cache_prewarming       0     0        0
elasticsearch security-crypto                             0     0        0
elasticsearch security-token-key                          0     0        0
elasticsearch snapshot                                    0     0        0
elasticsearch snapshot_meta                               0     0        0
elasticsearch system_critical_read                        0     0        0
elasticsearch system_critical_write                       0     0        0
elasticsearch system_read                                 0     0        0
elasticsearch system_write                                0     0        0
elasticsearch vector_tile_generation                      0     0        0
elasticsearch warmer                                      0     0        0
elasticsearch watcher                                     0     0        0
elasticsearch write                                      14     1        0
[root@r620-HMM48Y1 logs]#

And the Metricbeat stats:

[root@r620-HMM48Y1 logs]# curl -X GET "localhost:5066/stats?pretty"
{
  "beat": {
    "cgroup": {
      "cpu": {
        "cfs": {
          "period": {
            "us": 100000
          },
          "quota": {
            "us": 0
          }
        },
        "id": "user.slice",
        "stats": {
          "periods": 0,
          "throttled": {
            "ns": 0,
            "periods": 0
          }
        }
      },
      "cpuacct": {
        "id": "user.slice",
        "total": {
          "ns": 165711386357912
        }
      },
      "memory": {
        "id": "user.slice",
        "mem": {
          "limit": {
            "bytes": 9223372036854771712
          },
          "usage": {
            "bytes": 4145954816
          }
        }
      }
    },
    "cpu": {
      "system": {
        "ticks": 419870,
        "time": {
          "ms": 419870
        }
      },
      "total": {
        "ticks": 2757100,
        "time": {
          "ms": 2757100
        },
        "value": 2757100
      },
      "user": {
        "ticks": 2337230,
        "time": {
          "ms": 2337230
        }
      }
    },
    "handles": {
      "limit": {
        "hard": 4096,
        "soft": 1024
      },
      "open": 76
    },
    "info": {
      "ephemeral_id": "c5f1803a-ee7a-4558-86a4-a1ffcbbc75f7",
      "name": "metricbeat",
      "uptime": {
        "ms": 4794080
      },
      "version": "8.6.1"
    },
    "memstats": {
      "gc_next": 3814310840,
      "memory_alloc": 3224135776,
      "memory_sys": 4901640104,
      "memory_total": 265249895136,
      "rss": 3995021312
    },
    "runtime": {
      "goroutines": 161
    }
  },
  "libbeat": {
    "config": {
      "module": {
        "running": 1,
        "starts": 1,
        "stops": 0
      },
      "reloads": 1,
      "scans": 1
    },
    "output": {
      "events": {
        "acked": 4418636,
        "active": 50,
        "batches": 88420,
        "dropped": 0,
        "duplicates": 0,
        "failed": 0,
        "toomany": 0,
        "total": 4418686
      },
      "read": {
        "bytes": 62418105,
        "errors": 0
      },
      "type": "elasticsearch",
      "write": {
        "bytes": 9110437379,
        "errors": 0
      }
    },
    "pipeline": {
      "clients": 1,
      "events": {
        "active": 4097,
        "dropped": 0,
        "failed": 0,
        "filtered": 0,
        "published": 4422732,
        "retry": 2048,
        "total": 4422733
      },
      "queue": {
        "acked": 4418636,
        "max_events": 4096
      }
    }
  },
  "metricbeat": {
    "prometheus": {
      "remote_write": {
        "events": 4422734,
        "failures": 0,
        "success": 4422735
      }
    }
  },
  "system": {
    "cpu": {
      "cores": 32
    },
    "load": {
      "1": 12.92,
      "15": 7.73,
      "5": 10.91,
      "norm": {
        "1": 0.4038,
        "15": 0.2416,
        "5": 0.3409
      }
    }
  }
}[root@r620-HMM48Y1 logs]#

Do you see anything in these printouts that may provide a hint?
Thanks so much / John

Hi @megaxm

I am just wondering if your queue_config configuration is too aggressive; please check the Prometheus remote_write tuning recommendations.

You can also check these Prometheus metrics to see whether samples are pending, dropped, failed, or retried (for example via the Prometheus HTTP API, as sketched after the list):
prometheus_remote_storage_samples_pending
rate(prometheus_remote_storage_samples_dropped_total[2m])
rate(prometheus_remote_storage_samples_failed_total[2m])
rate(prometheus_remote_storage_samples_retried_total[2m])

If you have dropped/failed/retried samples, it is likely that you need to adjust the configuration.
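For example, one of these can be read directly from the Prometheus server (a sketch assuming Prometheus is reachable on its default port 9090; adjust the host and the metric name as needed):

curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(prometheus_remote_storage_samples_retried_total[2m])'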

Hi @Tetiana_Kravchenko
I have removed the queue_config altogether, and I also added a source_labels filter to the Prometheus config. The config now looks like this:

remote_write:
- url: http://10.127.3.41:9201/write
  remote_timeout: 30s
  write_relabel_configs:
  - source_labels:
    - job
    regex: node-exporter
    action: keep

This works fine, no problem. I have 4 different nodes sending data to Metricbeat/ELK with the same configuration, and they all work without interruption.
But when I add a 5th node and include one more job, "kubelet", in the config, meaning:

remote_write:
- url: http://10.127.3.41:9201/write
  remote_timeout: 30s
  write_relabel_configs:
  - source_labels:
    - job
    regex: node-exporter|kubelet
    action: keep

Then the 5th node stops working, and so do all the previous 4.
This makes me think that there is a bottleneck in Metricbeat.
I understand that the amount of data sent by the nodes is a lot, but is there a setting that you would recommend trying in Metricbeat or Elasticsearch when nodes send huge amounts of data?
As for the metrics you asked me to collect, they are all increasing except for prometheus_remote_storage_samples_failed_total.
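For reference, the kind of Metricbeat knobs I had in mind to experiment with in metricbeat.yml (the values below are purely illustrative and not yet tested on our setup):

queue.mem:
  events: 16384
  flush.min_events: 2048
  flush.timeout: 1s

output.elasticsearch:
  worker: 4
  bulk_max_size: 2048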
Thanks.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.