My experiences setting up Metricbeat for monitoring an Elastic cluster

For those who are interested, here is my experience deploying Metricbeat with a separate monitoring cluster ("Stack Monitoring").

In a nutshell:

  1. When you monitor a non-trivial Elasticsearch cluster, be sure to add scope: cluster to your Metricbeat configuration.
  2. Elastic should "backport" major documentation changes to all supported versions.
  3. I put HAProxy in front of my data nodes as the single end-point that Metricbeat queries. See my config below, and let me know if you have suggestions or improvements.

Here are the details.

I have installed a monitoring cluster to monitor my Elastic Stack platform with X-Pack. I use Metricbeat and Filebeat to ship data to this monitoring cluster.

My Elasticsearch cluster has dedicated master-eligible nodes and I noticed that Metricbeat on the active master node used more resources than expected. This is also reported elsewhere on the forum. I even had performance issues with master elections. I now think that Metricbeat may have contributed to this problem.

When I installed the monitoring cluster I used the documentation for Metricbeat 7.10/7.11 that was current at the time. It says:

Install Metricbeat on each Elasticsearch node in the production cluster. Failure to install on each node may result in incomplete or missing results. By default, scope is set to node and each entry in the hosts list indicates a distinct node in an Elasticsearch cluster.

So I installed Metricbeat and enabled the elasticsearch-xpack module (modules.d/elasticsearch-xpack.yml) on all Elasticsearch nodes (scope: node was and still is the default).

But when I read the documentation for more recent Metricbeat versions, I noticed that Elastic's advice had changed:

Ideally install a single Metricbeat instance configured with scope: cluster and configure hosts to point to an endpoint (e.g. a load-balancing proxy) which directs requests to the master-ineligible nodes in the cluster. If this is not possible then install one Metricbeat instance for each Elasticsearch node in the production cluster and use the default scope: node. [...] Metricbeat with scope: node collects most of the metrics from the elected master of the cluster, so you must scale up all your master-eligible nodes to account for this extra load and you should not use this mode if you have dedicated master nodes.

They now mention that scope: node increases the load on the master nodes, which I can confirm. This means that scope: cluster is now the recommended setting for non-trivial setups! When I applied this to my Metricbeat setup, the CPU usage on my master node dropped considerably. Perhaps I missed it, but I could not find this important change in the Metricbeat release notes. David Turner of Elastic created an issue on GitHub and updated the documentation, but the change was not added to the online guide for older Metricbeat versions that are still supported (7.10 and 7.11 at the time of writing).

My suggestion to Elastic would be to "backport" important changes to the documentation of all supported versions.

So now the online guide recommends the following if you want to monitor a non-trivial Elasticsearch cluster with Metricbeat:

  • Use a single(!) instance of Metricbeat.
  • Only query nodes that are not master-eligible.
  • Add the parameter scope: cluster to your Metricbeat configuration.
  • Add the parameter hosts: <end-point> where <end-point> is a single end-point, not a list of data nodes.

Unfortunately, the documentation does not discuss this single end-point in any detail, and Elastic don't recommend a particular way to build it. As far as I can tell, the options include:

  • Your Elasticsearch cluster already has a single end-point, for example an Elastic Cloud instance.
  • You pick a single data node at random. This does not provide redundancy so if the data node goes down, so does your monitoring.
  • You create a single end-point that consists of a round robin DNS record with the IP addresses of multiple data nodes. Avoid using DNS caching on the Metricbeat host. This is a poor man's implementation of load-balancing. If a data node goes down then there will be some impact on monitoring.
  • You add a coordinating node to your Elasticsearch cluster which acts as a load-balancing proxy to the data nodes. This does not provide redundancy, so if the coordinating node goes down: no more monitoring.
  • You add a load-balancing proxy with multiple data nodes as members. This could be your cloud provider's load-balancer such as AWS ELB, a hardware-based load-balancer such as an F5, or a software-based load-balancer.
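
For the round robin DNS option, it is worth confirming that the record really hands out every data node before pointing Metricbeat at it. Below is a minimal Python sketch that lists the distinct addresses the local resolver returns for a name. The function name resolve_all and the record name es-data.example.com are hypothetical and only serve as illustration.

```python
from __future__ import annotations

import socket


def resolve_all(hostname: str, port: int = 9200) -> list[str]:
    """Return every distinct IP address the resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


# Example with a hypothetical round robin record:
#   resolve_all("es-data.example.com")
# A record that covers three data nodes should yield three addresses.
```

If fewer addresses come back than there are data nodes, either the record is incomplete or something on the Metricbeat host is caching a partial answer.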

I decided to go for the software-based HAProxy because it is lightweight and supports health checks. I installed HAProxy on the same system that is running Metricbeat. Does anyone have comments on the config below? Would you consider a health check interval of 2 seconds too aggressive? I am checking for the Elasticsearch HTTP banner that contains the Lucene version and the tagline. Or would a simple TCP port check be sufficient?
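
To make the difference between the two kinds of health check concrete, here is a rough Python analogue of both. It is only a sketch, not how HAProxy implements its checks: the function names tcp_check and banner_check are made up, and the marker string lucene_version is the one used in the HAProxy config in this post.

```python
import http.client
import socket
import urllib.request

# Talk to the node directly and ignore any http_proxy environment settings.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def tcp_check(host, port, timeout=2.0):
    """True if a TCP connection can be opened: the port is listening, nothing more."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def banner_check(url, marker="lucene_version", timeout=2.0):
    """True only if GET <url> succeeds and the response body contains the marker."""
    try:
        with _DIRECT.open(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException, ValueError):
        # Connection errors, timeouts, HTTP 4xx/5xx and malformed replies all count as failed.
        return False
    return marker in body
```

The TCP check passes as soon as something accepts connections on the port, while the banner check also requires the node to answer an HTTP request with the expected content. A node that is listening but not yet serving requests properly would pass the first and fail the second.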

If desired I could post my Filebeat configuration files as well. These ship log messages from my Elastic Stack platform to the monitoring cluster. The configs are relatively straightforward. I only had to make sure that log messages generated by Metricbeat are filtered out by Filebeat.


HAProxy configuration:

global
# Comment out the next line to disable logging Metricbeat requests
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
# Inspect stats with: hatop -s /run/haproxy/admin.sock
        stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
        stats timeout 30s
        user haproxy
        group haproxy

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        option  forwardfor except
        timeout connect 5000
        timeout client  50000
        timeout server  50000
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http
        retries                 3
        timeout http-request    10s
        timeout queue           1m
        timeout connect         10s
        timeout client          1m
        timeout server          1m
        timeout http-keep-alive 10s
        timeout check           10s
        maxconn                 3000

frontend ft_elasticsearch_http
        bind name elasticsearch_http
        default_backend bk_elasticsearch_http

backend bk_elasticsearch_http
        mode http
        option httpchk GET / HTTP/1.1\r\n
        http-check expect string lucene_version
        balance roundrobin
        option log-health-checks
        default-server check inter 2s fall 3 rise 2
        server datanode-00.example.com_9200
        server datanode-01.example.com_9200
        server datanode-02.example.com_9200
        server datanode-XX.example.com_9200
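
To get a feel for what "check inter 2s fall 3 rise 2" means in practice, here is a small Python sketch that replays a sequence of check results against the documented fall/rise rule: a server is marked down after fall consecutive failed checks and marked up again after rise consecutive successful ones. This is a simplified model for reasoning about timing, not HAProxy's actual code, and the function names are made up.

```python
def seconds_to_mark_down(inter_seconds, fall):
    """Rough upper bound on detection time, ignoring the duration of each check."""
    return inter_seconds * fall


def simulate(results, fall=3, rise=2, start_up=True):
    """Replay check results (True = check passed); return the server state after each check."""
    up = start_up
    streak = 0  # consecutive results that contradict the current state
    states = []
    for ok in results:
        if ok == up:
            streak = 0  # result agrees with the current state, so the streak is broken
        else:
            streak += 1
            if streak >= (fall if up else rise):
                up = not up
                streak = 0
        states.append(up)
    return states
```

With the values from the config above, a data node that stops answering is taken out of rotation after roughly 3 x 2 = 6 seconds, and a single failed check between successful ones does not remove it at all.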

#----------------------------- Elasticsearch Module --------------------------
- module: elasticsearch
  enabled: true
  period: 10s
  # Get Elasticsearch status from data nodes via a locally installed HAProxy
  hosts: [""]
  xpack.enabled: true
  scope: cluster
