For those who are interested here is my experience deploying Metricbeat with a monitoring cluster ("Stack Monitoring").
In a nutshell:
- When you monitor a non-trivial Elasticsearch cluster, be sure to add `scope: cluster` to your Metricbeat configuration.
- Elastic should "backport" major documentation changes to all supported versions.
- I used HAProxy to monitor my Elasticsearch cluster with Metricbeat. See my config below, but let me know if you have suggestions or improvements.
Here are the details.
I have installed a monitoring cluster to monitor my Elastic Stack platform with X-Pack. I use Metricbeat and Filebeat to ship data to this monitoring cluster.
My Elasticsearch cluster has dedicated master-eligible nodes and I noticed that Metricbeat on the active master node used more resources than expected. This is also reported elsewhere on the forum. I even had performance issues with master elections. I now think that Metricbeat may have contributed to this problem.
When I installed the monitoring cluster I used the documentation for Metricbeat 7.10/7.11 that was current at the time. It says:
Install Metricbeat on each Elasticsearch node in the production cluster. Failure to install on each node may result in incomplete or missing results. By default, scope is set to node and each entry in the hosts list indicates a distinct node in an Elasticsearch cluster.
So I installed Metricbeat and enabled the `elasticsearch-xpack.yml` module on all Elasticsearch nodes (`scope: node` was, and still is, the default).
But when I read the documentation of more recent Metricbeat versions I noticed that Elastic's advice has changed:
Ideally install a single Metricbeat instance configured with scope: cluster and configure hosts to point to an endpoint (e.g. a load-balancing proxy) which directs requests to the master-ineligible nodes in the cluster. If this is not possible then install one Metricbeat instance for each Elasticsearch node in the production cluster and use the default scope: node. [...] Metricbeat with scope: node collects most of the metrics from the elected master of the cluster, so you must scale up all your master-eligible nodes to account for this extra load and you should not use this mode if you have dedicated master nodes.
They now mention that `scope: node` increases the load on the master nodes, which I can confirm. This means that `scope: cluster` is now recommended for non-trivial setups! When I adapted my Metricbeat setup accordingly, the CPU usage on my master node dropped considerably. Perhaps I missed it, but I could not find this important change in the Metricbeat release notes. David Turner of Elastic created an issue on GitHub and updated the documentation, but this change was not added to the online guide of older Metricbeat versions that are still supported (versions 7.10 and 7.11 at the time of this writing).
My suggestion to Elastic would be to "backport" important changes to the documentation of all supported versions.
So now the online guide recommends the following if you want to monitor a non-trivial Elasticsearch cluster with Metricbeat:
- Use a single(!) instance of Metricbeat.
- Only query nodes that are not master-eligible.
- Add the parameter `scope: cluster` to your Metricbeat configuration.
- Set the `hosts` parameter to a single end-point (`<end-point>`), not a list of data nodes.
Unfortunately this single end-point is not discussed extensively in the documentation. Elastic doesn't even provide a recommendation. I believe your options include:
- Your Elasticsearch cluster already has a single end-point, for example an Elastic Cloud instance.
- You pick a single data node at random. This does not provide redundancy so if the data node goes down, so does your monitoring.
- You create a single end-point that consists of a round robin DNS record with the IP addresses of multiple data nodes. Avoid using DNS caching on the Metricbeat host. This is a poor man's implementation of load-balancing. If a data node goes down then there will be some impact on monitoring.
- You add a coordinating node to your Elasticsearch cluster which acts as load-balancing proxy to the data nodes. This does not provide redundancy, so if the coordinating node goes down: no more monitoring.
- You add a load-balancing proxy with multiple data nodes as members. This could be your cloud provider's load-balancer such as AWS ELB, a hardware-based load-balancer such as an F5, or a software-based load-balancer.
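For the coordinating-node option above, a sketch of what the node's `elasticsearch.yml` would contain. This uses the standard `node.roles` setting; the rest of the node's configuration (cluster name, discovery, etc.) is omitted here.

```yaml
# elasticsearch.yml (sketch): a coordinating-only node.
# An empty node.roles list means the node holds no data and is not
# master-eligible; it only routes and fans out requests to the data nodes.
node.roles: [ ]
```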
I decided to go for the software-based HAProxy because it is lightweight and supports health checks. I installed HAProxy on the same system that runs Metricbeat. Does anyone have comments on the config below? Would you consider a health-check interval of 2 seconds too aggressive? I am checking for the Elasticsearch HTTP banner that contains the Lucene version and the tagline. Or would a simple TCP port check be sufficient?
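For context on that health check: HAProxy's `http-check expect string` does a plain substring match on the response body, and the Elasticsearch root endpoint returns a JSON banner containing a `lucene_version` field. A quick illustration (the banner below is abridged and the values are illustrative, not from a real cluster):

```shell
# Abridged sample of the Elasticsearch root ("/") banner (values illustrative)
banner='{"version":{"number":"7.17.0","lucene_version":"8.11.1"},"tagline":"You Know, for Search"}'

# HAProxy's "http-check expect string lucene_version" is a substring match,
# roughly equivalent to this grep; -c counts matching lines.
echo "$banner" | grep -c lucene_version   # prints 1 when the check would pass
```

Unlike a TCP port check, this confirms that Elasticsearch itself answered, not merely that something accepted the connection on port 9200.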
If desired I could post my Filebeat configuration files as well. These ship log messages from my Elastic Stack platform to the monitoring cluster. The configs are relatively straightforward. I only had to make sure that log messages generated by Metricbeat are filtered out by Filebeat.
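For the Filebeat filtering mentioned above, one way to drop such events is Filebeat's `drop_event` processor. This is only a sketch: the matching condition is an assumption and depends on how Metricbeat-generated messages actually appear in your logs.

```yaml
# Hypothetical Filebeat processor (filebeat.yml) that drops events
# produced by Metricbeat's own requests. The "contains" condition is an
# assumption; adjust the field and string to match your log contents.
processors:
  - drop_event:
      when:
        contains:
          message: "metricbeat"
```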
/etc/haproxy/haproxy.cfg:

```
global
    # Comment out the next line to disable logging Metricbeat requests
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    # Inspect stats with: hatop -s /run/haproxy/admin.sock
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode http
    option httplog
    option dontlognull
    option forwardfor except 127.0.0.0/8
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http
    retries 3
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 10s
    maxconn 3000

frontend ft_elasticsearch_http
    bind 127.0.0.1:9201 name elasticsearch_http
    default_backend bk_elasticsearch_http

backend bk_elasticsearch_http
    mode http
    option httpchk GET / HTTP/1.1\r\n
    http-check expect string lucene_version
    balance roundrobin
    option log-health-checks
    default-server check inter 2s fall 3 rise 2
    server datanode-00.example.com_9200 datanode-00.example.com:9200
    server datanode-01.example.com_9200 datanode-01.example.com:9200
    server datanode-02.example.com_9200 datanode-02.example.com:9200
    server datanode-XX.example.com_9200 datanode-XX.example.com:9200
```
/etc/metricbeat/modules.d/elasticsearch-xpack.yml:

```yaml
#----------------------------- Elasticsearch Module --------------------------
- module: elasticsearch
  enabled: true
  period: 10s
  # Get Elasticsearch status from data nodes via a locally installed HAProxy
  hosts: ["http://127.0.0.1:9201"]
  xpack.enabled: true
  scope: cluster
```