For those who are interested, here is my experience deploying Metricbeat with a monitoring cluster ("Stack Monitoring").
In a nutshell:
- When you monitor a non-trivial Elasticsearch cluster, be sure to add scope: cluster to your Metricbeat configuration.
- Elastic should "backport" major documentation changes to all supported versions.
- I used HAProxy as the single end-point for Metricbeat to monitor my Elasticsearch cluster. See my config below, but let me know if you have suggestions or improvements.
Here are the details.
I have installed a monitoring cluster to monitor my Elastic Stack platform with X-Pack. I use Metricbeat and Filebeat to ship data to this monitoring cluster.
My Elasticsearch cluster has dedicated master-eligible nodes and I noticed that Metricbeat on the active master node used more resources than expected. This is also reported elsewhere on the forum. I even had performance issues with master elections. I now think that Metricbeat may have contributed to this problem.
When I installed the monitoring cluster I used the documentation for Metricbeat 7.10/7.11 that was current at the time. It says:
Install Metricbeat on each Elasticsearch node in the production cluster. Failure to install on each node may result in incomplete or missing results. By default, scope is set to node and each entry in the hosts list indicates a distinct node in an Elasticsearch cluster.
So I installed Metricbeat and enabled the elasticsearch-xpack.yml module on all Elasticsearch nodes (scope: node was and still is the default).
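For reference, the per-node module configuration on each Elasticsearch node looked roughly like this (the exact hosts entry is illustrative; scope was omitted, so the node default applied):

- module: elasticsearch
  xpack.enabled: true
  period: 10s
  # Query only the local node; scope defaults to "node"
  hosts: ["http://localhost:9200"]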
But when I read the documentation of more recent Metricbeat versions I noticed that Elastic's advice has changed:
Ideally install a single Metricbeat instance configured with scope: cluster and configure hosts to point to an endpoint (e.g. a load-balancing proxy) which directs requests to the master-ineligible nodes in the cluster. If this is not possible then install one Metricbeat instance for each Elasticsearch node in the production cluster and use the default scope: node. [...] Metricbeat with scope: node collects most of the metrics from the elected master of the cluster, so you must scale up all your master-eligible nodes to account for this extra load and you should not use this mode if you have dedicated master nodes.
They now mention that scope: node increases the load on the master nodes, which I can confirm. This means that scope: cluster is now recommended for non-trivial setups! When I applied this change to my Metricbeat setup, the CPU usage on my master node dropped considerably. Perhaps I missed it, but I could not find this important change in the Metricbeat release notes. David Turner of Elastic created an issue in GitHub and updated the documentation, but this change was not added to the online guide of older Metricbeat versions that are still supported (versions 7.10 and 7.11 at the time of this writing).
My suggestion to Elastic would be to "backport" important changes to the documentation of all supported versions.
So now the online guide recommends the following if you want to monitor a non-trivial Elasticsearch cluster with Metricbeat:
- Use a single(!) instance of Metricbeat.
- Only query nodes that are not master-eligible.
- Add the parameter scope: cluster to your Metricbeat configuration.
- Add the parameter hosts: <end-point>, where <end-point> is a single end-point, not a list of data nodes.
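In module-configuration terms that boils down to something like the snippet below, with <end-point> standing in for whatever single end-point you choose (my concrete HAProxy-based version is at the bottom of this post):

- module: elasticsearch
  xpack.enabled: true
  period: 10s
  scope: cluster
  # A single end-point, e.g. a load-balancing proxy in front of the data nodes
  hosts: ["http://<end-point>:9200"]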
Unfortunately this single end-point is not discussed in much detail in the documentation, and Elastic don't even provide a recommendation. As far as I can tell, the options include:
- Your Elasticsearch cluster already has a single end-point, for example an Elastic Cloud instance.
- You pick a single data node at random. This does not provide redundancy, so if that data node goes down, so does your monitoring.
- You create a single end-point that consists of a round-robin DNS record with the IP addresses of multiple data nodes (avoid DNS caching on the Metricbeat host). This is a poor man's implementation of load-balancing; if a data node goes down there will be some impact on monitoring.
- You add a coordinating-only node to your Elasticsearch cluster which acts as a load-balancing proxy to the data nodes (a minimal configuration sketch follows this list). This does not provide redundancy either, so if the coordinating node goes down, monitoring stops.
- You add a load-balancing proxy with multiple data nodes as members. This could be your cloud provider's load-balancer such as AWS ELB, a hardware-based load-balancer such as an F5, or a software-based load-balancer.
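For the coordinating-node option, the relevant setting in elasticsearch.yml would be something like this (assuming Elasticsearch 7.9 or later, where an empty roles list makes a node coordinating-only):

# elasticsearch.yml on the coordinating-only node (illustrative)
# An empty list of roles = coordinating-only: no master, data or ingest duties
node.roles: [ ]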
I decided to go for the software-based HAProxy because it is lightweight and supports health checks. I installed HAProxy on the same system that is running Metricbeat. Does anyone have comments on the config below? Would you consider a health-check interval of 2 seconds too aggressive? I am checking for the Elasticsearch HTTP banner that contains the Lucene version and the tagline. Or would a simple TCP port check be sufficient?
If desired I could post my Filebeat configuration files as well. These ship log messages from my Elastic Stack platform to the monitoring cluster. The configs are relatively straightforward. I only had to make sure that log messages generated by Metricbeat are filtered out by Filebeat.
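One way to do that filtering is a drop_event processor along these lines (illustrative only; the exact match condition depends on which logs you ship and how they are parsed):

processors:
  # Drop entries produced by Metricbeat itself (illustrative match condition)
  - drop_event:
      when:
        contains:
          message: "metricbeat"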
/etc/haproxy/haproxy.cfg:
global
    # Comment out the next line to disable logging Metricbeat requests
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    # Inspect stats with: hatop -s /run/haproxy/admin.sock
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    log global
    mode http
    option httplog
    option dontlognull
    option forwardfor except 127.0.0.0/8
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http
    retries 3
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    timeout client 1m
    timeout server 1m
    timeout http-keep-alive 10s
    timeout check 10s
    maxconn 3000

frontend ft_elasticsearch_http
    bind 127.0.0.1:9201 name elasticsearch_http
    default_backend bk_elasticsearch_http

backend bk_elasticsearch_http
    mode http
    option httpchk GET / HTTP/1.1\r\n
    http-check expect string lucene_version
    balance roundrobin
    option log-health-checks
    default-server check inter 2s fall 3 rise 2
    server datanode-00.example.com_9200 datanode-00.example.com:9200
    server datanode-01.example.com_9200 datanode-01.example.com:9200
    server datanode-02.example.com_9200 datanode-02.example.com:9200
    server datanode-XX.example.com_9200 datanode-XX.example.com:9200
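A quick way to sanity-check the proxy before pointing Metricbeat at it (assuming no HTTP-layer authentication; add credentials otherwise):

# Validate the HAProxy configuration
haproxy -c -f /etc/haproxy/haproxy.cfg
# Should return the Elasticsearch banner (including lucene_version) from one of the data nodes
curl http://127.0.0.1:9201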
/etc/metricbeat/modules.d/elasticsearch-xpack.yml:
#----------------------------- Elasticsearch Module --------------------------
- module: elasticsearch
  enabled: true
  period: 10s
  # Get Elasticsearch status from data nodes via a locally installed HAProxy
  hosts: ["http://127.0.0.1:9201"]
  xpack.enabled: true
  scope: cluster
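After switching to this configuration, the setup can be checked with Metricbeat's built-in test commands (available in recent 7.x releases):

# Check the configuration syntax and the connection to the monitoring cluster
metricbeat test config
metricbeat test output
# Fetch events from the configured modules to verify the elasticsearch module works via the proxy
metricbeat test modules elasticsearch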