Uncertainty around legacy collection method and Metricbeat for monitoring

I'm looking at switching an Elasticsearch 7 cluster from the legacy collection method for monitoring to Metricbeat. I have found conflicting advice and information about the state of the legacy collection method, and am confused about some of the advice for Metricbeat.

elasticsearch.yml on each node currently contains

xpack.monitoring.exporters:
  localhost:
    type: http
    host: ["list,of,remote,cluster,servers,here"]

The name localhost there was a poor choice by me or a colleague in the past, since the monitoring data is actually being sent to a remote cluster. The Kibana Upgrade Assistant tells me:

The [xpack.monitoring.exporters.localhost.host] settings are deprecated and will be removed after 8.0

Which I read as meaning they're gone in 8.1, and as we'll be upgrading to something newer than 8.0 we need to switch to Metricbeat first. But there is documentation for the legacy collection method for 8.18 and "latest".

If we keep the legacy collection method and upgrade to 8.18, will sending monitoring data still work?

The aforementioned documentation page says:

Ideally install a single Metricbeat instance configured with scope: cluster and configure hosts to point to an endpoint (e.g. a load-balancing proxy) which directs requests to the master-ineligible nodes in the cluster. If this is not possible then install one Metricbeat instance for each Elasticsearch node in the production cluster and use the default scope: node.
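
As far as I understand it, that scope: cluster setup would mean a single modules.d/elasticsearch-xpack.yml along these lines (my own sketch; the proxy hostname, period and credentials are placeholders rather than anything from our actual config):

- module: elasticsearch
  xpack.enabled: true
  period: 10s
  scope: cluster
  # single endpoint, e.g. a load-balancing proxy in front of the
  # master-ineligible nodes in the production cluster
  hosts: ["http://monitoring-proxy.example.com:9200"]
  username: "remote_monitoring_user"
  password: "changeme"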

I read that and think the scope: cluster setup creates a single point of failure for the monitoring data of all nodes. Single points of failure are bad, especially when other options exist. The scope: node setup doesn't have a single point of failure, and neither does the legacy collection method. Why is switching to a setup with a single point of failure "Ideally" what I should do?

Metricbeat with scope: node collects most of the metrics from the elected master of the cluster, so you must scale up all your master-eligible nodes to account for this extra load and you should not use this mode if you have dedicated master nodes.

I found the thread "Why should we not use Metricbeat with scope: node for clusters with dedicated master nodes", and from that I take away that it would work fine, but it might be "wasteful". However, #7 by DavidTurner in that thread is self contradictory, because it says the earlier statement "the only way for collecting monitoring data is actually the scope: cluster way." is "correct", but then says that scope: node works, which means the statement that scope: cluster is "the only way" is not correct.

Is it actually fine to use scope: node if the cluster has dedicated master nodes (which it does), so long as the dedicated master nodes have enough CPU/RAM?
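
What I have in mind is one Metricbeat on each Elasticsearch node with roughly this module config (again only a sketch; hosts and credentials are placeholders, and scope: node is the default so it could be left out):

- module: elasticsearch
  xpack.enabled: true
  period: 10s
  scope: node   # the default; each instance talks only to its local node
  hosts: ["http://localhost:9200"]
  username: "remote_monitoring_user"
  password: "changeme"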

In scope: node mode most of the metrics come from the single instance attached to the master node, so if that Metricbeat process dies (but the master node remains alive) then you will get almost no monitoring data, and this is harder to detect than getting absolutely no monitoring data. Moreover if the master node is overloaded with metrics collection then the entire cluster will stop working until another master can be elected, but then this new master will be serving the same metrics and can be expected to be similarly overloaded.

Basically there are single points of failure either way, but at least in scope: cluster mode a problem is likely easier to detect (no metrics delivered at all), easier to resolve (just restart the Metricbeat process if it fails), and won't take down the whole production cluster if it goes wrong.

You are right that you'll probably get away with scope: node if you run master nodes that are all sized large enough to handle metrics collection, as long as you monitor all the associated Metricbeat processes carefully.
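
As for keeping an eye on those Metricbeat processes, one option is to have each Metricbeat instance report its own health into the monitoring cluster as well, along these lines in metricbeat.yml (just a sketch; the host and credentials are placeholders):

monitoring.enabled: true
monitoring.elasticsearch:
  hosts: ["https://monitoring-cluster.example.com:9200"]
  username: "remote_monitoring_user"
  password: "changeme"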

Thanks, that's really useful detail which makes the documentation make more sense.

Is it also the case with legacy collection monitoring that most of the load of providing monitoring data falls on the current master node? Our dedicated master nodes have been stable for years with that setup. Given that the documentation shows the legacy collection method still exists in Elasticsearch 9, I may just leave the cluster using that, as it's working fine.
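
For completeness, what we'd be leaving in place in elasticsearch.yml is essentially just the exporter shown above plus collection switched on, i.e. something like this (I'm assuming xpack.monitoring.collection.enabled is already true somewhere in our config, since monitoring data is currently flowing):

xpack.monitoring.collection.enabled: true
xpack.monitoring.exporters:
  localhost:
    type: http
    host: ["list,of,remote,cluster,servers,here"]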

It's broadly similar yes (although at least with legacy collection the master node isn't putting effort into rendering enormous JSON responses to those stats APIs, nor is its local Metricbeat process putting effort into parsing those responses again).

Thanks. What the Kibana Upgrade Assistant is telling me about legacy collection being removed after 8.0 turns out to be incorrect, since legacy collection is still documented for Elasticsearch 9 (I guess plans changed after the version of Kibana we're currently running was released). So I'm not seeing any point in expending the effort to switch to Metricbeat until the removal of legacy collection forces it. Legacy collection actually seems like the better option to me even when starting from scratch, as it doesn't require installing something extra alongside Elasticsearch.
