Monitor cluster with Elastic Agent

Hey Everyone,

We're trying to move from the legacy exporters over to Elastic Agent. Our data pipeline is Elastic Agent > Logstash > Kafka > Elastic. I have a couple of questions and would appreciate any and all knowledge anyone is willing to share.

  • Will monitoring work if the underlying data stream names are changed?
  • If it is a big cluster with dedicated master, hot, warm, cold and coordinating nodes, should I use the scope option for monitoring?
  • If the scope option is selected, how do you make the agent collecting the metrics HA (highly available)?

To briefly explain the first question, we are using Kafka's Elasticsearch connectors, which require a type and a dataset name to be specified.
As such, the data stream name in ES would be, for example, metrics-something-kibana.stack_monitoring.stats-prod.

Regarding the second and third: what I concluded from the documentation is that you should use the scope option and add an LB URL that balances across nodes which are not master-eligible (in our case those would be coordinating-only nodes). But how do you make it so that if the Elastic Agent collecting the metrics goes offline, the metrics still keep coming?
If you were to have two agents, both collecting the same data from the same cluster, would they duplicate the metric data, or does every pipeline have a deterministic way to generate the doc _id field?

Sorry for the long post, here's a cookie :cookie: for those of you who made it, and thanks for any help in advance!

Cheers,
Luka

Hi Luka!

I'll try to answer the parts I can.

For the data stream name, the Stack Monitoring UI looks for data streams matching the following naming convention for data collected with Elastic Agent: metrics-elasticsearch.stack_monitoring.DATASET-* (where DATASET is, for example, cluster_stats).
As long as your data streams match this pattern, the UI should work. From your example, the something part would break it. I'm afraid this isn't something you can configure.

The size of your cluster isn't really what dictates the value of the scope config, but rather how you want to deploy your collection. If you use cluster, you only need to deploy one agent, which should target the master node to fetch all the needed data sets. But you might want to avoid this in a large cluster since that will cause more load on your master. So in that case using node might be better, but then you need to deploy one agent per node (at least this is my understanding).
(see discussion below)
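To make the two modes concrete, here's a rough Metricbeat-style sketch (the host is a placeholder, and as far as I know the Elastic Agent integration exposes the same scope choice in its settings):

metricbeat.modules:
  - module: elasticsearch
    xpack.enabled: true
    # scope: node (the default) collects only from the node(s) listed in hosts,
    # so you run one collector next to every node.
    # scope: cluster makes a single collector fetch the cluster-wide metrics
    # through whatever endpoint hosts points at.
    scope: cluster
    hosts: ["https://localhost:9200"]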

How you achieve HA on the agent likely depends on how you deploy your agent and stack; this is somewhat outside the realm of the Elastic Stack. If it's running as a sidecar in a Kubernetes pod, for example, then we would expect Kubernetes to take responsibility for that. In a raw deployment, I don't know. Who watches the watcher, so to say. That's why we have alerting rules about monitoring data going missing, so that's one option.

Having two agents collecting the same metrics would lead to issues since they have no way to coordinate around duplications.

Hi @miltonhultgren, thank you for the prompt response!

For the data stream name, the Stack Monitoring UI looks for data streams matching the following naming convention for data collected with Elastic Agent: metrics-elasticsearch.stack_monitoring.DATASET-* (where DATASET is, for example, cluster_stats).
As long as your data streams match this pattern, the UI should work. From your example, the something part would break it. I'm afraid this isn't something you can configure.

Unfortunately you are correct about the something part breaking it. After disabling the legacy collection via:

PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": false
  }
}

The Stack Monitoring data completely stopped updating. Currently I'm monitoring one hot node, one Kibana instance and one Logstash instance. All the monitoring data is being ingested properly into ES (afaik), since the data streams are constantly getting new documents.

Is there any potential for "fixing" (not literally, since it ain't broke) the Stack Monitoring UI in the future so it works like a dashboard? In the sense that it would look for logs-*/metrics-* and then filter further using the data_stream.dataset field, if that's even possible? Considering all the prebuilt dashboards work in this manner, it might be a good idea to streamline that as well.

Guess I'll just bypass Kafka for the metrics and only send logs through it; I don't see another solution atm.

The size of your cluster isn't really what dictates the value of the scope config, but rather how you want to deploy your collection. If you use cluster, you only need to deploy one agent, which should target the master node to fetch all the needed data sets. But you might want to avoid this in a large cluster since that will cause more load on your master. So in that case using node might be better, but then you need to deploy one agent per node (at least this is my understanding).

That was my original understanding as well, but after reading this from the docs I'm a bit confused:

Elastic Agent will collect most of the metrics from the elected master of the cluster, so you must scale up all your master-eligible nodes to account for this extra load. Do not use this node if you have dedicated master nodes.

What I'm getting from this is that regardless of whether you choose to monitor the cluster via the master or via a single node, it will still go to the master to collect cluster-related information (this is an educated guess). Meaning that in my case, with 100+ nodes, it would probably outright crash it.

Having two agents collecting the same metrics would lead to issues since they have no way to coordinate around duplications.

Yeah, I figured as much after looking at the ingest pipelines. Thank you nevertheless, this is extremely valuable info!

I agree that it would be a good idea to align the Stack Monitoring UI with how dashboards work, that's however not on our roadmap I'm afraid.

One option might be to use an ingest pipeline with the newly added reroute processor to direct the documents to the "right" data stream.
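As a minimal sketch (the pipeline name, dataset and namespace are made up for illustration, and you would still need to attach the pipeline to the incoming data stream):

# Reroutes matching documents to metrics-elasticsearch.stack_monitoring.cluster_stats-prod
# (the type prefix is taken from the original data stream).
PUT _ingest/pipeline/reroute-stack-monitoring
{
  "processors": [
    {
      "reroute": {
        "dataset": "elasticsearch.stack_monitoring.cluster_stats",
        "namespace": "prod"
      }
    }
  ]
}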

For the scope setting, my experience is that there are some limits to the cluster size we can monitor regardless of which mode you use, simply because the collection depends on getting the cluster state, which is owned by the master-eligible nodes, and this is usually the data set that grows the largest as the cluster gets huge.
So there isn't that much we can do to work around that beyond scaling up the master nodes, lowering the polling frequency of the collection, or increasing the timeout settings so the collection waits for the master node to compute and transfer the cluster state. Or changing the topology of the cluster (which usually isn't an option).

@DavidTurner Is my understanding correct here?

Not really, most of the APIs that Metricbeat hits will do all the hard work on the node handling the HTTP request and not the master. Using scope: cluster with a single Beat connected to a node other than the elected master is a very good idea and will significantly reduce the load on the elected master.

Great, thanks for clarifying!

Then, for a larger cluster, is it best to use scope: cluster and load balance across all the non-master nodes?

For my own understanding, does that include resolving the cluster state as well? Meaning the non-master node can resolve that without involving the master node? Or is it more that most of the other work is not being done by the master node, which leaves the master node with enough resources to resolve the cluster state request?

Right, that'd work well.

Not sure what you mean by "resolving" here. Metricbeat doesn't request the full cluster state AFAICT, only bits of it. It'd be even better if those requests used the ?local query parameter to keep that work completely off the elected master, but they're not really the expensive requests anyway. It's other things like shard-level stats that tend to cause the bigger problems, and they don't hit the master at all.
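For reference, a filtered request along these lines (illustrative only, not what Metricbeat literally sends) keeps that work on the node handling it:

# Fetch only selected parts of the cluster state; ?local=true serves the
# response from the handling node instead of the elected master.
GET _cluster/state/nodes,routing_table?local=true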

FWIW this concern is independent of the choice of scope: node or scope: cluster, you have a single point of failure either way. If using scope: node then things will stop working if the Metricbeat instance targeting the elected master fails.

Thanks for all the info guys!

Just to clarify, the scope: node option will make it so it only gathers the metrics from said node, while scope: cluster will make it so the node which gets the requests coordinates them further to get the statistics from every node. Did I get that right?

FWIW this concern is independent of the choice of scope: node or scope: cluster, you have a single point of failure either way. If using scope: node then things will stop working if the Metricbeat instance targeting the elected master fails.

Yeah, that I get. In which case I'd much rather go with scope: node, so that even if it fails it only fails for that one node. Every node will have a local Elastic Agent since they need to collect the logs either way, and the installation + enrollment is automated, so not a big deal.

My main concern is that, if we pick scope: node and have a local agent on each node targeting that node's respective API, the resulting requests that do hit the master might overload it. From what I've gathered in this thread that shouldn't happen, since the requests that do make it to the master node are not expensive.

That's not quite right. scope: node means it collects a few node-specific metrics from each node, but most of the metrics are cluster-wide things rather than node-specific ones, and scope: node means that those cluster-wide metrics are all only collected directly from the elected master. That imposes an enormous amount of extra load on the elected master in a big cluster, and means that if this Beat fails you get very few metrics.

You want scope: cluster.

Alright, that makes sense. Just figured that out due to this log:

Thank you once again for all the help!

TLDR for those who don't want to read everything:

  • The monitoring data streams it writes to have to match the standard naming scheme <type>-<dataset>-<namespace>; only the namespace can be changed
  • If you have a large cluster with dedicated master-eligible nodes, use scope: cluster pointed at an LB that balances across master-ineligible nodes (see the sketch after this list)
  • The agent is a SPOF (single point of failure), unless it runs on a platform that provides HA natively, like k8s
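For reference, the setup we're converging on would look roughly like this (Metricbeat-style again, and the LB URL is a placeholder):

metricbeat.modules:
  - module: elasticsearch
    xpack.enabled: true
    scope: cluster
    period: 10s
    # Load balancer spreading requests across the master-ineligible
    # (coordinating-only) nodes, keeping the collection load off the elected master.
    hosts: ["https://es-coordinating-lb.example.internal:9200"]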

Just a comment: unless something has changed, using scope: cluster will show the same transport address for every Elasticsearch node.

So for every node, the IP address you see in Kibana monitoring will be the one of the node the requests are being made against.

We used scope: cluster in the past, but this was confusing and we went back to using scope: node with one Metricbeat per node, which was not an issue because we use it to get system metrics as well.

That sounds like a pretty bad bug, @miltonhultgren are you aware of this?

I reported it in a support case in October last year; in one of the answers they mentioned that an internal enhancement was created, this one: https://github.com/elastic/enhancements/issues/17248

I was on 8.4 at the time, not sure if this still persists because I'm using one Metricbeat per node.

I'm on 8.8.1 atm. Can't really say if it's the case in Stack Monitoring, but it's not the case in the documents.
I switched from node to cluster just now to see what it looks like, and each node's correct transport address is showing.

I was not aware of this, I'll verify if this happens in either case (in the docs or the UI) and follow up with a fix if needed. @leandrojmp thanks for bringing that to our attention!

@leandrojmp Just following up: the bug indeed still exists. I've opened a PR to address it: [elasticsearch] Always report transport address in node_stats by miltonhultgren · Pull Request #36582 · elastic/beats · GitHub

