Is Version 8 with Elastic Agent stable in medium to large environments?

Hi there,

I am in charge of a couple of clusters used for security and log retention. Both are on version 7.17.3, and each has about 5,000 endpoints. We will have to upgrade to version 8.x at some point, but we have some concerns about the performance of Elastic Agent and Fleet Server in an environment of our size.

We currently use six Logstash servers in our ingestion pipelines to break up the traffic between the endpoints and the Elastic Stack. Our biggest concern at the moment is network performance across our environments once we remove those Logstash servers and open our stack up to that many individual connections.

I am curious if anyone on here has already upgraded to version 8 in a similarly sized environment and what your experience has been.

Thanks,
Alex

While not at the scale you're at (I currently manage ~500 agents), I think I can provide some feedback here:

  1. In more recent 8.x versions, Logstash has been added as an output option for Fleet-managed Elastic Agent (I think it is currently in beta for Fleet-managed but appears to be going GA in 8.4). So, if you really wanted to keep Logstash in the mix, you could.
  2. For my setup, I use dedicated ingest and coordinating nodes, which seems to help with scaling at the larger end of deployments. At the scale you're running, it would probably be ideal to have dedicated coordinating and ingest nodes handle the connections, so that your data nodes don't take the hit. Though this would also really depend on the integrations you're running in your policies.
  3. In a more recent version of 8.x, you can now also specify different Elasticsearch URLs for different Agent Policies, so if you want to manually load balance by policy, this is also possible.
  4. Fleet Server scalability | Fleet and Elastic Agent Guide [8.3] | Elastic provides some good info on the Fleet Server scalability side of things.
  5. Elastic seems to be redesigning how Elastic Agent will ship data long term (GitHub - elastic/elastic-agent-shipper: Data shipper for the Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.), which may change things eventually. But as of right now, Elastic Agent data shipping is very similar to Beats, as it really is just Beats under the hood.
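On point 1, a standalone-style elastic-agent.yml gives a rough feel for what the Fleet UI configures for you. This is only a sketch; the host names and certificate path are placeholders, and exact keys can vary between 8.x releases:

```yaml
# Hypothetical sketch of an Elastic Agent Logstash output (standalone-style config).
# Host and CA path are placeholders; adjust to your environment.
outputs:
  default:
    type: logstash
    hosts: ["logstash-1.example.internal:5044"]
    ssl.certificate_authorities: ["/etc/elastic-agent/ca.crt"]
```

In Fleet-managed mode you would set the equivalent of this in the Fleet output settings rather than editing a file on each host.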

The Agent and Fleet team has been scale-testing Fleet's agent management functionality at much larger scale than 5K, and we have also seen growing usage at the 5K+ scale.

I am curious whether your current data collection already uses Fleet-managed Elastic Agents or something else. Are you using built-in modules/integrations, or do you have custom data transformation logic built in Logstash/ingest pipelines?

As Ben mentioned, Agent output to Logstash was released in 8.2 and will go GA in 8.4, so you should be able to keep your existing ingest architecture by configuring agents to collect and ship data to your existing Logstash instances.

Ben, thanks for the reply.

  1. I am curious about this one. I spent a little time looking at the documentation, but it seems like this would interfere with some of the Elastic Agent functionality. One of the things we are excited about with Elastic Agent is the ability to push upgrades to our Beats infrastructure (or, at that point, to our Elastic Agent infrastructure). It seems like if you have Logstash in the data flow, you won't be able to communicate directly with the endpoints to effect the changes you want in your environment.
  2. I think this is an interesting suggestion, but at this point I am not worried about the load on my data nodes. We have pretty beefy servers right now, and even though we are ingesting at a pretty high rate, the servers doing the ingesting (all our hot data nodes) are not even breaking a sweat. It is tricky because the performance I am most worried about is the network all this traffic travels over. I don't think the change to Elastic Agent should be much more intensive, but I don't really have a good way of ballparking the amount of data our Elastic infrastructure is responsible for in this environment.
  3. I am going to have to look more into this, because I am not sure I follow. By having multiple different URLs, would you be sending the data to multiple different clusters?
  4. I was looking at this documentation and was a little frustrated, as it is specifically geared toward clusters run in the cloud. Unfortunately, we are working with a self-managed cluster, and the material in that doc doesn't really apply to us.
  5. This will be interesting to see how they roll this out.

Thanks again for your response and getting back to me.

Mukesh,

Do you know if the Fleet and Agent team has compared the network traffic of Fleet/Elastic Agent with the old setup (Beats, Logstash, Elasticsearch, etc.)?

We don't use Fleet at all at this time. Everything we have is self-managed. We have about 5,000 endpoints sending Winlogbeat data to six different Logstash servers, depending on location. I think this also reduces the strain on the network a little, though as I write this I am not sure that theory actually holds. Anyhow, once the data hits the Logstash servers, it is all forwarded on to the Elastic cluster. The logic on our Logstash servers is fairly simple at the moment, but we are working on several enrichments and pieces of logic to replace functionality we lost when we migrated from Splunk (and the lookup tables that existed there). Replicating that in the Fleet architecture is another concern.
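For what it's worth, the Splunk lookup-table use case above can often be covered by Logstash's translate filter, which maps a field's value through a dictionary file. A rough sketch; the field names and dictionary path here are hypothetical, and older plugin versions use `field`/`destination` instead of `source`/`target`:

```
filter {
  translate {
    # Look up the value of [user][name] in a YAML dictionary
    # and write the match into [user][department].
    source          => "[user][name]"
    target          => "[user][department]"
    dictionary_path => "/etc/logstash/lookups/departments.yml"
    fallback        => "unknown"
  }
}
```

A similar effect is possible on the Elasticsearch side with enrich processors in ingest pipelines, which would keep working if Logstash is eventually removed from the flow.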

Using the Logstash output architecture in 8.4, will we still be able to use Fleet to initiate upgrades of the agents? If that is the case, I am definitely feeling a little better about the migration.

Thanks for your time.

Hi @alaine

Regarding your questions:

  1. I am curious about this one. I spent a little time looking at the documentation, but it seems like this would interfere with some of the Elastic Agent functionality. One of the things we are excited about with Elastic Agent is the ability to push upgrades to our Beats infrastructure (or, at that point, to our Elastic Agent infrastructure). It seems like if you have Logstash in the data flow, you won't be able to communicate directly with the endpoints to effect the changes you want in your environment.

For this one, communication with the Elastic Agents shouldn't be affected. The Agents use the Fleet URL for all management communication, such as configuration and upgrades. The Elasticsearch/Logstash "outputs" are just where the data collected by the Agent gets sent.

  2. I think this is an interesting suggestion, but at this point I am not worried about the load on my data nodes. We have pretty beefy servers right now, and even though we are ingesting at a pretty high rate, the servers doing the ingesting (all our hot data nodes) are not even breaking a sweat. It is tricky because the performance I am most worried about is the network all this traffic travels over. I don't think the change to Elastic Agent should be much more intensive, but I don't really have a good way of ballparking the amount of data our Elastic infrastructure is responsible for in this environment.

For this one, if you set up the same integrations as the modules you have on your Beats, I would expect the network utilization to be roughly the same. If possible, you can monitor network usage between Beats and Logstash; I think that would provide a similar estimate for Elastic Agent to Elasticsearch.
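On the ballparking point: once you have a measured per-host event rate and average event size (from Beats monitoring or Logstash metrics), a back-of-the-envelope estimate is straightforward. All the numbers below are made-up placeholders; substitute your own measurements:

```python
# Rough daily ingest volume estimate for an agent fleet.
# All inputs are hypothetical placeholders; substitute measured values.

def daily_ingest_gb(endpoints: int, events_per_sec: float, avg_event_bytes: int) -> float:
    """Estimate total decimal GB/day shipped by `endpoints` hosts."""
    bytes_per_day = endpoints * events_per_sec * avg_event_bytes * 86_400
    return bytes_per_day / 1e9

# Example: 5,000 endpoints, 10 events/sec each, ~1 KB per event
print(daily_ingest_gb(5000, 10, 1000))  # -> 4320.0 (GB/day)
```

This ignores compression (the Elasticsearch and Logstash outputs compress on the wire by default), so treat it as an upper bound for network planning.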

  3. I am going to have to look more into this, because I am not sure I follow. By having multiple different URLs, would you be sending the data to multiple different clusters?

Not exactly. You could put a proxy in front, so that one URL hits nodes X, Y, Z of the cluster and another URL hits nodes A, B, C. Or you could just set a URL per node. Output URLs are either cascading (the first must fail before the second is tried) or round-robin; I wasn't able to find much in the docs on this one. Someone on the Elastic end might be able to provide more context here.
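To make the per-policy idea concrete, the output settings end up looking roughly like the sketch below, with one policy pointed at each output. The host names are hypothetical, and exact keys may differ between 8.x releases:

```yaml
# Hypothetical sketch: two outputs into the SAME cluster via different nodes,
# letting you steer each agent policy's traffic by assigning it an output.
outputs:
  site_a:
    type: elasticsearch
    hosts:
      - "https://es-coord-1.example.internal:9200"
      - "https://es-coord-2.example.internal:9200"
  site_b:
    type: elasticsearch
    hosts:
      - "https://es-coord-3.example.internal:9200"
```

Pointing these at dedicated coordinating/ingest nodes (per point 2 earlier) is one way to keep the connection load off the data nodes.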

  4. I was looking at this documentation and was a little frustrated, as it is specifically geared toward clusters run in the cloud. Unfortunately, we are working with a self-managed cluster, and the material in that doc doesn't really apply to us.

See this issue for more info on this topic: [Discuss] Using multiple fleet-servers · Issue #903 · elastic/fleet-server · GitHub