What are the actual benefits of Elastic Agent?

I started writing this as a reply on this issue: Offer a lightweight Elastic Agent · Issue #3364 · elastic/elastic-agent · GitHub, but it morphed into its own thing and I decided it would be better off here. I really hope this is taken in a constructive manner; I sure tried to word everything in at least a neutral tone...

The following is not meant to be anything more than me airing some thoughts, and maybe triggering some constructive discussion about things. Emphasis on the constructive part.

Why So Much Space

I was just thinking about Agent's disk space usage again, since I clicked on one of the notification emails for this issue. I started really wondering why Elastic is so set on Agent being all-in-one. As in, why does it have to do everything from metrics, to logs, to endpoint protection?

Is there any kind of documentation on the overall architecture and why it was designed that way? I did a little searching, but couldn't find anything.

I know there was mention of network issues causing difficulties deploying Agent, but is there more to it than that?

Flaky networks seem like something that should not be Agent's problem to solve. Right?

I also wonder how much overhead it would actually create if Agent supported downloading components on demand. The components are already built to be deployed separately, so why can't Agent just act as a configuration tool that deploys them when it needs to?

(Also, there is at least one similar agent out there that can do most of what Agent does in less than 50 MB. I'd switch to it, but I'd lose the pre-built Kibana dashboards/visualizations/etc. that are so very helpful.)

What are the actual benefits?

In addition to the disk space issues just from installing Agent, I've honestly only ever found Agent to cause problems, rather than do anything better than my setup did before Agent.

Well, maybe it'd be nice if I had Windows servers to deploy it to. Upgrading via Fleet might be easier than other methods on Windows.

So what does Agent do that is worth it?

My negative experiences

A few of the things I've run into over the years:

Feature drift between Agent and Beats

Some features work only with Agent, and some only with the Beats. Docker log autodiscovery/module configuration (as of the last time I tried, a few months ago) is supported by Filebeat but not by Agent. Time Series Data Streams have no auto-configuration support in Metricbeat, but they do in Agent. Those two issues mean I'm running Agent and Filebeat at the same time (where I have the disk space).

I'm also pretty sure I've run into integrations that don't have equivalent Beats modules or inputs. I'm guessing those integrations are just custom config for the Beats, but Agent encrypts its config, and even when I've managed to decrypt it, I haven't been able to figure out how everything fits together well enough to translate it into what I do with normal Beats.

Can't actually configure it properly

Besides the disk space issue, Agent doesn't support configuring all of its settings via the Fleet UI. It also doesn't support applying those settings via the YAML config without manually re-enrolling the Agent one node at a time (at least, I think you can apply them that way). See this issue.

Web UI based configuration

Web UI based configuration is painful when I'm used to just using Ansible to template a text config file, especially when you need to disable or modify one small bit of a policy for a single node or a couple of nodes. With Ansible it's pretty much copy/paste/edit a couple of lines, and you're done. With the UI, it's clone the policy, then click through who knows how many forms to adjust the settings you need. Then, if you need to add a new integration to all your nodes, you have to add it to each of your existing policies one at a time....


Anyway, the reason I'm posting this is that I really do like the ELKStack. I just keep running into frustrating roadblocks that prevent me from making full use of it. Agent has been one of those roadblocks for some important systems that I really want monitored, but can't because of the issues I've mentioned here.

The ELKStack is still the best option for aggregating logs and metrics into one system, as far as I can tell, so I really want to see it get better. I'm not a Java/Go dev, and a subscription isn't possible at work (I tried for months to get one, but we just couldn't afford the tier that had the features we actually needed), so the most I can do to support the ELKStack is to use it and try to make decent suggestions and help requests. And, well, in this post, complain a bit. :\

(Off topic: I really, really, really wish Elastic would release a Logs/Metrics-only edition. Something optimized for JUST log and metrics aggregation and monitoring, ideally designed to work on just one node.)

Thanks to anyone who took the time to read this. :slight_smile: Have a good day!

We are starting to use the Terraform module from Elastic that supports Fleet: Terraform Registry
If there is no current version of Elastic for Ansible, maybe you are better off using the API directly? It may be more effort at first, but it may solve the issues you have with the UI: Kibana Fleet APIs | Fleet and Elastic Agent Guide [8.13] | Elastic
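For anyone going the API route, here is a minimal Python sketch (standard library only) of listing agent policies through the Fleet API instead of clicking through the UI. It assumes API-key authentication and the `GET /api/fleet/agent_policies` endpoint with `page`/`perPage` paging and a response containing `items` and `total`; check the Fleet API reference for your Kibana version. `fleet_request`, `call`, and `list_agent_policies` are illustrative helper names, not part of any Elastic library.

```python
import json
import urllib.request


def fleet_request(base_url, path, api_key, method="GET", body=None):
    """Build (but do not send) a request for the Kibana Fleet API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(base_url.rstrip("/") + path, data=data, method=method)
    req.add_header("Authorization", f"ApiKey {api_key}")
    # Kibana rejects write requests (POST/PUT/DELETE) without this header.
    req.add_header("kbn-xsrf", "true")
    req.add_header("Content-Type", "application/json")
    return req


def call(req):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def list_agent_policies(base_url, api_key, per_page=100, send=call):
    """Return (id, name) for every agent policy, following pagination."""
    page, found = 1, []
    while True:
        path = f"/api/fleet/agent_policies?page={page}&perPage={per_page}"
        data = send(fleet_request(base_url, path, api_key))
        found.extend((p["id"], p["name"]) for p in data["items"])
        # Stop at the last page (or if the server returns an empty page).
        if not data["items"] or page * per_page >= data["total"]:
            return found
        page += 1
```

Something like `list_agent_policies("https://kibana.example.com:5601", "<encoded-api-key>")` (placeholder URL and key) would return a list of `(id, name)` pairs; the `send` parameter only exists so the paging logic can be exercised without a live Kibana.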

In my case the agent is worth it for the time it saves me managing and deploying different kinds of configurations to thousands of agents, including servers and workstations.

I do not have admin permissions on all servers because there are multiple teams, so pushing the configuration locally would be a nightmare. With the agent, I just need the infra team to install it, and then I can use Fleet to configure what will be collected from each host.

It was necessary to build some automations using the API for things like automatically moving agents between policies, notifying us when an agent is added, and automatically adding tags based on the policy name.
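Automations like these can be sketched against the same API. This is a minimal, hypothetical Python example: it derives tags from a policy name (assuming a naming convention such as `linux-web-prod`, invented here for illustration) and builds the requests that update an agent's tags and reassign an agent to another policy. It assumes `PUT /api/fleet/agents/{id}` accepts a `tags` list that replaces the existing one, and that `POST /api/fleet/agents/{id}/reassign` accepts a `policy_id`, as described in the 8.x Fleet API docs. The requests are only built here; sending them (e.g. with `urllib.request.urlopen`) is left to the caller.

```python
import json
import urllib.request


def fleet_request(base_url, path, api_key, method="GET", body=None):
    """Build (but do not send) a request for the Kibana Fleet API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(base_url.rstrip("/") + path, data=data, method=method)
    req.add_header("Authorization", f"ApiKey {api_key}")
    req.add_header("kbn-xsrf", "true")  # required by Kibana for write requests
    req.add_header("Content-Type", "application/json")
    return req


def tags_from_policy_name(policy_name, sep="-"):
    """Hypothetical convention: 'Linux-Web-Prod' -> ['linux', 'web', 'prod']."""
    return [part.lower() for part in policy_name.split(sep) if part]


def tag_update_request(base_url, api_key, agent, policy_names):
    """Build a PUT adding policy-derived tags to an agent, or None if it has them all.

    `agent` is one entry from GET /api/fleet/agents; `policy_names` maps policy id -> name.
    Existing tags are merged in, since the PUT is assumed to replace the whole list.
    """
    current = agent.get("tags", [])
    merged = list(current)
    for tag in tags_from_policy_name(policy_names[agent["policy_id"]]):
        if tag not in merged:
            merged.append(tag)
    if merged == current:
        return None
    path = f"/api/fleet/agents/{agent['id']}"
    return fleet_request(base_url, path, api_key, "PUT", {"tags": merged})


def reassign_request(base_url, api_key, agent_id, policy_id):
    """Build a POST that moves an agent to another agent policy."""
    path = f"/api/fleet/agents/{agent_id}/reassign"
    return fleet_request(base_url, path, api_key, "POST", {"policy_id": policy_id})
```

Keeping request building separate from sending makes it easy to log or dry-run a batch of changes before touching thousands of agents.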

We also had some issues and negative experiences similar to yours:

Disk space requirements

This was the main one when we started using it, as it created some friction with the infra teams: most of the Linux servers have small OS disks, and we had to increase the disk size of thousands of servers just to be able to install the agent.

I also do not understand the "one agent to rule them all" approach Elastic chose. As mentioned in a comment on the GitHub issue linked in the first post, there are at least two different uses for the Agent: one is collecting logs and metrics from edge hosts, servers, or workstations, and the other is collecting logs from external sources like TCP, UDP, Kafka, EventHub, API endpoints, etc.

This could be two different tools, a smaller one to get logs and metrics and another one with more features. Or Elastic could even improve Logstash and make it the default tool for that kind of log collection, but unfortunately Logstash feels abandoned.

In my case I just need to get logs from servers and workstations, but I need to have the full agent installed.

Feature drift between Agent and Beats

Well, I think that sooner or later the individual Beats will be deprecated and phased out, so I would not expect every integration to have an equivalent Filebeat module, since the focus is now on the Agent.

The main issue here for me is the documentation. Sometimes you need to look at the Filebeat documentation to do something in the Agent, and the integrations documentation lacks a lot of things, mostly examples: there are basically no configuration examples in it.

Can't actually configure it properly

Yeah, I also suffered with this. Some things are not exposed in the Fleet UI and can only be done through the API; I opened at least two issues with feature requests about things like that.

My two biggest issues with the Agent are related to customizing pipelines, mappings, or lifecycle policies, and to how the integrations work.

Customizations in the Agent need to be done at the dataset level, and some integrations have a lot of datasets. For example, if I want to add a custom field to the Google Workspace integration, I would need to create, edit, and manage around 14 custom templates, because this integration has 14 datasets.

Fortunately, Elastic also sees this as an issue, and adding more levels of customization is being worked on and tracked here.

Another issue is that integrations cannot be shared across policies: if I have 10 policies with the System integration and need to make a change to it, I have to edit 10 integrations.

If it were possible to share integrations across policies, things would be much easier and you wouldn't end up with so many duplicate integrations.

I'm not sure if there is any progress on this on Elastic's side, as I haven't looked for an issue about it.
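Until integrations can be shared, a script can at least make the repeated edit across every copy. This is a minimal Python sketch, assuming the package policies have already been fetched from `GET /api/fleet/package_policies` (paged like the agent policy list) and are written back one at a time with `PUT /api/fleet/package_policies/{id}`. The set of read-only fields stripped before the PUT is an assumption that varies between Kibana versions, so try it against a non-production policy first. Function names are hypothetical.

```python
import copy
import json
import urllib.request

# Assumption: fields Fleet manages itself and that are dropped before a PUT.
# The exact set differs between Kibana versions; check the Fleet API docs.
READ_ONLY_FIELDS = {"id", "revision", "created_at", "created_by", "updated_at", "updated_by"}


def fleet_request(base_url, path, api_key, method="GET", body=None):
    """Build (but do not send) a request for the Kibana Fleet API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(base_url.rstrip("/") + path, data=data, method=method)
    req.add_header("Authorization", f"ApiKey {api_key}")
    req.add_header("kbn-xsrf", "true")  # required by Kibana for write requests
    req.add_header("Content-Type", "application/json")
    return req


def with_stream_var(pkg_policy, dataset, var_name, value):
    """Return a PUT-ready copy with an existing `var_name` set on every stream of
    `dataset`, or None if no such stream/variable exists. The input is not mutated."""
    body = {k: v for k, v in copy.deepcopy(pkg_policy).items() if k not in READ_ONLY_FIELDS}
    changed = False
    for inp in body.get("inputs", []):
        inp.pop("compiled_input", None)  # generated by Fleet, not user input
        for stream in inp.get("streams", []):
            stream.pop("compiled_stream", None)
            stream_vars = stream.get("vars", {})
            if stream.get("data_stream", {}).get("dataset") == dataset and var_name in stream_vars:
                stream_vars[var_name]["value"] = value
                changed = True
    return body if changed else None


def update_requests(base_url, api_key, package_policies, package_name, dataset, var_name, value):
    """Build one PUT per package policy of `package_name` that needs the change."""
    reqs = []
    for pkg_policy in package_policies:
        if pkg_policy.get("package", {}).get("name") != package_name:
            continue
        body = with_stream_var(pkg_policy, dataset, var_name, value)
        if body is not None:
            path = f"/api/fleet/package_policies/{pkg_policy['id']}"
            reqs.append(fleet_request(base_url, path, api_key, "PUT", body))
    return reqs
```

For example, `update_requests(url, key, items, "system", "system.cpu", "period", "30s")` would build one PUT per System integration that has a `system.cpu` stream with a `period` variable, leaving other integrations untouched. The dataset and variable names are only illustrative.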


I looked at Terraform a while back and it didn't meet my needs. Can't recall exactly why. I should check it out again.

I have poked at Kibana's API a little bit, but haven't looked at what you can do with Fleet through it.

Thanks for pointing that out. :smiley:

Ok, that is a very good use case. Thanks!

Your mention of Elastic actively working on these issues reminds me that I do appreciate how they listen to feedback and do a lot of work based on it. I think time series data streams were a result of a lot of people complaining about how much disk space you needed for storage, and TSDS really helped with that. When I switched to them, the space savings for metrics were huge.

There is a reason I'm still using the ELKStack and haven't just switched to something else. :slight_smile:

At least you could increase that disk space. One of my frustrations with Elastic, and other freemium companies, is that they forget that some people don't have "enterprise" resources available.

This right here, for us too. I'm at the stage of running both Agent and a separate Filebeat instance (where feasible) until the Agent integrations mature enough. Some integrations work out of the box, others are buggy (I'm putting in tickets on those), and some don't offer file ingestion yet, unlike their counterpart Filebeat modules (there are workarounds for this, but they would be a bit of a time commitment).

Glad we're not alone on this, and as someone mentioned, it's a newer direction that will need some time to mature, along with customer feedback.

Best wishes to those dipping into elastic agents.
