I started writing this as a reply on the GitHub issue "Offer a lightweight Elastic Agent" (Issue #3364, elastic/elastic-agent), but then it morphed into its own thing and I decided it would be better off here. I really hope this is taken in a constructive manner. I sure tried to word everything in at least a neutral tone...
The following is not meant to be anything more than me airing some thoughts, and maybe triggering some constructive discussion about things. Emphasis on the constructive part.
Why So Much Space
I was just thinking about Agent's use of space again since I clicked on one of the notification emails for this issue. I started really wondering why Elastic is so set on Agent being all in one. As in, why does it have to do everything from metrics, to logs, to endpoint protection?
Is there any kind of documentation on the overall architecture and why it was designed that way? I did a little searching, but couldn't find anything.
I know there was mention of network issues causing difficulties deploying Agent, but is there more to it than that?
Flaky networks seem like something that should not be Agent's problem to solve. Right?
I also wonder how much overhead it would actually create if Agent supported downloading on demand. The components are already built for being deployed separately, so why can't Agent just act as a configuration tool that can deploy them when it needs to?
(Also, there is at least one similar agent out there that can do most of what Agent does in less than 50 MB. I'd switch to it, but that'd lose the pre-built Kibana dashboards/visualizations/etc. that are so very helpful.)
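To make the download-on-demand idea a bit more concrete, here is a rough Python sketch of what I have in mind. It is not how Agent works today, and the input-to-component mapping and the function names are made up purely for illustration. The point is only that a policy already tells you which components are needed, so anything not listed would never have to touch the disk.

```python
# Hypothetical mapping from policy input types to the component that would
# handle them. Illustrative only; not taken from Agent's actual internals.
COMPONENT_FOR_INPUT = {
    "logfile": "filebeat",
    "system/metrics": "metricbeat",
    "endpoint": "endpoint-security",
}


def required_components(policy_inputs):
    """Return the set of components needed by the given policy inputs.

    Input types missing from the mapping are skipped in this sketch.
    """
    return {
        COMPONENT_FOR_INPUT[input_type]
        for input_type in policy_inputs
        if input_type in COMPONENT_FOR_INPUT
    }


def plan_downloads(policy_inputs, installed):
    """Return a sorted list of components that still need to be fetched."""
    needed = required_components(policy_inputs)
    return sorted(needed - set(installed))


if __name__ == "__main__":
    # A node that only collects logs and metrics, with Filebeat already present.
    print(plan_downloads(["logfile", "system/metrics"], installed=["filebeat"]))
```

With something like this, a logs-and-metrics node would never pull down the endpoint protection bits at all.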
What are the actual benefits?
In addition to the disk space issues just from installing Agent, I've honestly only ever found Agent to cause problems, rather than do anything better than when I wasn't using Agent.
Well, maybe it'd be nice if I had Windows servers to deploy it to. Upgrading via Fleet might be easier than other methods on Windows.
So what does Agent do that is worth it?
My negative experiences
A few of the things I've run into over the years:
Feature drift between Agent and Beats
Some features work only with Agent, and some only with the Beats. Docker log autodiscovery/module configuration (as of the last time I tried a few months ago) is not supported by Agent, while it is supported by Filebeat. Time Series Data Streams do not have any kind of auto configuration support in Metricbeat, but they do in Agent. Those two issues mean I'm running Agent and Filebeat (where I have the disk space) at the same time.
I'm also pretty sure I've run into integrations that don't have equivalent Beats modules or inputs. I'm guessing those integrations are just custom config for the Beats, but Agent encrypts its config, and even when I've succeeded in decrypting it, I haven't been able to figure out how everything is put together in a way that lets me translate it to what I do with normal Beats.
Can't actually configure it properly
In addition to the disk space issue, Agent doesn't support configuring all the settings it has via the Fleet UI. It also doesn't support applying those settings via the yaml config without manually re-enrolling the Agent one node at a time (at least I think you can apply it that way). See this issue.
Web UI based configuration
Web UI based configuration is painful when I'm used to being able to just use Ansible to template a text config file. Especially when you need to disable, or modify, just one small bit of a policy for a single node, or a couple of nodes. With Ansible it's pretty much copy/paste/edit a couple lines, and you're done. With the UI, it's clone the policy, then click through who knows how many forms to adjust the settings you need to adjust. Then, if you need to add a new integration to all your nodes, you have to add it to all your existing policies one at a time...
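For anyone who hasn't worked this way, here is a minimal Python sketch of the "shared base plus small per-host override" pattern I get from templating. It is not Ansible itself and the config keys and host names are invented for illustration, not real Agent or Beats settings. It just shows why a one-node tweak is a couple of lines rather than a cloned policy.

```python
from copy import deepcopy


def merge_overrides(base, overrides):
    """Return a copy of base with overrides applied recursively."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_overrides(merged[key], value)
        else:
            # Otherwise the override simply replaces the base value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical settings shared by every node.
BASE_CONFIG = {
    "logs": {"enabled": True, "paths": ["/var/log/*.log"]},
    "metrics": {"enabled": True, "period": "10s"},
}

# Hypothetical per-host tweaks: only the differences are written down.
HOST_OVERRIDES = {
    "db01": {"metrics": {"period": "60s"}},
    "edge01": {"logs": {"enabled": False}},
}


def config_for(host):
    """Build the effective config for one host."""
    return merge_overrides(BASE_CONFIG, HOST_OVERRIDES.get(host, {}))


if __name__ == "__main__":
    for host in ("db01", "edge01", "web01"):
        print(host, config_for(host))
```

Adding something new for every node is then a single change to the base, and each host keeps its own small overrides without anything being duplicated.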
End
Anyway, the reason I'm posting this is that I really do like the ELKStack. I just keep running into frustrating roadblocks that prevent me from making full use of it. Agent has been one of those roadblocks for some important systems that I really want monitored, but can't because of the issues I've mentioned here.
The ELKStack is still the best for aggregating logs and metrics into one system, as far as I can tell, so I really want to see it get better. I'm not a Java/Go dev, and a subscription is not possible at work (I tried for months to get one, but it just wasn't possible to afford the subscription that had the stuff we actually needed), so the most I can do to support the ELKStack is try to use it, and try to make decent suggestions and help requests. And, well, in this post, complain a bit. :\
(Off topic, I really really really wish Elastic would release a Logs/Metrics only edition. Something optimized for JUST logs and metrics aggregation and monitoring. Ideally designed to work on just 1 node.)
Thanks to anyone who took the time to read this. Have a good day!