Elastic Agent 7.14 -- Strange bug during enrollment "Elastic fleet agent bug"

Elastic Agent 7.14, on-prem Elastic cluster. Affected OS: Windows Server 2016/2019; both have the same issue.
The error below was pulled directly from a failed enrollment attempt on the client.
Fleet agent in prod is running on CentOS Stream. Test Fleet server was Ubuntu.

Steps used:

  1. Remove 7.13.4
  2. Install windows updates.
  3. Reboot server.
  4. Delete the existing Agent folder from C:\Pro~\Elastic\Agent
  5. Enroll agent from the 7.14 download.
  6. Start chanting no whammies, no whammies, no whammies.
  7. If a whammy occurs, go to Fleet and unenroll the failed agent. If you attempt to re-run the enrollment instead, you will end up with duplicate agents, and the agent will be stuck in the updating state forever.
  8. Restart the Fleet Server agent, then re-enroll.

//
2021-08-16T11:46:18.366-0700 ERROR cmd/watch.go:61 failed to load markeropen C:\Program Files\Elastic\Agent\data.update-marker: The system cannot find the file specified.
2021-08-16T11:46:18.591-0700 INFO [composable.providers.docker] docker/docker.go:43 Docker provider skipped, unable to connect: protocol not available
2021-08-16T11:46:18.593-0700 INFO [api] api/server.go:62 Starting stats endpoint
2021-08-16T11:46:18.594-0700 INFO application/managed_mode.go:291 Agent is starting
2021-08-16T11:46:18.594-0700 INFO [api] api/server.go:64 Metrics endpoint listening on: \.\pipe\elastic-agent (configured: npipe:///elastic-agent)
2021-08-16T11:46:18.694-0700 WARN application/managed_mode.go:304 failed to ack update open C:\Program Files\Elastic\Agent\data.update-marker: The system cannot find the file specified.
2021-08-16T11:46:19.069-0700 WARN [tls] tlscommon/tls_config.go:98 SSL/TLS verifications disabled.
2021-08-16T11:46:19.330-0700 ERROR fleet/fleet_gateway.go:205 Could not communicate with fleet-server Checking API will retry, error: status code: 400, fleet-server returned an error: BadRequest
//

This happens when the Elastic Fleet agent hits 600+ MB of memory usage. Restart the service and you are good to go; the error will not show again until the Fleet agent is back above 600 MB.
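The restart-at-600-MB workaround described above can be scripted as a crude watchdog. This is a sketch built only on the thread's observations, not a fix; the threshold is the observed failure point, and only the pure threshold check is shown (how you sample the process RSS and restart the service is left out):

```python
# Sketch of the manual workaround above: restart the Fleet agent service
# once its resident memory crosses ~600 MB. The threshold comes from the
# observations in this thread; service-restart plumbing is omitted.

RESTART_THRESHOLD_MB = 600  # observed failure point

def should_restart(rss_bytes: int, threshold_mb: int = RESTART_THRESHOLD_MB) -> bool:
    """Return True once the agent's resident set size crosses the threshold."""
    return rss_bytes >= threshold_mb * 1024 * 1024

# Example: 670 MB (the Dev environment figure mentioned later) trips it.
print(should_restart(670 * 1024 * 1024))  # True
print(should_restart(512 * 1024 * 1024))  # False
```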

After the fleet agent restarts in Kibana GUI:
circuit_breaking_exception: [circuit_breaking_exception] Reason: [in_flight_requests] Data too large, data for [<http_request>] would be [8875977956/8.2gb], which is larger than the limit of [8589934592/8gb]
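Incidentally, that breaker message embeds both the rejected request size and the parent limit in `[bytes/human]` pairs, so it can be parsed mechanically when pulling these errors out of logs. A small sketch (the variable names are mine; the message text is copied from the error above):

```python
import re

# Parse the would-be request size and the breaker limit out of a
# circuit_breaking_exception message like the one above.
msg = ("circuit_breaking_exception: [circuit_breaking_exception] Reason: "
       "[in_flight_requests] Data too large, data for [<http_request>] would be "
       "[8875977956/8.2gb], which is larger than the limit of [8589934592/8gb]")

m = re.search(r"would be \[(\d+)/[^\]]+\], which is larger than the limit of \[(\d+)/", msg)
would_be, limit = int(m.group(1)), int(m.group(2))
overshoot_mb = (would_be - limit) / (1024 ** 2)
print(f"over limit by {overshoot_mb:.0f} MB")  # ~273 MB over the 8 GB breaker
```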

This was also causing another issue I posted about: excessive network sessions. I was finally able to track that down to the same issue above. When the Fleet agent stalls out, clients or the Fleet agent itself start opening thousands of network sessions. The average per agent is 17,000, which works out to about 1 GB of traffic per agent. I mean, it's fun seeing whether your network is built to handle massive session counts, but it's not good for production, as you can saturate uplinks.
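As a sanity check of those numbers (a back-of-the-envelope calculation from the figures above, not a measurement):

```python
# ~17,000 sessions per agent adding up to ~1 GB of traffic implies roughly
# 60 KiB per session on average. Both inputs come from the observation above.

sessions_per_agent = 17_000
traffic_per_agent = 1 * 1024 ** 3  # ~1 GiB per agent, as reported

avg_bytes_per_session = traffic_per_agent / sessions_per_agent
print(f"{avg_bytes_per_session / 1024:.1f} KiB per session")  # ≈ 61.7 KiB
```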

Hi @PublicName

Thanks for sharing. There's a similar issue we are fixing, detailed here: https://discuss.elastic.co/t/unable-to-communicate-with-fleet-server-after-upgrade-to-7-14/280388

To help us triage this faster, how large would you say the policy is that you are using? Do you see the same issue with a smaller policy? (More details here: Fleet: policy aren't assigned to agents (flaky) · Issue #27299 · elastic/beats · GitHub)

thanks

Policy size is small, near defaults. The only thing I have added is a path to my CA for Endpoint, as it never wants to read from the Windows cert stores correctly. I can change them at will and not hit the same issues described in the GitHub issue. What is interesting is the overall similarity of the failure, and I wouldn't rule out it being the same root cause.

Policy size does not seem to matter. I had one test group, assuming that was a possible cause, and it didn't seem to matter; that policy was basically a spam fest of everything I could throw at it just to see what works and what doesn't. OSQuery is not one of the friendly ones: you will have to destroy the policy or your agents never want to recover.

Agent count in policy = 23
Agent versions - 7.13.4 - upgrading to 7.14
Integrations: Windows with PowerShell and Sysmon, plus Endpoint in prevent mode.
Disabled metrics and logs for the agent, as it was causing excessive amounts of duplicate logs, and disabled network logging on Endpoint.

I do use the event filters heavily, since I don't need to see every dropped packet from the Windows firewall, so that is stripped by the filters. It seems to only affect ingest rate, nothing else.

The compression may well be causing issues as well. That may be what's behind the massive network spikes, which just now made me scramble to remove Elastic Agent + Endpoint from several dozen machines.

I wouldn't mind seeing a notice sent out about not using 7.14 in production due to severe issues. To be honest, it shouldn't have been pushed to GA yet. It's still way too unstable to be considered anything past beta...

Could you share by chance the fleet-server logs?

Hey @PublicName - thanks for continuing to work with us. Can you explain a bit more about your osquery comment above? cc: @Melissa_Burpo - the PM for our osquery manager.

@ruflin I will. It will be a few days; I'm going to attempt to trigger it on a physically separate network so I can get a packet capture at the same time. I've pissed off the user base a little too much to want to trigger it in the office... Rather keep my job.

@bradenpreston

A couple of initial things with OSQuery. It seems hit or miss whether it will download the hash file. You just have to wait it out, which can take up to an hour to complete. During this time the expected yellow "unhealthy" warning will be listed for the agent. This is on a 1 Gb down/1 Gb up circuit, mind you, that is less than 3% utilized during the day. If you watch the downloads folder, it will pull down the zip, then delete it and throw several errors saying the hash doesn't match. Just wait that part out.

When you have OSQuery and Endpoint in their default form, you end up with massive amounts of logs, which will peg the CPU on any machine they run on, as Endpoint steps on it. At that point you have two choices: let Endpoint eat up a large part of your CPU and RAM ("Filebeat is the offender"), after which the Agent starts getting overwhelmed as it sends thousands of events and its own CPU and RAM usage starts to climb. If you run a really messed-up query, something way too aggressive like listing every file on a machine that has been in service for 3+ years and then doing some WMI calls... you get the idea; it's an almost instant failure of the agent. Not directly OSQuery's fault, but it's a loop that you get stuck in. From that point on, even after you delete the policy, the agents are stuck in a permanent unhealthy state and will continue to eat away at RAM and CPU on the host. The only way I was able to get around it was to move the agents to a new policy and then delete the old one. During this time I did check the Fleet policy on the host machine, and the reference to OSQuery was cleared. Restarting the agent gives you a moment of sanity.

Now, it's worth mentioning that 95% of the VMs I run are Server 2016/19 with 2 vCPU + 8 GB RAM. This is most common, as things like file servers, web app proxies, CAs, and DCs don't need more than that. Running the same settings on a Ryzen 5800X with NVMe drives running Windows 10 (current release) resulted in the same spike, but it calmed down after 30 minutes. It may come down to simply having enough compute to beat out the next scheduled job.

It appears that if your SACLs are set to audit changes, you end up in a very bad situation. I have to have them on for some of our servers. That's the one difference that I'm going to guess is not common for most people or situations. By itself you see a small difference in IO load on a machine compared to one that doesn't have it enabled.

I have created: 7.14 Fleet server commun to track this enrollment issue

Hi @PublicName, we're definitely interested to investigate this issue. If available, could you please share the sets of queries you ran that lead to the agent failing? We'd like to try and set up a system where we can replicate the problem and see how we might handle it better. Thanks for any additional info!

@PublicName hi.

Hi, we've tested on all of the supported OSes, and with the details we have so far I'm not seeing the same problems.

So, I wanted to ask for some clarification, if I could. I know there is very good discussion here in this thread, so thank you in advance; I hope nothing is a repeat, but apologies if it is!

Is there any way we can clarify what was being tested, too: on-prem vs 'prod' vs test agent? I don't fully understand; I may be missing context or not making the intended assumptions.

Also, I don't know what certs/CA usage is in play here, and it is easy to have mistakes/problems between Windows and non-Windows in that regard. It would be helpful to confirm the basics and start at the beginning: I'd like to see the (redacted) strings used on the command line to install Fleet Server, as well as the command used in PowerShell or command prompt in Windows to deploy the Agents.

And I don't mean to be difficult, but I do want to call out the Troubleshooting and FAQ questions in our formal docs. If we are stuck on assessing a given point, I'd personally love to know where and what we can add to make it better. Here is the troubleshooting doc: Troubleshoot common problems | Fleet User Guide [7.14] | Elastic

If we can move the discussion to the Git ticket that was logged, it may help; do you mind?

Regards

@Melissa_Burpo I'll send you a PM with the query. It's perfectly fine to laugh when you see it. I know I do every time I even attempt to make one.

Not being difficult at all @EricDavisX.

"Strange bug" is why it's a problem to find and reproduce. Coding is fun; bugs even more so...

Reference for CA to clear that part up:
Fleet Settings, Elasticsearch output configuration (YAML) -- add the following line:
ssl.certificate_authorities: ["C:/Program Files/Elastic/Agent/ca.crt"]

For some reason the agent just never uses the Windows cert store in my environment, in prod or dev. I have dozens of different apps that use it, so I know my certs are good. I even check that revocation functions correctly on a routine basis.

I've been running the self-signed cert for Fleet with --insecure behind the PowerShell enrollment process. I know it's terrible security practice. After the enrollment process stops changing between versions, I'll issue proper certs. Using --insecure I don't have SSL issues, as long as ca.crt is located in the path above for the Elasticsearch connection.

The agent is installed from the RPM command shown when you set up a Fleet policy. Copy/paste, nothing fancy, no custom options required. I'll have to see if the agent string is still in bash history; I'll send it to you in a PM.

Normal PowerShell install string:
.\elastic-agent.exe install -f --url=https://someservernamehere:8200 --enrollment-token=reallylongapikeyhere --insecure

Hopefully that cleared that up for you.

The 600 MB Fleet failure is the puzzling one that I can't figure out. I have 3 environments I use:

  1. Production
  2. Dev
  3. Home lab

Production has 2 Fleet Servers in use, and the agents are distributed roughly 50/50 between them. The first signs of failure show in Fleet: the status for all agents will blip over to offline for a few minutes, then back on. Restart the service and it won't happen again for a while. It's not time-based that I can tell, or even locked to agent count.

Dev has Fleet, Kibana, and Elasticsearch all on one VM. Same issue: just after 600 MB it fails, but not every time. It may run for a day sitting at 670 MB, then just give up. Memory usage still shows 6 GB+ free on the host.

In the home lab I have the same error message in the logs, but trying to reproduce it in that small a setup is proving next to impossible.

I'll start moving over to Git, but it's blocked where I work, so it's after hours only.

Do you have a guide posted for the supported certificate types? For example, the bulk of the world uses RSA keys, while other industries use EC certificates. We ran into an issue with a few apps years back and were forced into switching to EC enterprise-wide. I've asked a few times before with no direct answer, and I haven't found it in my quick search over at Git. I do like to point this part out! I've been burned before; thank you, Cisco products, for not using one standard across all platforms.

@Nima_Rezainia @EricDavisX

Cause of failure for the agent at 600 MB+, then failing:
I've been able to reproduce, in 3 separate environments, a failure of the Elastic Fleet agent. And its cause is pretty simple, at least as a test root cause... I hate coding, so I won't look.

I have 4 different agent versions. Please note this is not ideal, but Fleet upgrades still fail extremely often, and with 7.14 requiring a reboot it's not like I can reboot at will. It's a scheduled process in production.

7.13.2 - Most often causes the 7.14 issue and forces agent memory up over time until crashing. From my testing, it takes 1 agent to do it.
7.13.3 - For some reason has no issues and works just fine.
7.13.4 - Existing agent does not report to 7.14 at all. No data shows in Elastic, despite the agent logs showing it sending 500 documents every minute. What I have not tested is installing 7.13.4 directly against Fleet 7.14.
7.14 - Works.

The 7.14 transform rule is causing massive CPU usage, to the point of detrimental results across the entire cluster.
3-node cluster, each with 16 vCPUs and 32 GB RAM, backed by SSD only. Average CPU on 7.13.4 was 16% utilization with ~250 agents and legacy Beats feeding it, with 0 issues; I could pull dashboards with 7 days of data near instantly. On 7.14, pulling 1 hour of data takes at least 18 seconds, and reports are failing. If I shut down the server running the Fleet agent and wait a few minutes for cached jobs to finish, CPU drops to normal levels. Re-enable Fleet and it spikes back to a 90%+ average. Disable the transform for just a moment and it drops instantly. After 6 agents it becomes very noticeable that something is amiss.
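For anyone reproducing this, a transform can be stopped and restarted independently of Fleet via the Elasticsearch Transform API. This sketch only assembles the request URLs; the cluster address and transform ID are placeholders, and in practice you would send the requests with curl or any authenticated HTTP client:

```python
# Build Transform API URLs for isolating a transform as the CPU culprit.
# ES endpoint and transform ID below are placeholders, not real values.

ES = "https://elastic.example.internal:9200"  # placeholder cluster URL

def transform_url(transform_id: str, action: str) -> str:
    """Build a Transform API URL: POST .../_stop or .../_start, GET .../_stats."""
    assert action in {"_stop", "_start", "_stats"}
    return f"{ES}/_transform/{transform_id}/{action}"

# e.g. POST this URL to pause the suspect transform, then _start to resume:
print(transform_url("my-suspect-transform", "_stop"))
```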

Hi, sorry for the delay in response. This is all incredibly helpful for diagnosis, thank you for taking the time. I'll review in proper depth tomorrow and may tag a team member; this has been open a while... you may need more advanced help than I can offer.

For now:

  • I do wonder if you used port 8200 for the Fleet Server enrollment string on purpose or was it a typo? Fleet Server by default runs at 8220.

  • Also, when I hear mention of an RPM install, my ears perk up; it is a source of confusion. The 'install' command cannot be used after you've installed an RPM (or deb) package. You must alter the string to use 'enroll' instead of 'install'... this is in the docs, but maybe hidden a bit. Also, I see you cite the .exe and PowerShell... in what context does the RPM artifact come into play?

  • Finally, if you are using the Agent's self-signed certs and the --insecure flag... you shouldn't even need the C:\ path CA cert you mentioned, I don't think. That's the limit of my understanding, but an Agent team engineer could confirm. It is probably fine if it is there; maybe it is ignored in this context, or maybe I'm just not clear on the usage in the larger Elastic or entire-system topology, that could be the case.

Regards

  • Correct, that was a typo; 8220 is the default.

  • Install the RPM first, then navigate to /usr/share/elastic-agent/bin and enroll. I probably should have made that clearer, my bad.

  • All agents create 2 connections; you can see it in the logs and in tcpdump/Wireshark: one connection to the Agent service and the other directly to Elasticsearch. Not sure this is intended, as it seems like it would create duplicate data if anything is sent to the agent as well. It's been this way since the start with 7.13. Even with --insecure it fails to connect to Elasticsearch unless that is self-signed as well. Connecting to the agent isn't a problem at all with --insecure.
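A quick way to confirm the per-agent connection count on a host is to tally established peers from `ss -tn` (or netstat) output. A sketch with made-up sample lines; the ports follow the Fleet Server (8220) and Elasticsearch (9200) defaults discussed in this thread:

```python
from collections import Counter

# Tally established connections by destination port from ss-style output.
# The sample lines below are fabricated for illustration only.
sample = """\
ESTAB 0 0 10.0.0.5:51000 10.0.0.10:8220
ESTAB 0 0 10.0.0.5:51002 10.0.0.11:9200
ESTAB 0 0 10.0.0.6:51004 10.0.0.10:8220
"""

by_port = Counter(line.split()[-1].rsplit(":", 1)[1] for line in sample.splitlines())
print(by_port)  # e.g. two Fleet Server connections, one Elasticsearch
```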

The CA path is due to the agent failing to validate against the Windows cert store. I wish there were a cleaner way of doing it without adding it to the Fleet settings directly. Running every agent standalone is an alternative, but one that would be way too difficult to live with day to day. (That's just for someone else who reads this later looking for a workaround.)

Hi @PublicName, thanks. I'll do a re-review of the notes and issues discussed and see if I can help with anything further myself. I say that as I *think* I'm at the boundaries of my knowledge on these topics. :slight_smile: Always glad to learn more, though.

Thanks for confirming the 8220 typo and the usage of 'enroll'; that all sounds right then. The 2 connections sound right and intended. The architecture, when we can elaborate more formally, is savvy enough not to (intentionally) send duplicate data. The Agent talks to Fleet Server, and the Agent then gives the configuration to the Beats / Endpoint, which talk to ES. The 7.13 release is when we first introduced Fleet Server, so the timing there makes sense.

I will re-review and confirm we have a bug logged for the CA item, hoping I can summarize it precisely enough. I don't think it is expected that a user would *have* to use standalone Agent mode.

@PublicName I wonder if you are using the full path for the certs or a partial path? That relates to a problem we fixed somewhat recently (I think), though I don't have the issue handy to link it and confirm merges/timing. And sorry for the delay in response (again). I'm going to tap the Agent team for any continued review.

Full path. C:\Program Files\Elastic\Agent\

I did test it on Friday (7.14.1) to see if --insecure worked with both Agent and Endpoint, just to see if anything changed. Result: negative; nothing is sent to Elasticsearch despite the agent saying it's sending 500 documents. That is after restarting Elasticsearch and the agent workstation. If you look in logs-* for the agent name or ID over the past hour, nothing is present. Add the CA path back in and within 30 minutes every agent is sending again.

That says to me that Fleet-agent-to-agent communication works just fine, but agent-to-Elasticsearch does not.
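One way to verify that symptom is to count recent documents for the agent in logs-*. A sketch of the search body; the index pattern and `agent.id` field follow the conventions Elastic Agent data streams use, and the agent ID below is a placeholder:

```python
import json

# Search body to count the last hour of documents from one agent.
# POST it to <es-url>/logs-*/_search; hits.total of 0 reproduces the
# "agent says it's sending, but nothing lands" symptom described above.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"agent.id": "00000000-0000-0000-0000-000000000000"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "size": 0,  # we only care about the hit count, not the documents
}

print(json.dumps(query, indent=2))
```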

Policies are very simple: Agent + Endpoint, that's it on 90% of them. I don't use any of the other options at the moment due to inconsistencies when the agent stops sending data. Legacy Beats have been rock solid, and considering most of my devices are PCI, I'd rather not risk the transition and lose data.

Off topic, but setting the Threat Intel module to run every hour made a huge difference on all of my Elastic clusters. With the default interval, even a fast dev machine with sub-50 machines feeding it, all NVMe SSD, would fall on its face after about a day.