Endpoint 7.13 migration to 7.13.1: lessons learned with Fleet "On-Prem" - bad

Failed: the Kibana migration from 7.13.0 to 7.13.1 failed.

Testing setup: single node, Fleet policies with Elastic Endpoint only, and agent log collection (upload) enabled in the policy settings.

When the Elastic instance went offline for updates, all agents started buffering in memory again (Filebeat). This made every machine unresponsive within 10 minutes of each other, forcing a hard crash of all test machines.

This is now the second critical failure that has taken client devices offline, each with a different root cause but the same result. The 7.13 upgrade from the previous 7.12.1 caused the drives to fill up; now it's back to memory starvation.

If I run into more I'll add on to this post.

7.13 offline agent issues. Not caused by Endpoint but by Agent.

The duplicate Filebeat process is back. This causes endless read/write activity, as one process writes updates and the other reads them.

It also makes me wonder why Elastic Endpoint runs as the user and not as SYSTEM. If you log out, the machine is no longer protected, and it's common for servers to never have anyone logged in. This just showed up in 7.13.

elastic-endpoint.exe is launched via a service as the SYSTEM user. Do you mean elastic-agent.exe or the beats are running as a user? How did you install? They, too, should be started as a service and restart on reboot.
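
If it's easier than checking by hand, here's a minimal sketch (assuming Python and the psutil package are available on the box, and assuming the relevant services have "elastic" somewhere in their name or display name) that prints which account each Elastic-related Windows service runs under:

```python
import psutil

# List Windows services whose name or display name mentions "elastic"
# (an assumption about how Agent/Endpoint register themselves) and show
# which account each one runs under. Run from an elevated prompt so all
# service details are readable.
for svc in psutil.win_service_iter():
    info = svc.as_dict()
    if "elastic" in (info["name"] + info["display_name"]).lower():
        print(f"{info['name']}: user={info['username']} "
              f"status={info['status']} binpath={info['binpath']}")
```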

Can you share details about running processes, CPU/memory usage before, during, and after the upgrade?

Are these physical test machines or VMs? How big are those?
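
For the before/during/after numbers, something like this rough sampler could be left running across the upgrade. It assumes Python with psutil on the machine, and the process names it watches are an assumption, so adjust them as needed:

```python
import time
import psutil

# Process names to watch -- an assumption about which processes matter here.
WATCH = ("elastic-agent", "elastic-endpoint", "filebeat", "metricbeat")

# Print a CSV-style line (timestamp, name, pid, cpu%, rss MB) for each matching
# process every 30 seconds. Run from an elevated prompt so SYSTEM-owned
# processes are readable. The first cpu_percent reading per process is 0.0
# and can be ignored.
while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for p in psutil.process_iter(["name", "pid", "memory_info", "cpu_percent"]):
        name = (p.info["name"] or "").lower()
        if any(w in name for w in WATCH):
            mem = p.info["memory_info"]
            rss_mb = mem.rss / (1024 * 1024) if mem else 0.0
            cpu = p.info["cpu_percent"] or 0.0
            print(f"{stamp},{p.info['name']},{p.info['pid']},"
                  f"{cpu:.1f},{rss_mb:.0f}")
    time.sleep(30)
```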

Nope, I do mean as the user. It's not all the time; I'll PM you a screenshot the next time I see one. Due to the usernames it won't be public. It's one that makes you scratch your head. It's interesting to note that you will end up with two instances of Endpoint running, which does not end well for obvious reasons.

The install is done from the zip file directly from the downloads page. It is unzipped and installed from that folder. Previous versions are uninstalled, the Elastic folder is deleted, and the machine is restarted beforehand.

The OS I have seen this on most commonly is fully patched Server 2016. I have not seen it happen on Windows 10 LTSC or current Pro versions.

Normally when this happens it's a scramble to remove it and restore services, as it will make the machine unresponsive.

This generally happens on Server 2016 with IIS or File Services installed. We do not nest machine services, so there are no shared processes/services or conflicts. 2 vCPU / 8 GB RAM / 44 GB HD is the base we use if it's Windows-only services.

On 7.12.1, Endpoint on the dev servers was sitting at around 1% CPU with 200 MB of memory when idle and 800 MB on disk. From what I've seen this is generally expected behavior for modern AV/HIPS clients (Carbon Defense, FortiEMS, AMP, to name a few). It would spike up to 45% CPU utilization and tap out around 2 GB of RAM; it never went past those. It did cause some problems on tablets.

On 7.13.1, on the same file server, you will see memory usage consume everything it can get its hands on. No policy changes, nothing; no other installs have changed. This looks related to what 7.13 was doing: instead of filling the entire drive with the state logs, it's keeping them in memory.

Thanks, I await the PM. elastic-agent.exe may occasionally launch an instance of elastic-endpoint.exe to do some internal bookkeeping; perhaps that's what you're seeing. If you could make sure to include PID, PPID, and command line as well as the user in the screenshot, and make sure all elastic-agent.exe and elastic-endpoint.exe processes are visible, that would be great. I understand if there are bits you want to black out in the screenshot.
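
As a text alternative to a screenshot (easier to redact), a small sketch along these lines would capture the same fields. It again assumes Python with psutil, and the process names it matches are an assumption:

```python
import psutil

# Process names to include -- an assumption, adjust if other components matter.
NAMES = ("elastic-agent", "elastic-endpoint", "filebeat", "metricbeat")

# Dump PID, PPID, user, and full command line for each matching process.
# Run from an elevated prompt so processes owned by SYSTEM are readable.
for p in psutil.process_iter(["pid", "ppid", "name", "username", "cmdline"]):
    name = (p.info["name"] or "").lower()
    if any(n in name for n in NAMES):
        cmd = " ".join(p.info["cmdline"] or [])
        print(f"pid={p.info['pid']} ppid={p.info['ppid']} "
              f"user={p.info['username']} name={p.info['name']} cmd={cmd}")
```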

Which executable are you referring to here? elastic-endpoint.exe or filebeat.exe? Am I right to presume you're referring to the same memory use/CPU use concern you described in this other post (Endpoint 7.12.x migration to 7.13 Lesson learned with Fleet "On-Prim")?

I've had both push the machine into memory starvation. Filebeat tends to be by far the most common, but Endpoint will join in if you leave the machine long enough, though your users will have caught it long before that point. I left a few devices over the weekend and came in to see Endpoint and Filebeat taking a 50/50 share.

Just to be clear, this is the case when Filebeat is enabled via Fleet > Policies > select policy > Settings > Collect agent logs. It does not happen when the IIS integration is installed directly and running Filebeat; that is the only one I've tried so far.

It seems like you're hitting multiple issues here.
Can you check whether you see 2 or 3 filebeat/metricbeat processes running? If so, this was fixed recently.
Aside from that, we can see what you describe with accumulating memory. This may be related to the issue above or to Investigate cummulating memory when process cannot reach agent · Issue #26242 · elastic/beats · GitHub
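
A quick way to check for the duplicate-process condition, as a rough sketch assuming Python and psutil are available on the box:

```python
from collections import Counter
import psutil

# Count running beat processes to spot the duplicate filebeat/metricbeat case.
counts = Counter(
    (p.info["name"] or "").lower()
    for p in psutil.process_iter(["name"])
    if "beat" in (p.info["name"] or "").lower()
)
for name, n in sorted(counts.items()):
    print(f"{name}: {n} running")
```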

I'll be honest: I stopped using the Agent and went back to legacy beats, even in dev. The number of problems it causes is far too much to keep spending time on. My normal day job is hitting its frantic season, so testing beta features is out the window for a while.

Between the disk space issue that killed the drive on 7.13 and the memory bugs on 7.13.1, it's just not something I have time to mess around with anymore, sorry. Please set a HARD limit on what resources it can use. I know it's in the works, but it's acting closer to malware at the moment. Sorry, but I'm backing out of testing Fleet and Endpoint until 8.3 or later. I have faith in you guys to get the laundry list of bugs sorted out. It has huge promise and is far more secure than a username/password setup.

I just hit the disk-filling issue too: about 2x10^6 events, and my server is sending a crazy amount of logs to ES.

I assume you mean the disk on the computer running Agent, not your Elasticsearch node? Can you share the paths to the files on your computer that are filling up the disk?

For "got about 2x10^6 events" can you share the indices these events are going to?

Is this issue what you are seeing? If so, an upgrade to 7.13.0+ should resolve the issue.

It's the Endpoint state folder that blows up and consumes everything it can, but at that point it no longer sends to ES, so it sounds like a slightly different issue. If you happen to have fast storage, you'll see it happen in minutes.

It happens when the agent receives the unenroll command for the version change. It's almost like it's being told to log locally instead of sending, with no limit. Nothing is removed, which is expected in this situation, as you don't want to remove your defenses even if they don't report. If the agent was offline at the time of the upgrade, it does not happen when it attempts to connect again.
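
For anyone wanting to watch this happen, here's a rough sketch that polls the state folder size once a minute. The path below is an assumption about the default install location, so point it at wherever Endpoint keeps its state on your systems:

```python
import os
import time

# Assumed default Endpoint state location -- adjust for your install.
STATE_DIR = r"C:\Program Files\Elastic\Endpoint\state"

def dir_size_mb(path: str) -> float:
    """Total size of all files under `path`, in MB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for f in files:
            try:
                total += os.path.getsize(os.path.join(root, f))
            except OSError:
                pass  # file may vanish or be locked between listing and stat
    return total / (1024 * 1024)

# Print a timestamped size reading every 60 seconds.
while True:
    print(f"{time.strftime('%H:%M:%S')}  {dir_size_mb(STATE_DIR):.1f} MB")
    time.sleep(60)
```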

@hilt86 On the policy, do you have send agent logs and metrics enabled? I've run into issues with agent logs enabled. After reinstalling 7.13, it ran into a death loop where it logged that it was writing a log in the Endpoint state folder, which kept growing. Not as drastic as what you got hit with, but I only lasted a few days with it. Did you jump directly to 7.13.1?

7.13 is the DISK-swamping issue (upgraded from ES 7.12.1).
7.13.1 is the RAM-swamping issue (upgraded from ES 7.13).

I wasn't able to reproduce either issue on Windows 10. I installed each original version with the default configuration (which includes collecting logs), added the Endpoint Security integration, and then upgraded to the target versions via Fleet. Upgrades worked as expected.

Are there detailed reproduction steps that anyone experiencing this can share so we can see it, diagnose it, and fix it? If it seems to happen randomly, are there any environmental factors that appear to trigger the issues?

Did you go from 7.12.1 to 7.13 or to 7.13.1?

I tried both. When looking into this other, similar issue (Elastic Agent filling up disk space with logs, disaster) I stumbled upon an issue that could have been what caused the disk size issue you saw; we'll put in a mitigation for that.
