Endpoint 7.13 migration to 7.13.1: Lesson learned with Fleet “On-Prem” - Bad

Failed. The Kibana migration from 7.13.0 to 7.13.1 failed.

Test setup: single node, Fleet policies with Elastic-Endpoint only, and agent log collection enabled in the policy settings.

When the Elastic instance went offline for updates, Filebeat on every agent started buffering in memory again. All machines became unresponsive within 10 minutes of each other, which forced a hard crash of every test machine.

This is now the 2nd critical failure that has taken client devices offline, with a different root cause but the same result. The upgrade to 7.13 from the previous 7.12.1 caused the drives to fill up; now it's back to memory starvation.

If I run into more I'll add on to this post.

7.13 offline agent issues. Not caused by Endpoint but by Agent.

The duplicate Filebeat process is back. It causes the read/write activity to continue endlessly, as one instance writes updates and the other reads them.

It also makes me wonder why Elastic-Endpoint runs as the user and not as SYSTEM. If you log out, the machine is no longer protected, and it is common for servers to never have anyone logged in. This just showed up in 7.13.

elastic-endpoint.exe is launched via a service as the SYSTEM user. Do you mean elastic-agent.exe or beats are running as a user? How did you install? They too should be started as a service and restart on reboot.

Can you share details about running processes, CPU/memory usage before, during, and after the upgrade?

Are these physical test machines or VMs? How big are those?

Nope, I do mean as the user. It doesn't happen all the time; I'll PM you a screenshot of the next one I see. Due to usernames it won't be public. It's one that makes you scratch your head. It's interesting to note that you will end up with 2 instances of Endpoint running. This does not end well, for obvious reasons.

Install is done from the zip file taken directly from the downloads page. It is then unzipped and installed from that folder. Previous versions are uninstalled, the Elastic folder is deleted, and the machine is restarted beforehand.

The OS where I have seen this most often is Server 2016, fully patched. I have not seen it happen on Windows 10 LTSC or current Pro versions.

Normally when this happens it's a scramble to remove it to restore services as it will make the machine unresponsive.

This generally happens on Server 2016 with IIS or File Services installed. We do not nest machine services, so there are no shared processes/services or conflicts. 2 vCPU / 8 GB RAM / 44 GB HD is the base we use for Windows-only services.

On the dev servers, Endpoint 7.12.1 sat on average at 1% CPU with 200 MB of memory when idle and 800 MB on disk. From what I've seen this is generally expected behavior from modern AV/HIPS clients (Carbon Defense, FortiEMS, AMP, to name a few). It would spike up to 45% CPU utilization and top out around 2 GB of RAM. It never went past those. It did cause some problems on tablets.

With 7.13.1 on the same file server, memory usage grows until it has consumed everything it can get its hands on. No policy changes, nothing. No other installs have changed. This looks related to what 7.13 was doing: instead of filling the entire drive with the STATE logs, it's keeping them in memory.
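For anyone who wants to compare numbers rather than eyeball Task Manager, here is a minimal Python sketch of the check described above. It assumes you already have periodic memory samples (in MB) for one process, collected however you like; the function names and the 2048 MB threshold are illustrative only, taken from the roughly 2 GB ceiling reported for 7.12.1 in this thread, and are not anything official from Elastic.

```python
# Ceiling observed for 7.12.1 in this thread (about 2 GB); an assumption, not a documented limit.
BASELINE_PEAK_MB = 2048


def exceeds_baseline(samples_mb, peak_mb=BASELINE_PEAK_MB):
    """Return True if any memory sample is above the old observed ceiling."""
    return any(sample > peak_mb for sample in samples_mb)


def grows_steadily(samples_mb, min_samples=4):
    """Return True if memory never drops between samples and ends higher than it started.

    With fewer than min_samples readings the answer is always False,
    since a couple of points say little about a trend.
    """
    if len(samples_mb) < min_samples:
        return False
    never_drops = all(b >= a for a, b in zip(samples_mb, samples_mb[1:]))
    return never_drops and samples_mb[-1] > samples_mb[0]
```

A process that stays under the old ceiling and goes up and down would pass both checks; one that climbs on every sample, like the behavior described for 7.13.1, would be flagged by `grows_steadily` well before the machine becomes unresponsive.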

Thanks, I await the PM. elastic-agent.exe may occasionally launch an instance of elastic-endpoint.exe to do some internal bookkeeping. Perhaps that's what you're seeing. If you could make sure to include PID, PPID, and command line as well as the user in the screenshot, and make sure all elastic-agent.exe and elastic-endpoint.exe are visible that would be great. I understand if there are bits you want to black out in the screenshot.
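If a screenshot is awkward, a text export works too. Below is a rough Python sketch that filters a process list down to the executables discussed in this thread and reports PID, PPID, user, and command line. It assumes a CSV export with the columns `Name`, `ProcessId`, `ParentProcessId`, `User`, and `CommandLine`; that column layout is hypothetical (the first four process fields match `Win32_Process` property names, while the user column would have to be added separately), so adjust it to whatever your export actually contains.

```python
import csv
import io

# Executable names mentioned in this thread.
WATCHED = {"elastic-agent.exe", "elastic-endpoint.exe", "filebeat.exe"}

# Account names treated as SYSTEM; an assumption about how the export spells them.
SYSTEM_NAMES = {"SYSTEM", "NT AUTHORITY\\SYSTEM"}


def summarize(csv_text):
    """Return one dict per watched process found in the CSV text.

    Assumed columns (hypothetical export format):
    Name, ProcessId, ParentProcessId, User, CommandLine.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Name"].lower() not in WATCHED:
            continue
        rows.append({
            "name": row["Name"],
            "pid": int(row["ProcessId"]),
            "ppid": int(row["ParentProcessId"]),
            "user": row["User"],
            "cmdline": row["CommandLine"],
            "is_system": row["User"].upper() in SYSTEM_NAMES,
        })
    return rows


def duplicates(rows):
    """Return the watched names that appear more than once, sorted."""
    counts = {}
    for r in rows:
        key = r["name"].lower()
        counts[key] = counts.get(key, 0) + 1
    return sorted(name for name, n in counts.items() if n > 1)
```

Rows where `is_system` is false, or any name returned by `duplicates`, are the ones worth including (with usernames blacked out as needed).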

Which executable are you referring to here? elastic-endpoint.exe or filebeat.exe? Am I right to presume you're referring to the same memory use/CPU use concern you described in this other post (Endpoint 7.12.x migration to 7.13 Lesson learned with Fleet "On-Prim")?

I've had both push the machine into memory starvation. Filebeat is by far the most common, but Endpoint will join in if you leave the machine long enough, though your users will have caught you long before that point. I left a few devices over the weekend and came in to see Endpoint and Filebeat each taking a 50/50 share.

Just to be clear, this is the case where Filebeat is enabled via Fleet, Policies, select policy, Settings, Collect agent logs. This does not happen when the IIS integration is installed directly and running Filebeat; that is the only one I've tried so far.