Endpoint 7.12.x migration to 7.13: lessons learned with Fleet "On-Prem"

EDIT: Use caution with 7.13 on laptops/tablets. Normal runtime on my test machines is 7 hours. With Endpoint 7.13 it's down to 1.5 hours due to excessive CPU utilization. This is on both AMD and Intel platforms with current Win 10 patches.

This is purely from my own point of view from using a larger test fleet. This may not apply to you. This only applies if you use Fleet at all.

Original dev setup: 7.12.2 with Endpoint deployed to 40 test machines. Endpoint is in detection mode only. I will move to a much larger test fleet in the coming week.

The migration went silky smooth from 7.12.2 to 7.13. That makes 2 migrations in a row that didn't fail! Thank you devs for listening to the cries of the poor admins who already do way too much. A very nice touch was added: Kibana is no longer the constant source of failures, because it now starts up and finishes the migrations in the background. Please do not ever remove this!

7.13 is a rip and replace change. The only thing that will survive the update is the policy names. All agents will be disconnected. You will see a warning notice and a link to the changes (very nice). What the notes do call out is that the agents will have been sent the unenroll command. This FAILS and leaves you in a rather nasty spot. Nothing is removed. Now you have agents that are disconnected and buffering logs to send to a server that is actively rejecting them. Disk space = gone on the agent side. This really only applies to machines with smaller drives, think point of sale terminals or very low utilization servers. To counter this, unenroll all agents PRIOR to upgrading! Please note this still leaves the Elastic Agent installed and running on the machine. You will have to script its removal.
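For anyone who wants to script the "unenroll everything first" step, here is a minimal sketch in Python. It only builds the HTTP request and does not send it. The `/api/fleet/agents/<id>/unenroll` path and the `kbn-xsrf` header are my assumptions about the Kibana Fleet API, so check them against the API docs for your exact version. The function name, the example URL, and the agent ID are all hypothetical.

```python
import json
import urllib.request


def build_unenroll_request(kibana_url, agent_id, auth_header):
    """Build (but do not send) a POST request asking Fleet to unenroll one agent.

    Assumptions to verify for your version: the endpoint path and the
    kbn-xsrf header. auth_header is whatever Authorization value your
    Kibana accepts (basic auth, API key, ...).
    """
    url = f"{kibana_url.rstrip('/')}/api/fleet/agents/{agent_id}/unenroll"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # empty JSON body
        method="POST",
        headers={
            "Content-Type": "application/json",
            "kbn-xsrf": "true",
            "Authorization": auth_header,
        },
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    req = build_unenroll_request(
        "https://kibana.example:5601", "agent-123", "ApiKey abc"
    )
    print(req.get_method(), req.full_url)
```

To actually send it you would pass each request to `urllib.request.urlopen` in a loop over your agent IDs, ideally against a test agent first. This only covers the unenroll; removing the Elastic Agent from the machine still has to be scripted separately.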

With the addition of the new fleet service proxy you will need another VM/container or two at the least in order to see the benefits. You will not need 1 per policy. I only glanced at the notes and didn't read them in detail, but that never stood out to me. You can start to see the load balancing of Endpoint, which has been sorely needed for some time now.

Another annoyance: if you have any data from 7.10.x agents up to 7.12.2 agents still in your indices, you can pretty much count on the SIEM part not showing events. This is partly due to changes in the ECS version. The alerts will not be in error, but you will clearly see nothing appearing. Remove the old index if you don't need the data, or wait for ILM to remove it, and you'll start seeing events.

On the plus side, the ILM policies were not reset to default this time, so you won't come back and find your drives are full. Always fun to see 20 TB disappear for logs on only a few machines in a matter of days.

Access Denied when registering agents is far more common in 7.13 than it has been in the past, making scripted deployments more difficult.

Overall, so far this has been well worth the upgrade! 7.12.x was a train wreck. 7.13 seems to have corrected over a dozen issues that were starting to pop up. Only time will tell, but it's worth the update and the extra legwork. As for event detection, I haven't tested anything yet, as the release version has only been out for 24 hours.

Thank you Elastic devs, you guys are awesome!

Hi @PublicName,

I work on the Endpoint team and I just wanted to take a moment to thank you for the detailed feedback as you test the product.

-nf

Adding to the original with a few more tiny things I've run into that might help someone else.

7.13 cluster performance is drastically improved. Back to a normal state.

Again, the MAJOR warning: unenroll PRIOR to upgrading or you will have a denial of service on your hands. The upgrade does not unenroll the agents like it says. It tells the agent to log locally, and the logs will fill your drives. 40 test servers offline with the C drive full. This did not appear to affect Linux devices. I cannot speak for Macs.

Unenrolling in 7.13 does not unenroll the agent correctly. The status will stay in the Updating state and the client will remove everything except the agent. From what I've seen it will stay this way indefinitely. This may be tied to Filebeat continuing to run after Endpoint is removed.

Attempting to install with PowerShell via GPO fails on Win 10 LTSC. The installer runs correctly, but when it's time to copy "agent" to the correct location it never happens, so the install fails. The same script works on 7.12. No idea what causes this, but an easy, dirty fix is to copy the Agent to the Endpoint folder, and then it works as expected. I'm also not a guru scripter, so take that as you see fit.

Nice touch on the Threat Intel tab. The guides don't really say how to fully configure it past the feeds. Guess some tinkering is in order.

Would be really nice to see hard limits set for CPU/memory, and now disk utilization capped at a max percentage of free space, let's say 50%/1025MB/2%. Options to override would be helpful as well. At no point should your endpoint security device be the thing that takes you offline. That's rather ransomware-like...
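Until limits like that exist, a scheduled check on each machine could at least flag a drive before it fills up. A minimal sketch in Python, assuming a 10% free-space floor (that threshold and the function names are hypothetical, pick whatever suits your drives):

```python
import shutil


def free_fraction(path):
    """Return the fraction of the volume holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_disk_low(free_bytes, total_bytes, min_free_fraction=0.10):
    """True when free space has dropped below the chosen floor.

    min_free_fraction=0.10 is an assumed example value, not a recommendation.
    """
    return (free_bytes / total_bytes) < min_free_fraction


if __name__ == "__main__":
    usage = shutil.disk_usage(".")
    if is_disk_low(usage.free, usage.total):
        print("WARNING: low disk space on this volume")
    else:
        print(f"OK: {free_fraction('.'):.0%} free")
```

Run from a scheduled task, this gives a warning while there is still room to stop the agent or clear its local logs. It does not cap anything by itself.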

Well this is back again. Enable agent logs on a policy 7.13.1.

Hi @PublicName

We are playing around with that type of feature, although I'm not sure when it will be ready for release.

In the meantime, is elastic-endpoint.exe using more CPU than you'd like? If so we'd be happy to dig into it a bit. One common cause of high CPU is two antivirus products monitoring each other in an endless loop, although other applications can do things as well that put more stress on Endpoint than you'd like. Adding a Trusted Application in Security -> Administration often resolves that type of issue.

If the issue is with elastic-endpoint.exe there are two ways you can find what is causing it to use a lot of CPU. One is to look at the latest data_stream.dataset : endpoint.metrics document (found in a metrics-* data stream index) for the misbehaving Endpoint. In 7.13 we added Endpoint.metrics.system_impact details to this document, which is a list of programs on the computer that are causing Endpoint to do a lot of work. The week_ms value in each entry is the number of milliseconds spent over the last week; the higher the value, the more likely this is the cause of high CPU use for elastic-endpoint.exe.
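Once you have that metrics document exported as JSON, ranking the entries by week_ms is a few lines of Python. This is only a sketch: the sample document is made up, and the exact layout of each system_impact entry (where week_ms sits, the process.executable field) is an assumption, so adjust the keys to what your real document looks like.

```python
def top_system_impact(doc, limit=5):
    """Return the system_impact entries with the highest week_ms, largest first.

    Assumes each entry carries a top-level "week_ms" number; entries
    without one are treated as 0.
    """
    entries = (
        doc.get("Endpoint", {}).get("metrics", {}).get("system_impact", [])
    )
    ranked = sorted(entries, key=lambda e: e.get("week_ms", 0), reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    # Hypothetical sample document, for illustration only.
    sample = {
        "Endpoint": {
            "metrics": {
                "system_impact": [
                    {"process": {"executable": "C:\\apps\\a.exe"}, "week_ms": 1200},
                    {"process": {"executable": "C:\\apps\\b.exe"}, "week_ms": 98000},
                    {"process": {"executable": "C:\\apps\\c.exe"}, "week_ms": 40},
                ]
            }
        }
    }
    for entry in top_system_impact(sample, limit=2):
        print(entry["week_ms"], entry["process"]["executable"])
```

The programs at the top of the list are the first candidates to investigate.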

Another option is to follow the guidance here (Endpoint agent consistent 90+% CPU for some PCs - #13 by Matt_Scherer) which outlines a way to create a Lens visualization to see what programs are causing Endpoint to produce the most data, which is likely to correspond with what is causing high Endpoint CPU.

Regardless of which route you take, it's important not to create a Trusted Application for something like svchost.exe, which would create a large security blind spot in your network.

And you hit the nail on the head. It's not wise to whitelist the process that causes the problems. Previously it was TIWorker, which you guys already addressed. Thx Microsoft for shoving everything into one process. For sanity I whitelist Windows Defender, all Fortinet products, and several applications. All of which the update wiped out and reset to a blank slate.

I don't collect the agent metrics/logs because it is rather CPU inefficient on the cluster at the moment. So I'll go back a few versions to explain what made me stop the collection. Prior to Fleet/Endpoint I was using Metricbeat on ~500 devices grabbing all disk status, process, CPU, network, and uptime with a 60 sec polling time. That was more than enough to get what we needed. All said and done, I kept the data for 2 weeks, which ended up being 1.2 TB, and that came with very tiny delays (10 ms), which I was able to contend with by changing a few settings. I was able to get away with this on 4 CPUs per Elastic node and had no performance issues at all. Searches were instant. Fast forward to the 7.12 branch: I added 4 CPUs per node and kicked the memory up to the max for the license, and I'm having endless delays and failures in searches due to the number of indices that have tasks assigned to them. I even went to the point of asking support, which confirmed that the cluster was too heavily used, and with no rollback options... Endpoint sends a lot to Elastic, which is good for detection but bad on the wallet.
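To put the Metricbeat baseline in per-device terms, here is the back-of-the-envelope arithmetic as a small Python sketch. It assumes the 1.2 TB is decimal terabytes and covers the full 2 weeks for all ~500 devices; the function names are just for illustration.

```python
def bytes_per_device_per_day(total_bytes, devices, retention_days):
    """Average stored bytes per device per day over the retention window."""
    return total_bytes / devices / retention_days


def bytes_per_poll(per_device_per_day, poll_seconds):
    """Average stored bytes per polling interval for one device."""
    polls_per_day = 24 * 60 * 60 / poll_seconds
    return per_device_per_day / polls_per_day


if __name__ == "__main__":
    # Figures from the post: 1.2 TB, ~500 devices, 2 weeks, 60 sec polling.
    per_day = bytes_per_device_per_day(1.2e12, 500, 14)
    per_poll = bytes_per_poll(per_day, 60)
    print(f"{per_day / 1e6:.0f} MB per device per day")  # about 171 MB
    print(f"{per_poll / 1e3:.0f} KB per device per poll")  # about 119 KB
```

Running the same numbers for an Endpoint test fleet gives a rough way to compare the two data volumes before scaling up.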

I've had to stall 7.13.x testing as it burned me with the upgrade process sucking 100% drive space on 40 endpoints. The issued auto unenroll was an unwise design call. Well at least it's Beta.

Sorry to say for the time being I'm back to the classic beats for collections even for dev.

Do you mean you created Alert Exceptions for them or Trusted Applications? Are those entries back in place or after the upgrade wiped them out did you not reenter them?

Trusted application.

When I checked both alert exceptions and all trusted were removed. Which isn't a huge deal as the list was really short but I can see when this scales to thousands of endpoints that it would be a serious problem.