Issues with Elastic Agent and Defender?

After an upgrade to 7.12, a few machines experience high CPU load from Elastic endpoint security.

The load seems to be periodic:


The three processes/services that seem to cause this load are Elastic Endpoint, MS Defender, and "System".

I've attempted to record the behavior with Windows Performance Analyzer. I can provide an .etl file, but only to Elastic Support: I'm not comfortable uploading it publicly, as I'm unsure what those files contain. I'm not familiar with the tool, but at first glance it seems like Microsoft Defender and Elastic Endpoint are in some kind of deadlock over files like %ProgramFiles%\Elastic\Endpoint\state\documents-xxxx.log.

The issue did not occur in 7.11. We're not using the new ransomware protection. Elastic Endpoint is not registered as an antivirus solution in Windows (I'd like to avoid that and keep MS Defender running alongside Elastic Endpoint).


Hello @nemhods. Thanks for reaching out.

Running multiple security products simultaneously can create conflicts where the products repeatedly intercept each other's attempts to scan files on the system. This can create feedback loops that spike CPU and I/O. One way to break such loops is to have the security products ignore each other. Try adding C:\Program Files\Elastic\elastic-endpoint.exe as a Defender exclusion. Then, in your Elastic Security app, add Defender (likely C:\Program Files\Windows Defender\MsMpEng.exe) as a Trusted Application.

Hopefully that helps.
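
For anyone who wants to script the Defender half of the suggestion above, here is a minimal sketch in Python. It only builds and prints the PowerShell invocation and does not run it. `Add-MpPreference -ExclusionProcess` is Defender's cmdlet for process exclusions; the helper name `defender_exclusion_command` is made up for illustration, and the path is simply the one quoted in this reply, so verify it on your own host. Running the printed command requires an elevated prompt.

```python
def defender_exclusion_command(process_path):
    """Build (but do not run) a PowerShell call that excludes a process from Defender."""
    # Double any single quotes so the path is safe inside a
    # PowerShell single-quoted string.
    quoted = "'" + process_path.replace("'", "''") + "'"
    return [
        "powershell.exe",
        "-NoProfile",
        "-Command",
        "Add-MpPreference -ExclusionProcess " + quoted,
    ]


if __name__ == "__main__":
    # Path as quoted in the reply above (an assumption; check your install).
    cmd = defender_exclusion_command(
        r"C:\Program Files\Elastic\elastic-endpoint.exe"
    )
    print(" ".join(cmd))
```

The Trusted Application half is configured in the Elastic Security app, so it is not covered by this sketch.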

Thanks for the suggestion. I dug a bit further and disabled MS Defender, but the problem persists.
I've discovered that whenever the high-CPU activity by elastic-endpoint ends, it sends a burst of data, which leads me to believe that this is some kind of data collection job by elastic-endpoint.

So I tried to find out what elastic-endpoint is actually doing while the CPU usage is high. Procmon shows that even without MS Defender in the background, the service is still doing heavy file I/O on C:\Program Files\Elastic\Endpoint\state\documents*.log.

Do you happen to know what these files are for? Is it supposed to do all this activity every few seconds?

BTW this happens on Win10 20H2 as well as Server 2019 1809.

Hi @nemhods

Those files are temporary logs Endpoint uses to cache documents before writing them to Elasticsearch. In 7.12 we shortened the interval at which Endpoint writes documents from 2 minutes to 30 seconds. It's possible this has put additional strain on your host, causing the higher CPU use.

It would be useful if you could override this delay for Endpoint to see whether setting it back to 2 minutes solves your issue. To do that, go to the Endpoint policy for the host in question and click "Show advanced settings" at the bottom of the page. Next, scroll down and set windows.advanced.elasticsearch.delay to 120 (seconds), then save the policy so it reapplies.

If that fixes your issue we'll find a better solution in Endpoint so you don't have to leave that override in place long term. Please be aware that setting it to 120 manually can cause occasional race conditions where Detection Engine alerts don't fire.

Hey @ferullo,
that seems to have changed something. I applied the policy shortly after 20:00.

You can see that the load spikes now appear in a different rhythm, although I couldn't say whether it has actually gotten better. The interval setting of the visualization may also be distorting the picture.

Is there a way I could properly profile the elastic agent? Or will I have to do some trial & error by tweaking the policy to get to the bottom of this?
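
Short of a real profiler, one way to take the visualization out of the picture is to measure the spike rhythm directly from raw CPU samples. The following is a minimal sketch under a few assumptions: the samples are `(timestamp_seconds, cpu_percent)` pairs collected by some other means, the 50% threshold is arbitrary, and the function names `spike_starts` and `spike_period` are made up for illustration. If the delay override took effect, the median gap between spike starts should move from roughly 30 to roughly 120.

```python
from statistics import median


def spike_starts(samples, threshold=50.0):
    """Return the timestamps at which CPU rises from below to at/above threshold."""
    starts = []
    above = False
    for ts, cpu in samples:
        if cpu >= threshold and not above:
            starts.append(ts)
        above = cpu >= threshold
    return starts


def spike_period(samples, threshold=50.0):
    """Median gap in seconds between consecutive spike starts, or None if < 2 spikes."""
    starts = spike_starts(samples, threshold)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return median(gaps) if gaps else None


if __name__ == "__main__":
    # Synthetic data: one sample every 5 s, a 10 s spike every 30 s.
    demo = [(t, 90.0 if t % 30 < 10 else 5.0) for t in range(0, 300, 5)]
    print(spike_period(demo))
```

This only describes the rhythm of the load; it says nothing about what Endpoint is doing during a spike.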

Thanks for testing that change. Feel free to roll it back since it didn't solve the issue.

We've noticed that documents-*.log caches whose documents have all been written to Elasticsearch aren't being removed right away. They are eventually removed when a maximum disk space cap is reached, but they ought to be removed sooner. Because they aren't removed, Endpoint keeps needlessly re-scanning them for documents that still need to be written to Elasticsearch.

Assuming that is causing the issue you're seeing, an easy way to resolve it for now is to remove the Endpoint Security integration from the host, then re-add it. When the Endpoint Security integration is removed, Endpoint will be uninstalled from the host and those files will be cleaned up.

We'll address the need to prune documents-*.log files sooner in an upcoming release.
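
To check whether documents-*.log caches are piling up on a host, a rough sketch like the one below can count them and total their size. The state directory is the path reported earlier in this thread; the function name `summarize_document_caches` is made up for illustration. Reading that directory may require administrator rights, and a large total only hints at the problem described above rather than proving it.

```python
from pathlib import Path


def summarize_document_caches(state_dir):
    """Count documents-*.log files in state_dir and report their sizes in bytes."""
    files = sorted(Path(state_dir).glob("documents-*.log"))
    sizes = [f.stat().st_size for f in files]
    return {
        "count": len(files),
        "total_bytes": sum(sizes),
        "largest_bytes": max(sizes, default=0),
    }


if __name__ == "__main__":
    # Directory as reported in this thread (an assumption for other installs).
    print(summarize_document_caches(r"C:\Program Files\Elastic\Endpoint\state"))
```

Running it before and after removing and re-adding the integration should show whether the old caches were actually cleaned up.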


Your last suggestion seems to fix the issue, thanks! I waited a fair few minutes before re-adding the integration to make sure it actually gets uninstalled.
