We've been happy users of the ELK stack and endpoint security components for quite some time now. During the past few months we've been living with occasional, weird performance issues which seem to be related to the mere presence of the agent on some machines.
This sort of issue appears only on some of our Windows 10 machines; we've also observed it on some servers, but to a much more limited extent.
What we see is SvcHost consuming a huge percentage of CPU time and resources, even though the actual activity we observe on the host experiencing this issue is very low. RAM consumption remains extremely low at all times.
- This does not appear to be related to endpoint security, because if we totally disable the endpoint security integration on the specific endpoint we observe little change (from 90% down to 85%, for example).
- Disabling the individual components of the security suite (ransomware, malware, memory, ...) makes an even less noticeable difference.
- Disabling only events collection from the endpoint security integration makes no difference.
- Entirely removing the endpoint security integration from the policy (yup, for good measure we tried this too) doesn't make a difference.
- If we enable metrics collection, either via the generic system integration or via the specialized windows integration, we do see some noticeable difference in CPU usage. It still does not account for a sizable amount, but we do see a 20% difference, which is already significant.
- Entirely uninstalling the Elastic Agent, of course, does make a difference and the system load gets back to low thresholds.
- Adding SvcHost and a couple of other system binaries to Trusted Applications does not seem to help either: no changes were observed in the agent's behaviour.
I want to point out that we are always talking about machines which do very little work: the specific machine I have in mind right now barely attempts to check in with its domain controller every now and then, without actually doing anything special. In fact, without the agent the overall system load is extremely low (between 5 and 10%).
I'm kinda lost, unfortunately. Has anyone in here experienced similar issues? Is there any meaningful additional information I should share for additional context?
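One piece of context that might be worth collecting is which Windows services are hosted inside each SvcHost instance, so the busy PID can be matched to a concrete service. Below is a minimal sketch of how that could be gathered, not something taken from our setup: it assumes Python is available on the host and relies on the built-in `tasklist /svc /fo csv` command. The function names `parse_tasklist_svc` and `svchost_services` are made up for illustration.

```python
import csv
import io
import subprocess


def parse_tasklist_svc(csv_text):
    """Map each svchost.exe PID to the services it hosts.

    Expects the CSV text produced by `tasklist /svc /fo csv`
    (columns: image name, PID, comma-separated services).
    """
    services_by_pid = {}
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row (its labels may be localized)
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        image, pid, services = row[0], row[1], row[2]
        if image.lower() != "svchost.exe":
            continue
        services_by_pid[int(pid)] = [
            name.strip() for name in services.split(",") if name.strip()
        ]
    return services_by_pid


def svchost_services():
    """Run tasklist on the local Windows host and parse its output."""
    result = subprocess.run(
        ["tasklist", "/svc", "/fo", "csv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tasklist_svc(result.stdout)
```

Calling `svchost_services()` on an affected machine returns a dictionary keyed by PID; looking up the PID of the SvcHost instance that Task Manager shows as busy would then tell which services live in it.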
BTW all of our agents are at version 7.17.0, with ongoing activities to get up to speed with the new 8.* release.
P.S.: I should add that the Elastic Endpoint component remains functional at all times and that endpoint-security metrics do not seem to highlight any particular issue with the agent's behaviour on the client: everything seems to be working fine. Of course, it's not something I developed, so I might be wrong.