We've been happy users of the ELK stack and endpoint security components for quite some time now. During the past few months we've been living with occasional, weird performance issues which seem to be related to the mere presence of the agent on some machines.
This sort of issue appears mostly on some Windows 10 machines; we've observed it on some servers too, but to a much more limited extent.
What we see is SvcHost consuming a huge percentage of CPU time and resources, even though the actual activity we observe on the host experiencing this issue is very low. RAM consumption remains extremely low at all times.
- This does not appear to be related to endpoint security, because if we totally disable the endpoint security integration on the specific endpoint we observe little change (from 90% down to 85%, for example).
- Disabling the individual components of the security suite (ransomware, malware, memory, ...) makes an even less noticeable difference.
- Disabling only events collection from the endpoint security integration makes no difference.
- Entirely removing the endpoint security integration from the policy (yup, for good measure we tried this too) doesn't make a difference.
- If we enable metrics collection, either via the generic `system` integration or via the specialized `windows` integration, we do see some noticeable difference in CPU usage. It still does not account for a sizable amount, but we do see a 20% difference, which is already significant.
- Entirely uninstalling the Elastic Agent, of course, does make a difference and the system load gets back to low thresholds.
- Adding SvcHost and a couple of other system binaries to Trusted Applications does not seem to help either; no changes were observed in the agent's behaviour.
I want to specify that we are always talking about machines which do very little work: the specific machine I am thinking about now barely attempts to check in with its domain controller every now and then, without actually doing anything special. In fact, without the agent the overall system load is extremely low (between 5 and 10%).
I'm kinda lost, unfortunately. Has anyone here experienced similar issues? Is there any meaningful additional information I should share for context?
BTW all of our agents are at version 7.17.0, with ongoing activities to get up to speed with the new 8.* release.
P.S.: I'll add that the Elastic Endpoint component remains functional at all times and that `endpoint-security` metrics do not seem to highlight any particular issue with the agent's behaviour on the client: everything seems to be working fine. Of course, it's not something I developed and I might be wrong.
I haven't seen similar behavior. Do you see the same high CPU usage if you do the following:
- remove endpoint
- remove system integration
- add a simple log integration that tails an empty file?
The above will give us a baseline where nothing is actively being done.
Is the CPU usage coming from elastic-agent or from one of the other binaries run by the agent (Filebeat/Metricbeat)?
Is there any error log in any of the Elastic Agent's logs?
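A rough way to answer that last question is to scan the agent's log files for error-level entries. Here is a minimal Python sketch, assuming the logs are JSON lines with a flat `log.level` field; the function names and the file-name filter are my own illustrative choices, and the log directory you pass in depends on your install.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, List


def find_error_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed log entries whose level is error or worse.

    Assumption: one JSON object per line with a flat "log.level" key.
    Lines that are not valid JSON are skipped, not reported.
    """
    wanted = {"error", "fatal", "panic"}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        if str(entry.get("log.level", "")).lower() in wanted:
            yield entry


def scan_log_dir(log_dir: str) -> List[dict]:
    """Collect error entries from log-like files under log_dir.

    The file-name filter below is a guess at how log files are named;
    loosen it if your files look different.
    """
    found: List[dict] = []
    for path in sorted(Path(log_dir).rglob("*")):
        if path.is_file() and (".log" in path.name or path.suffix == ".ndjson"):
            with path.open(encoding="utf-8", errors="replace") as handle:
                found.extend(find_error_lines(handle))
    return found
```

Calling `scan_log_dir(...)` on the agent's data directory and printing each entry's `message` should be enough to tell whether anything is being logged at error level at all.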
Hey @pierhugues, thanks for reaching out!
We haven't tried doing the actual unenroll, but I can tell you that:
- Metricbeat is seen doing a fair bit of work (roughly 20% CPU), if and only if we enable actual collection of performance metrics, for example from perfmon via the
- Filebeat is not seen anywhere in the top 10 resource consumers.
- If we completely empty the policy, removing every integration, CPU consumption is still extremely high.
- If we disable metrics collection, and therefore Metricbeat, CPU usage drops a little, but not significantly.
There appear to be absolutely no error messages from the agent.
As far as CPU usage is concerned, depending on whom you ask you'll get two different answers:
A) If you ask the Windows Task Manager, the top consumer is Elastic Agent.
B) If you ask sysmon / perfmon, the top consumer is SvcHost.exe with its DnsCache.
I am more inclined to believe the latter.
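One way to double-check answer B is to map each svchost.exe PID to the services it hosts, so the busy PID seen in perfmon can be tied to DnsCache (or not). Here is a minimal Python sketch that parses the output of `tasklist /svc /fo csv`, assuming the usual three-column layout (image name, PID, services); the function names are mine.

```python
import csv
import io
import subprocess
from typing import Dict, List


def parse_tasklist_svc(csv_text: str) -> Dict[int, List[str]]:
    """Map svchost.exe PIDs to the list of services each one hosts."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header row (its labels are localized)
    hosted: Dict[int, List[str]] = {}
    for row in rows:
        if len(row) < 3:
            continue
        image, pid, services = row[0], row[1], row[2]
        if image.lower() != "svchost.exe" or not pid.isdigit():
            continue
        hosted[int(pid)] = [s.strip() for s in services.split(",") if s.strip()]
    return hosted


def svchost_services() -> Dict[int, List[str]]:
    """Run tasklist (Windows only) and parse its CSV output."""
    result = subprocess.run(
        ["tasklist", "/svc", "/fo", "csv"],
        capture_output=True, text=True, check=True,
    )
    return parse_tasklist_svc(result.stdout)
```

If the PID that perfmon flags maps to `Dnscache` only, that supports B; if it hosts several services, the DNS cache may just be sharing a process with the real culprit.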
Should we maybe try and update to 8.2, just to see if anything improves?
We did update some endpoints already, we haven't updated this one yet.
Upgrading would be a good idea, so we can get diagnostics from the machine (this is not available in 7.17).
Thanks for your help and for the suggestion: we've just upgraded and in fact we are observing some improvement, even though the CPU usage is still pretty high.
To sum it up:
SvcHost.exe at ~45%
SearchApp.exe at ~18%
Metricbeat at ~14%
Filebeat at ~13%
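Per-process figures like these are only comparable if they are on the same scale: the perfmon counter `\Process(*)\% Processor Time` is expressed per logical CPU and can exceed 100% on multi-core machines, while Task Manager normalizes to the whole machine. Here is a minimal Python sketch that ranks processes from the CSV produced by `typeperf "\Process(*)\% Processor Time" -sc 1`, assuming English counter names and dot decimal separators; `top_processes` is a name I made up.

```python
import csv
import io
import re
from typing import Dict, List, Tuple

# Matches the instance name inside a counter path such as
# \\HOST\Process(svchost#3)\% Processor Time
_INSTANCE = re.compile(r"\\Process\((?P<name>[^)]*)\)\\", re.IGNORECASE)


def top_processes(typeperf_csv: str, logical_cpus: int,
                  limit: int = 5) -> List[Tuple[str, float]]:
    """Rank processes by CPU share of the whole machine.

    Instances of the same image (svchost, svchost#1, ...) are summed,
    and each value is divided by the number of logical CPUs.
    """
    rows = [r for r in csv.reader(io.StringIO(typeperf_csv)) if r]
    header = next((r for r in rows if r[0].startswith("(PDH-CSV")), None)
    if header is None:
        return []
    # Keep only data rows; typeperf also prints status messages.
    samples = [r for r in rows if r is not header and len(r) == len(header)]
    if not samples:
        return []
    totals: Dict[str, float] = {}
    for column, raw in zip(header[1:], samples[-1][1:]):
        match = _INSTANCE.search(column)
        if not match:
            continue
        name = match.group("name").split("#")[0]
        if name.lower() in {"_total", "idle"}:
            continue
        try:
            value = float(raw)
        except ValueError:
            continue  # blank or locale-formatted value
        totals[name] = totals.get(name, 0.0) + value / logical_cpus
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

Pass `os.cpu_count()` as `logical_cpus` when running this on the affected machine itself.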
What puzzles me is the insane amount of work which seems to be done by SvcHost. That cannot be right, as the system is barely doing any DNS queries.
I can't seem to find any way to get diagnostics. I'm probably blind, but I can't find anything relevant in the official docs. Would you mind giving me a couple of directions to look into?
EDIT: Nevermind, I found the relevant docs regarding the `diagnostics` command. I'm afraid it won't be easy to do, as this endpoint is not anywhere close to us right now.
Is there any specific information which I should be looking for, or which I should share with you, in case you need to specifically investigate this issue?
Thanks for your help till now!
I would like to see the log files and check whether there is anything there; that would be my first step. I am not sure yet what would cause such a high level of CPU usage for svchost.
Hi @popeio, svchost could mask a myriad of issues. Since the problem also appears when the Elastic Agent is effectively not running any integrations (including monitoring?), the things I can think of are some group policy configuration or, if DNSSEC validation is enabled, that it could be causing this when the Elastic Agent checks in with Fleet. Can you review those again (also whether you are using DNSSEC)? It would also be interesting to unenroll the Elastic Agent, keep it running, and let us know if anything changes with the CPU usage.
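To get a first feel for whether name resolution for the Fleet host is unusually slow on the affected machine, repeated lookups can be timed. Here is a minimal Python sketch using the standard `socket.getaddrinfo`; the function name and the host in the usage note are placeholders. It measures latency only, so it cannot by itself confirm or rule out DNSSEC validation as the cause of the CPU load.

```python
import socket
import time
from typing import Callable, List


def time_lookups(host: str, attempts: int = 5,
                 resolver: Callable = socket.getaddrinfo) -> List[float]:
    """Return the wall-clock seconds taken by each lookup of host.

    Failed lookups are timed too, since a slow failure is also a signal.
    The resolver argument exists only so the function can be tested
    without network access.
    """
    durations: List[float] = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolver(host, 443)
        except OSError:
            pass  # socket.gaierror is a subclass of OSError
        durations.append(time.perf_counter() - start)
    return durations
```

For example, `time_lookups("fleet.example.com")` (a hypothetical host) returns one duration per attempt. Later attempts are normally served from the local DNS cache, so a large gap between the first and the following values is expected; consistently slow lookups would be worth reporting.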
Hi @MarianaD, we actually sort of "tried that by mistake", because we unenrolled the agent from the SIEM first, before actually uninstalling it. During that time, no apparent difference was reported by the end user on the PC (but I'd be a bit cautious about this info: it's one thing when we are monitoring it and (in)directly looking at `perfmon`, it's another thing when it's an external user, albeit a rather technically capable one, making the report).
As for your observation about DNSSEC, heck: that's a valid concern! We do have DNSSEC active on the domain name we use with our Elastic SIEM.
If I manage to obtain a `diagnostics` from the agent as @pierhugues suggested a few days ago, may I drop the output in here, or do you have some other way of sharing this sort of information? I'll review and redact sensitive details before sharing, of course.
Thanks for the support!
Hey @popeio, I had something similar when I was working with one of our IIS servers. After working with the Elastic engineers, we were advised to add `C:\Windows\System32\inetsrv\w3wp.exe` as a trusted application.