We're seeing extreme latency on 2022 domain controllers when Elastic Defend 8.6.2 Malicious Behavior rules are enabled. The server becomes very sluggish, but performance metrics don't show any sign of excess load: low CPU, low memory usage, low/normal network load, and low disk activity. This seems to affect all our DCs, but severity depends on how busy each server is. The most active will jump from a ping latency of 0.5 ms to 1000+ ms in short order.
Since we don't see any perfmon counters that indicate an issue, all we have to go on is the server's responsiveness. Using ICMP uptime monitors, we can quickly see when the problem starts and stops. The issue is extra hard to pin down because a low-load DC may not show it at all most of the time, while a high-load one may be fine for a few minutes before it becomes erratic.
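For anyone who wants to script that check, here is a minimal sketch of how start and stop times could be pulled out of ping results. It assumes the samples were already collected as (timestamp, latency) pairs, with None standing for a timed-out ping. The function name, the 100 ms threshold, and the data shape are illustrative assumptions, not part of any monitoring product.

```python
from typing import List, Optional, Tuple

# (unix timestamp, latency in ms); None means the ping timed out.
Sample = Tuple[float, Optional[float]]


def find_slow_windows(
    samples: List[Sample], threshold_ms: float = 100.0
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps for each run of slow or lost pings.

    The threshold is an assumed value; pick one well above the normal
    sub-millisecond latency and well below the 1000+ ms spikes.
    """
    windows: List[Tuple[float, float]] = []
    start: Optional[float] = None
    last_slow: Optional[float] = None
    for ts, latency in samples:
        slow = latency is None or latency >= threshold_ms
        if slow:
            if start is None:
                start = ts  # first slow sample opens a window
            last_slow = ts
        elif start is not None:
            # first healthy sample closes the open window
            windows.append((start, last_slow))
            start = None
    if start is not None:
        # data ended while the server was still slow
        windows.append((start, last_slow))
    return windows
```

Lining those windows up against the times the policy was changed should make it easier to tell a real effect from normal variation on a quiet DC.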
Any thoughts on possible causes or methods to narrow this down further would be greatly appreciated.
Hi @Kelly_Slavens. The Discuss forum software normally emails me when Endpoint Security issues are posted. I'm sorry, I don't know why I didn't get one for this issue. Please feel free to hop into the #endpoint-security room in our community Slack. You can usually get a response there pretty quickly during normal business hours.
First, thank you for isolating the performance issue to a specific policy toggle. That's a huge time-saver.
Would you mind sending us a copy of your diagnostics so we can see what's set in policy besides that behavioral protection checkbox? I created a secure upload link here specific to your case. You can collect diagnostics like this:
C:\Temp>"C:\Program Files\Elastic\Agent\elastic-agent.exe" diagnostics
Created diagnostics archive "elastic-agent-diagnostics-2023-04-24T16-16-03Z-00.zip"
The malicious behavior protection system requires a variety of event types to function. Even if those events aren't configured to stream to Elasticsearch, the Endpoint still needs to collect and enrich them to make them available to the behavioral rules engine. Event sources can be forcefully disabled using advanced policy options. While these advanced switches are useful for troubleshooting, they may not be an ideal long-term solution, because some data sources are used for features besides events and behavioral protection. You can try setting these and seeing if performance improves, either one by one or with a binary search (divide and conquer), like git bisect does.
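To make the binary search concrete, here is a rough sketch of the bookkeeping. It assumes a single event source is responsible for the slowdown, which is the same assumption git bisect makes about a single bad commit. The source names are placeholders, not real advanced policy option names, and still_slow_when_disabled is a hypothetical callback: in practice it means "disable these sources in policy, wait, watch ping latency, and answer yes or no".

```python
from typing import Callable, List, Sequence


def bisect_sources(
    sources: Sequence[str],
    still_slow_when_disabled: Callable[[List[str]], bool],
) -> str:
    """Narrow down one problematic event source by halving the candidates.

    still_slow_when_disabled(disabled) is a hypothetical check that returns
    True if the server is still sluggish with `disabled` turned off.
    Assumes exactly one source causes the problem.
    """
    if not sources:
        raise ValueError("need at least one event source to test")
    candidates = list(sources)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if still_slow_when_disabled(half):
            # Problem persists, so the culprit is in the other half.
            candidates = candidates[len(half):]
        else:
            # Problem went away, so the culprit is in the disabled half.
            candidates = half
    return candidates[0]
```

With N event sources this takes about log2(N) policy changes instead of N. If more than one source contributes to the slowdown, the single-culprit assumption breaks and testing one by one is the safer route.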