Elastic Endpoint 8.13.0 constantly crashing on Server 2022

After upgrading from 8.12.1 to 8.13.0, the endpoint service on Windows Server 2022 machines is constantly crashing and restarting. This is not happening on Server 2016 or 2019 machines. The agent output is to Elasticsearch.

Looking through the endpoint logs there aren't any messages to indicate why the service crashed, and the crash appears to happen at different spots in the program.

Here are the logs from around the time of one crash. Note the PID change, which indicates a service restart:

{"@timestamp":"2024-03-29T18:32:05.7618361Z","agent":{"id":"8dc34bcb-c361-4469-9c3f-1dc6ded7f4a5","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"debug","origin":{"file":{"line":1095,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:1095 Channel connectivity state: 2","process":{"pid":10416,"thread":{"id":5628}}}
{"@timestamp":"2024-03-29T18:32:06.7635506Z","agent":{"id":"8dc34bcb-c361-4469-9c3f-1dc6ded7f4a5","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"debug","origin":{"file":{"line":1095,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:1095 Channel connectivity state: 2","process":{"pid":10416,"thread":{"id":5628}}}
{"@timestamp":"2024-03-29T18:32:23.642367Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":226,"name":"Logging.cpp"}}},"message":"Logging.cpp:226 Endpoint info: version: 8.13.0, compiled: Wed Mar 20 21:00:00 2024, branch: HEAD, commit: f90579240155fc17f659ed37f7864ab1194ed2ea","process":{"pid":3348,"thread":{"id":3148}}}

And here are the logs from another crash:

{"@timestamp":"2024-03-29T18:39:23.7973516Z","agent":{"id":"8dc34bcb-c361-4469-9c3f-1dc6ded7f4a5","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"debug","origin":{"file":{"line":257,"name":"Utility.cpp"}}},"message":"Utility.cpp:257 Document logging directory is: C:\\Program Files\\Elastic\\Endpoint\\state\\documents","process":{"pid":3348,"thread":{"id":7820}}}
{"@timestamp":"2024-03-29T18:39:23.7976467Z","agent":{"id":"8dc34bcb-c361-4469-9c3f-1dc6ded7f4a5","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"debug","origin":{"file":{"line":358,"name":"DocumentLogging.cpp"}}},"message":"DocumentLogging.cpp:358 Document logging directory size: 110656","process":{"pid":3348,"thread":{"id":7820}}}
{"@timestamp":"2024-03-29T18:39:24.4095944Z","agent":{"id":"8dc34bcb-c361-4469-9c3f-1dc6ded7f4a5","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"debug","origin":{"file":{"line":1095,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:1095 Channel connectivity state: 2","process":{"pid":3348,"thread":{"id":5936}}}
{"@timestamp":"2024-03-29T18:39:39.7810079Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":226,"name":"Logging.cpp"}}},"message":"Logging.cpp:226 Endpoint info: version: 8.13.0, compiled: Wed Mar 20 21:00:00 2024, branch: HEAD, commit: f90579240155fc17f659ed37f7864ab1194ed2ea","process":{"pid":540,"thread":{"id":7828}}}

There appears to be no consistency in what the service is doing right before it restarts. Memory and CPU usage for the service appear normal at the time of the crash.
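For anyone who wants to spot these restarts without reading the log by eye, here is a minimal Python sketch. It assumes the log is one JSON object per line with a process.pid field, as in the excerpts above. The function name find_restarts and the trimmed sample lines are only for illustration and are not part of any Elastic tooling.

```python
import json


def find_restarts(lines):
    """Return (timestamp, old_pid, new_pid) for each change in process.pid.

    Assumes one JSON document per line, shaped like the Endpoint log
    excerpts above. Lines that are blank, unparsable, or missing a pid
    are skipped.
    """
    restarts = []
    prev_pid = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written line
        pid = event.get("process", {}).get("pid")
        if pid is None:
            continue
        if prev_pid is not None and pid != prev_pid:
            restarts.append((event.get("@timestamp"), prev_pid, pid))
        prev_pid = pid
    return restarts


# Sample lines trimmed down from the log excerpts above (only the
# fields the function needs are kept).
sample = [
    '{"@timestamp":"2024-03-29T18:32:06.7635506Z","process":{"pid":10416}}',
    '{"@timestamp":"2024-03-29T18:32:23.642367Z","process":{"pid":3348}}',
    '{"@timestamp":"2024-03-29T18:39:24.4095944Z","process":{"pid":3348}}',
    '{"@timestamp":"2024-03-29T18:39:39.7810079Z","process":{"pid":540}}',
]

for ts, old_pid, new_pid in find_restarts(sample):
    print(f"{ts}: pid {old_pid} -> {new_pid}")
```

To run it against a real log, pass an open file object instead of the sample list. A PID change only shows that the process restarted, not why, so this just helps count and timestamp the restarts.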

Uninstalling and reinstalling endpoint does not fix the issue. Changing the agent to a different fleet policy, without endpoint, then changing back to a policy with endpoint also does not fix the issue. A reboot of the server also does not fix the issue.

Anyone else seeing this issue?

Hi @twilson. I'm sorry you're experiencing crashes. Could you please check for Endpoint crash dumps in the following locations?

C:\Program Files\Elastic\Endpoint\cache\elasticendpoint.dmp
C:\Program Files\Elastic\Endpoint\cache\CrashDumps\*.dmp

If found, would you mind sharing them with us? Memory dumps usually compress well, so consider zipping them. I created this secure upload link specific to this case.
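If it helps, the check and the zipping can be scripted. This is a rough Python sketch that looks in the two locations listed above and compresses whatever it finds. The helper names and the archive name endpoint_dumps.zip are made up for this example, and reading the Endpoint directory will likely need an elevated prompt.

```python
import glob
import zipfile
from pathlib import Path

# The two locations mentioned above. The second is a wildcard pattern.
DUMP_PATTERNS = [
    r"C:\Program Files\Elastic\Endpoint\cache\elasticendpoint.dmp",
    r"C:\Program Files\Elastic\Endpoint\cache\CrashDumps\*.dmp",
]


def find_dumps(patterns=DUMP_PATTERNS):
    """Return every existing file that matches one of the patterns."""
    found = []
    for pattern in patterns:
        found.extend(sorted(glob.glob(pattern)))
    return found


def zip_dumps(paths, archive_path):
    """Write the given files into a deflate-compressed zip archive."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            # Store only the file name, not the full directory path.
            zf.write(path, arcname=Path(path).name)
    return archive_path


if __name__ == "__main__":
    dumps = find_dumps()
    if dumps:
        # "endpoint_dumps.zip" is a hypothetical output name.
        print("Wrote", zip_dumps(dumps, "endpoint_dumps.zip"))
    else:
        print("No Endpoint crash dumps found.")
```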

I don't have a support account/contract, and the forum credentials aren't working for that link. Or maybe it's just Friday afternoon and I can't type my password properly.

There was a dump file in c:\program files\elastic\endpoint\cache.

Sorry. Try this link: Elastic Upload Service : Upload

File has been uploaded.

Additionally, I'm starting to see the same behavior on Server 2019 and 2016, just not as frequently as on Server 2022.

Received, thanks. Going through it now.

Thanks for the memory dump. I think I've found the culprit. Would you mind setting this in Defend Advanced Policy then hitting Save in the bottom right corner?

windows.advanced.events.api_disabled:SetThreadContext

[Screenshot: the setting as entered in Defend Advanced Policy]

I've made the change to the setting. I'll reply back on Monday to let you know if this helped.

Thanks. Have a great weekend.

After making the change on Friday, the endpoint service crashes stopped. We didn't have a single crash all weekend.

Thanks @twilson. We're disabling the problematic feature globally for now while engineering is working on a fix.

@twilson We just deployed a configuration update to the cloud that will prevent this issue on all affected systems, regardless of whether the aforementioned policy workaround has been applied. All systems with access to https://artifacts.security.elastic.co should receive it within one hour.
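To check whether a given server can reach that URL, something like the following Python sketch may be useful. It is only an approximation: it tests plain HTTPS connectivity from wherever the script runs, which is an assumption on my part about what "access" requires, and it does not go through Endpoint's own network stack or proxy settings. The function name can_reach and the opener parameter are illustrative.

```python
import urllib.error
import urllib.request

ARTIFACTS_URL = "https://artifacts.security.elastic.co"


def can_reach(url=ARTIFACTS_URL, timeout=10, opener=urllib.request.urlopen):
    """Return True if an HTTP(S) request to url gets any response at all.

    An HTTP error status (for example 403 or 404) still counts as
    reachable, because the server answered. DNS failures, refused
    connections, TLS errors, and timeouts count as unreachable.
    The opener parameter exists only so the function can be exercised
    without real network access.
    """
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, just not with a 2xx status
    except (urllib.error.URLError, OSError):
        return False


# Example use on the affected server:
#   print("reachable" if can_reach() else "not reachable")
```

A True result here does not prove the update has been applied. It only rules out basic connectivity as the reason for not receiving it.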