Elastic Endpoint 8.13.0 constantly crashing on Server 2022

twilson · March 29, 2024, 7:30pm

After upgrading from 8.12.1 to 8.13.0, the endpoint service on Windows Server 2022 machines is constantly crashing and restarting. This is not happening on Server 2016 or 2019 machines. The agent output goes to Elasticsearch.

Looking through the endpoint logs there aren't any messages to indicate why the service crashed, and the crash appears to happen at different spots in the program.

Here are the logs from the time of one crash. Note the pid change, which indicates a service restart:

{"@timestamp":"2024-03-29T18:32:05.7618361Z","agent":{"id":"8dc34bcb-c361-4469-9c3f-1dc6ded7f4a5","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"debug","origin":{"file":{"line":1095,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:1095 Channel connectivity state: 2","process":{"pid":10416,"thread":{"id":5628}}}
{"@timestamp":"2024-03-29T18:32:06.7635506Z","agent":{"id":"8dc34bcb-c361-4469-9c3f-1dc6ded7f4a5","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"debug","origin":{"file":{"line":1095,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:1095 Channel connectivity state: 2","process":{"pid":10416,"thread":{"id":5628}}}
{"@timestamp":"2024-03-29T18:32:23.642367Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":226,"name":"Logging.cpp"}}},"message":"Logging.cpp:226 Endpoint info: version: 8.13.0, compiled: Wed Mar 20 21:00:00 2024, branch: HEAD, commit: f90579240155fc17f659ed37f7864ab1194ed2ea","process":{"pid":3348,"thread":{"id":3148}}}

Here are the logs from another crash:

{"@timestamp":"2024-03-29T18:39:23.7973516Z","agent":{"id":"8dc34bcb-c361-4469-9c3f-1dc6ded7f4a5","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"debug","origin":{"file":{"line":257,"name":"Utility.cpp"}}},"message":"Utility.cpp:257 Document logging directory is: C:\\Program Files\\Elastic\\Endpoint\\state\\documents","process":{"pid":3348,"thread":{"id":7820}}}
{"@timestamp":"2024-03-29T18:39:23.7976467Z","agent":{"id":"8dc34bcb-c361-4469-9c3f-1dc6ded7f4a5","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"debug","origin":{"file":{"line":358,"name":"DocumentLogging.cpp"}}},"message":"DocumentLogging.cpp:358 Document logging directory size: 110656","process":{"pid":3348,"thread":{"id":7820}}}
{"@timestamp":"2024-03-29T18:39:24.4095944Z","agent":{"id":"8dc34bcb-c361-4469-9c3f-1dc6ded7f4a5","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"debug","origin":{"file":{"line":1095,"name":"AgentComms.cpp"}}},"message":"AgentComms.cpp:1095 Channel connectivity state: 2","process":{"pid":3348,"thread":{"id":5936}}}
{"@timestamp":"2024-03-29T18:39:39.7810079Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":226,"name":"Logging.cpp"}}},"message":"Logging.cpp:226 Endpoint info: version: 8.13.0, compiled: Wed Mar 20 21:00:00 2024, branch: HEAD, commit: f90579240155fc17f659ed37f7864ab1194ed2ea","process":{"pid":540,"thread":{"id":7828}}}

There appears to be no consistency in what is causing the service restart. Memory and CPU usage for the service appear normal at the time of the crash.
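
Spotting the pid changes by eye gets tedious in a long log. Below is a minimal sketch that scans newline-delimited JSON log lines and reports each point where `process.pid` changes. It assumes every entry carries `@timestamp` and `process.pid`, as in the excerpts above. The function name `find_restarts` and the trimmed sample lines are illustrative, not part of Endpoint.

```python
import json


def find_restarts(lines):
    """Return one record per pid change seen in NDJSON log lines."""
    restarts = []
    prev = None  # (timestamp, pid) of the last entry that had a pid
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        pid = (entry.get("process") or {}).get("pid")
        if pid is None:
            continue
        ts = entry.get("@timestamp")
        if prev is not None and pid != prev[1]:
            restarts.append({
                "last_seen": prev[0],   # last entry logged by the old process
                "restarted": ts,        # first entry logged by the new process
                "old_pid": prev[1],
                "new_pid": pid,
            })
        prev = (ts, pid)
    return restarts


# Sample entries trimmed from the first excerpt above (only the fields used here).
sample = [
    '{"@timestamp":"2024-03-29T18:32:06.7635506Z","process":{"pid":10416}}',
    '{"@timestamp":"2024-03-29T18:32:23.642367Z","process":{"pid":3348}}',
]
for restart in find_restarts(sample):
    print(restart)
```

On the sample this reports a single restart, from pid 10416 to pid 3348. Passing an open log file instead of the list works the same way, since the function only iterates over lines.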

Uninstalling and reinstalling endpoint does not fix the issue. Changing the agent to a different fleet policy, without endpoint, then changing back to a policy with endpoint also does not fix the issue. A reboot of the server also does not fix the issue.

Anyone else seeing this issue?

gabriel.landau · March 29, 2024, 7:44pm

Hi @twilson. I'm sorry you're experiencing crashes. Could you please check for Endpoint crash dumps in the following locations?

C:\Program Files\Elastic\Endpoint\cache\elasticendpoint.dmp
C:\Program Files\Elastic\Endpoint\cache\CrashDumps\*.dmp

If found, would you mind sharing them with us? Memory dumps usually compress well, so consider zipping them. I created this secure upload link specific to this case.
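
If it helps, here is a rough sketch for gathering the dumps from the two locations above into one zip file. The function names and the archive path are placeholders. Reading the Endpoint directory will probably need an elevated prompt, which is an assumption about the host's permissions.

```python
import glob
import os
import zipfile

# Cache directory taken from the paths listed above.
CACHE_DIR = r"C:\Program Files\Elastic\Endpoint\cache"


def find_dumps(cache_dir=CACHE_DIR):
    """List elasticendpoint.dmp and any CrashDumps\\*.dmp under cache_dir."""
    base = glob.escape(cache_dir)  # treat the directory name literally
    patterns = [
        os.path.join(base, "elasticendpoint.dmp"),
        os.path.join(base, "CrashDumps", "*.dmp"),
    ]
    found = []
    for pattern in patterns:
        found.extend(glob.glob(pattern))
    return sorted(found)


def zip_dumps(cache_dir, archive_path):
    """Compress every dump found into archive_path; return the archived names."""
    dumps = find_dumps(cache_dir)
    if not dumps:
        return []  # nothing to upload, so no archive is created
    names = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in dumps:
            # Store paths relative to the cache directory to avoid name clashes.
            name = os.path.relpath(path, start=cache_dir)
            zf.write(path, arcname=name)
            names.append(name)
    return names


# Listing only; this prints nothing if no dumps exist at the default path.
for dump in find_dumps():
    print(dump)
```

Calling `zip_dumps(CACHE_DIR, r"C:\Temp\endpoint-dumps.zip")` would then produce a single file to upload; the output path here is only an example.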

twilson · March 29, 2024, 8:08pm

I don't have a support account/contract, and the forum credentials aren't working for that link. Or maybe it's just Friday afternoon and I can't type my password properly.

There was a dump file in c:\program files\elastic\endpoint\cache.

gabriel.landau · March 29, 2024, 8:10pm

Sorry. Try this link: Elastic Upload Service : Upload

twilson · March 29, 2024, 8:23pm

File has been uploaded.

Additionally, I'm starting to see the same behavior on Server 2019 and 2016, just not as frequently as is happening on Server 2022.

gabriel.landau · March 29, 2024, 8:25pm

Received, thanks. Going through it now.

gabriel.landau · March 29, 2024, 8:33pm

Thanks for the memory dump. I think I've found the culprit. Would you mind setting this in Defend Advanced Policy then hitting Save in the bottom right corner?

windows.advanced.events.api_disabled:SetThreadContext

It looks like this:

twilson · March 29, 2024, 8:49pm

I've made the change to the setting. I'll reply back on Monday to let you know if this helped.

gabriel.landau · March 29, 2024, 8:51pm

Thanks. Have a great weekend.

twilson · April 1, 2024, 12:57pm

After making the change on Friday the endpoint service crashes have stopped - didn't have a single crash all weekend.

gabriel.landau · April 1, 2024, 2:05pm

Thanks @twilson. We're disabling the problematic feature globally for now while engineering is working on a fix.

gabriel.landau · April 2, 2024, 4:49pm

@twilson We just deployed a configuration update to the cloud that will prevent this issue on all affected systems, regardless of whether the aforementioned policy workaround has been applied. All systems with access to https://artifacts.security.elastic.co should receive it within one hour.
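
To check whether a given server can reach that host, something like the sketch below may be enough. It only tests HTTPS connectivity from whatever account and proxy settings the script runs under, which may differ from what the Endpoint service uses. The function name `can_reach` and the `opener` parameter are illustrative.

```python
import urllib.error
import urllib.request

ARTIFACTS_URL = "https://artifacts.security.elastic.co"


def can_reach(url=ARTIFACTS_URL, timeout=10, opener=urllib.request.urlopen):
    """Return True if the server answers at all, False on connection failure."""
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded with an error status, so the host is reachable.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, TLS error, and similar.
        return False
```

Running `can_reach()` on an affected machine returns `False` when the host cannot be contacted, in which case the policy workaround above is still the way to go.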
