On Windows, services have a variety of automatic restart options, depending on the failure mode (similar to the restart options of a systemd service).
The problem is that the FAILURE_ACTIONS_ON_NONCRASH_FAILURES (or "Enable actions for stops with errors") option isn't enabled, which essentially means a non-zero exit code isn't treated as a restartable failure.
When installing Elastic Agent from the official Windows MSI package, this is the service config I end up with:
PS C:\> sc.exe qfailure 'Elastic Agent'
[SC] QueryServiceConfig2 SUCCESS
SERVICE_NAME: Elastic Agent
RESET_PERIOD (in seconds) : 10
REBOOT_MESSAGE :
COMMAND_LINE :
FAILURE_ACTIONS : RESTART -- Delay = 15000 milliseconds.
PS C:\> sc.exe qfailureflag 'Elastic Agent'
[SC] QueryServiceConfig2 SUCCESS
SERVICE_NAME: Elastic Agent
FAILURE_ACTIONS_ON_NONCRASH_FAILURES: FALSE
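To check this flag on more than one machine, the query output above could be parsed rather than read by eye. This is only a minimal sketch: `nonCrashFlag` is a hypothetical helper, and it assumes the output keeps the `NAME: VALUE` layout shown above and that an enabled flag is printed as `TRUE`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// nonCrashFlag scans text in the format printed by `sc.exe qfailureflag`
// and returns the value of FAILURE_ACTIONS_ON_NONCRASH_FAILURES.
// found is false when the output contains no such line.
// Assumption: an enabled flag is printed as "TRUE".
func nonCrashFlag(output string) (enabled, found bool) {
	const key = "FAILURE_ACTIONS_ON_NONCRASH_FAILURES"
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		// Split each line at the first colon into a name and a value.
		name, value, ok := strings.Cut(sc.Text(), ":")
		if !ok || strings.TrimSpace(name) != key {
			continue
		}
		return strings.EqualFold(strings.TrimSpace(value), "TRUE"), true
	}
	return false, false
}

func main() {
	// Sample text copied from the query output above.
	sample := `[SC] QueryServiceConfig2 SUCCESS
SERVICE_NAME: Elastic Agent
FAILURE_ACTIONS_ON_NONCRASH_FAILURES: FALSE`

	enabled, found := nonCrashFlag(sample)
	fmt.Printf("found=%v enabled=%v\n", found, enabled)
}
```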
This works for restarting the service when it crashes in a completely unhandled fashion.
But it won't restart the service if it manages to fail "cleanly" by returning a non-zero exit code. Something like a Go panic, or os.Exit(1), doesn't trigger a restart.
While this can sometimes be desirable, nothing in the code suggests to me that it is intended here. My impression is that the intent is simply to restart on any failure.
For more information on the semantics of the SERVICE_FAILURE_ACTIONS_FLAG structure (winsvc.h), you'll need to search Microsoft Docs yourself (unfortunately, the forum blocks me from linking to it).
I think this flag should be set, so that the service is restarted whenever it stops with an error, not only when it crashes.
The command-line equivalent of setting the flag is:
sc.exe failureflag "Elastic Agent" 1
(But I'm assuming this would be set programmatically.)
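As a rough illustration of doing this from code, the sketch below simply shells out to the same sc.exe command from Go. The helper names `failureFlagArgs` and `setFailureFlag` are hypothetical, and this is not how the Elastic Agent installer works today. An installer would more likely call the Win32 ChangeServiceConfig2 API with SERVICE_CONFIG_FAILURE_ACTIONS_FLAG directly. Changing the setting needs an elevated (administrator) process.

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
)

// failureFlagArgs builds the arguments for `sc.exe failureflag <service> <0|1>`.
func failureFlagArgs(service string, enable bool) []string {
	flag := "0"
	if enable {
		flag = "1"
	}
	return []string{"failureflag", service, flag}
}

// setFailureFlag runs sc.exe to change the flag for the given service.
// It refuses to run on anything other than Windows.
func setFailureFlag(service string, enable bool) error {
	if runtime.GOOS != "windows" {
		return fmt.Errorf("sc.exe is only available on Windows (running on %s)", runtime.GOOS)
	}
	out, err := exec.Command("sc.exe", failureFlagArgs(service, enable)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("sc.exe failureflag failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Service name taken from the sc.exe output earlier in this post.
	if err := setFailureFlag("Elastic Agent", true); err != nil {
		fmt.Println("error:", err)
	}
}
```

Afterwards, `sc.exe qfailureflag 'Elastic Agent'` can be used to confirm the change.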
In practice, without this flag, a failed agent either has to be restarted manually, or additional configuration has to be deployed to override the default behaviour.
Thoughts? Should this end up as a GitHub Issue?