Agent randomly stops sending data but still checks in to fleet

Hi everybody,

In case it is related: the agents were enrolled following the procedure described in this topic, https://discuss.elastic.co/t/agent-stuck-on-updating-when-enrolling/277703, because the enrollment process failed otherwise.

For some reason, even though the agents show up as healthy in Kibana, we stop receiving their data after some time. The same agents work perfectly for a few days before they stop sending data.

From the logs we extracted the following lines:

{"log.level":"error","@timestamp":"2021-07-01T21:04:03.411+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 503, fleet-server returned an error: ServiceUnavailable, message: server is stopping","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-01T21:26:46.605+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 503, fleet-server returned an error: ServiceUnavailable, message: server is stopping","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-01T21:39:46.169+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: ","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-02T00:23:42.924+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: ","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-02T03:29:30.201+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: ","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-02T07:08:32.961+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: ","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-02T08:13:37.392+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: ","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-02T21:17:42.031+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 503, fleet-server returned an error: ServiceUnavailable, message: server is stopping","ecs.version":"1.6.0"}
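In case it helps anyone triaging similar output, here is a minimal sketch of how log lines like the ones above could be grouped by failure type. It assumes one JSON object per line with the flat keys shown above ("log.level", "@timestamp", "message"). The function names and category labels are made up for illustration and are not part of Elastic Agent.

```python
import json
from collections import Counter


def classify(message):
    """Map an agent error message to a coarse category (labels are illustrative)."""
    if "status code: 503" in message:
        return "503 ServiceUnavailable"
    if "could not decode the response" in message:
        return "undecodable response"
    if "actively refused" in message or "connection refused" in message:
        return "connection refused"
    return "other"


def summarize(lines):
    """Count error-level entries per category and record when each was first seen."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip truncated or non-JSON lines instead of failing.
            continue
        if not isinstance(entry, dict) or entry.get("log.level") != "error":
            continue
        kind = classify(entry.get("message", ""))
        counts[kind] += 1
        # Keep the timestamp string as-is; the first occurrence wins.
        first_seen.setdefault(kind, entry.get("@timestamp"))
    return counts, first_seen
```

Passing an open agent log file to `summarize` returns the per-category counts and the first timestamp for each, which makes it easier to line up the 503 and decode errors with whatever was happening on the proxy or fleet-server at that time.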

At least two agents failed at approximately the same time, while another four kept going. They are all synced with the same fleet-server, which has never stopped.

Unfortunately, restarting the agent does not help. We have to uninstall it and re-enroll it in Fleet.

What could be the source of this problem?

Thanks for your help.

Did you notice any errors in Kibana logs?

Not that we have noticed so far. Unfortunately, we don't have the Kibana logs from the times when these events happened. We now keep them for a much longer period, so we will check them when an agent disconnects again.

So this happened again last night.
We got the following log from the agent:

"Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post \"https://fleet.ourdomain.com.com:443/api/fleet/agents/218e0401-7918-4358-b730-b95357fefe68/checkin?\": dial tcp XX.XX.XX.XX:443: connectex: No connection could be made because the target machine actively refused it.","ecs.version":"1.6.0"}

Nothing on the Kibana side.
Weirdly, 6 other agents using the same URL/fleet-server kept running.
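
Since the error above is a refused TCP connection rather than an HTTP error, one thing that could be tried from an affected machine is a plain TCP probe of the fleet-server endpoint. The sketch below uses only the Python standard library. The function name `probe` and its return labels are made up for illustration, and this only checks that the port accepts connections, not TLS or the Fleet API itself.

```python
import socket


def probe(host, port=443, timeout=5.0):
    """Try a TCP connection and report 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The target (or something in front of it) actively rejected the connection.
        return "refused"
    except socket.timeout:
        # No answer within the timeout, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable network and similar.
        return f"error: {exc}"
```

Running `probe` with the fleet-server hostname and port 443 on both a working and a "ghost" machine at the moment the error appears would show whether the refusal is specific to some hosts or affects every client of the proxy.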

Another interesting note: when we change the policy for the "ghost" agent, the update is applied correctly, so it seems the agent can communicate with fleet-server. However, neither the earlier (unreceived) logs nor new logs are sent to Kibana (or at least they are not visible there).

We will run Caddy in debug mode to see whether any errors show up the next time this happens.