Agent randomly stops sending data but still checks in to fleet

greggailly · July 3, 2021, 9:56am

Hi everybody,

In case it is related: the agents were enrolled following the procedure described in this topic, https://discuss.elastic.co/t/agent-stuck-on-updating-when-enrolling/277703, because the enrollment process failed otherwise.

For some reason, even though the agents show up as healthy in Kibana, we stop receiving their data after some time. The same agents work perfectly for a few days before they stop sending data.

From the logs we extracted the following lines:

{"log.level":"error","@timestamp":"2021-07-01T21:04:03.411+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 503, fleet-server returned an error: ServiceUnavailable, message: server is stopping","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-01T21:26:46.605+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 503, fleet-server returned an error: ServiceUnavailable, message: server is stopping","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-01T21:39:46.169+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: ","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-02T00:23:42.924+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: ","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-02T03:29:30.201+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: ","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-02T07:08:32.961+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: ","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-02T08:13:37.392+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: ","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-07-02T21:17:42.031+0200","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 503, fleet-server returned an error: ServiceUnavailable, message: server is stopping","ecs.version":"1.6.0"}

At least two agents failed at approximately the same time. However, another 4 kept going. They are all synced with the same fleet-server, which has never stopped.
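To compare failure times between agents, the error lines can be grouped by their apparent cause. This is a minimal sketch, assuming the agent log has one JSON object per line with the `log.level`, `@timestamp` and `message` fields seen in the excerpt above. The category names and the `classify`/`summarize` helpers are made up for illustration and are not part of Elastic Agent.

```python
import json
from collections import Counter


def classify(message):
    """Bucket a fleet gateway error message by its apparent cause.

    The substrings come from the log excerpt in this thread; the category
    labels themselves are hypothetical.
    """
    if "status code: 503" in message:
        return "503 server is stopping"
    if "could not decode the response" in message:
        return "undecodable response"
    if "actively refused" in message:
        return "connection refused"
    return "other"


def summarize(lines):
    """Return (Counter of categories, list of (timestamp, category)) for error lines."""
    counts = Counter()
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict) or entry.get("log.level") != "error":
            continue
        category = classify(entry.get("message", ""))
        counts[category] += 1
        events.append((entry.get("@timestamp"), category))
    return counts, events
```

Run against the eight lines above, `summarize` would report three "503 server is stopping" entries and five "undecodable response" entries. Running it on the log of each affected agent and comparing the timestamps should show whether the failures line up.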

Unfortunately, restarting the agent does not help. When this happens we have to uninstall it and re-enroll it in Fleet.

What could be the source of this problem?

Thanks for your help.

mtojek · July 5, 2021, 9:10am

Did you notice any errors in Kibana logs?

greggailly · July 5, 2021, 8:19pm

Not that we have noticed so far. Unfortunately we don't have the Kibana logs from when this kind of event happened. We now keep them for a much longer period, so we will check them when an agent disconnects again.

greggailly · July 6, 2021, 9:05am

So this happened again last night.
We got the following log from the agent:

"Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post \"https://fleet.ourdomain.com.com:443/api/fleet/agents/218e0401-7918-4358-b730-b95357fefe68/checkin?\": dial tcp XX.XX.XX.XX:443: connectex: No connection could be made because the target machine actively refused it.","ecs.version":"1.6.0"}

Nothing on the Kibana side.
Weirdly, 6 other agents using the same URL/fleet-server kept running.
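Since the error is a refused connection, it may help to test plain TCP reachability from the affected machine at the moment the agent fails. This is a rough sketch using only the Python standard library; `can_connect` is a hypothetical helper, and the host and port passed to it would be whatever the agent is configured to reach.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Try to open a TCP connection and return (ok, reason).

    This only tests the TCP handshake. It does not perform TLS or call
    the fleet-server check-in API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except ConnectionRefusedError:
        # Same condition as "target machine actively refused it" on Windows.
        return False, "refused"
    except socket.timeout:
        return False, "timeout"
    except OSError as exc:
        # DNS failures, unreachable network, and similar errors.
        return False, "error: {}".format(exc)
```

For example, `can_connect("fleet.example.com", 443)` with the real hostname in place of the placeholder. If this returns "refused" on the failing machine while the same call succeeds on a working one, the problem is more likely in the network path or reverse proxy than in the agent itself.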

Another interesting note: when changing the policy for the "ghost" agent, the update is applied correctly, so it seems the agent can communicate with fleet-server. However, neither the earlier (unreceived) logs nor new logs are sent to Kibana (or at least they are not visible there).

We will put Caddy under debug mode to check if any errors can be detected the next time this happens.

system · August 3, 2021, 11:06am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
