Elastic Agents frequently go offline and come back online

Hi All,

Version: 8.9
OS: Windows Server
I ran into this problem recently.
It does not affect all Elastic Agents; it happens on 4 of them.

The problem is that the agent status shows as offline when I check it in the Kibana GUI.
But the agents still send logs without issue, so I think there is a communication problem between Elastic Agent and Fleet Server.

But I cannot work out the real cause.

I'm sharing the agent's logs.

07:08:13.030
elastic_agent
[elastic_agent][error] Checkin request to fleet-server succeeded after 1 failures
07:21:29.180
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying
07:28:51.054
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying
07:41:44.423
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying
07:54:34.937
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying
08:11:25.420
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying
08:20:53.701
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying
08:30:14.602
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying
08:42:26.377
elastic_agent
[elastic_agent][error] Checkin request to fleet-server succeeded after 9 failures
09:03:33.907
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying
09:14:47.144
elastic_agent
[elastic_agent][error] Checkin request to fleet-server succeeded after 3 failures
09:30:03.732
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying
09:37:10.745
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying
09:53:40.139
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying
09:53:40.139
elastic_agent
[elastic_agent][error] Cannot checkin in with fleet-server, retrying

Another odd thing I noticed on the Fleet Server is that it is listening on port 8220 over IPv6.

It is also listening on localhost:8221 over IPv4. I do not know how that happened, because we set it up for IPv4.

Any help would be appreciated.

{"log.level":"warn","@timestamp":"2023-11-07T11:47:48.571Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://[IP]:8220/ errored: Post \"https://[IP]:8220/api/fleet/agents/c8838828-5def-49dd-89ce-69fa3acefd50/checkin?\": read tcp [IP]:49630->[IP]:8220: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.\n\n"},"request_duration_ns":180379102200,"failed_checkins":2,"retry_after_ns":178401279067,"ecs.version":"1.6.0"}

This is the actual error. It might help to pin down the problem.
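To put the numbers in that entry in perspective, the two nanosecond fields can be converted to seconds. This is only a minimal Python sketch: the values are copied from the log entry, and `ns_to_seconds` is an illustrative helper, not part of any Elastic tooling.

```python
# Values copied from the warn-level log entry above.
checkin_fields = {
    "request_duration_ns": 180379102200,
    "retry_after_ns": 178401279067,
}


def ns_to_seconds(ns: int) -> float:
    """Convert a duration in nanoseconds to seconds."""
    return ns / 1_000_000_000


for name, value in checkin_fields.items():
    # Prints each field rounded to one decimal place.
    print(f"{name}: {ns_to_seconds(value):.1f} s")
```

If I read the fields correctly, the failed check-in request was open for about 180 seconds before the read failed, and the agent then scheduled its retry roughly 178 seconds later.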

Any help, guys?

Hi @MichelLaterman,

Have you had a chance to look at this case?

Regards

This looks like a network issue.
Running on port 8221 is expected (fleet-server starts a local communication port for the local elastic-agent that runs it).

Do you have proxies on your network between the agents and fleet-server?
Can you provide a diagnostics bundle?

Hi @MichelLaterman,

Thanks for the response.

There is no proxy between the agents and Fleet Server, just a firewall, and the firewall logs show no dropped traffic.
Also, whenever I telnet to the Fleet Server, I can reach it.
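For reference, the telnet check amounts to a plain TCP connect. Below is a minimal Python sketch of the same test; `can_connect` is an illustrative helper and the host in the comment is a placeholder, neither comes from Elastic tooling. It only shows that a short-lived connection can be opened; it says nothing about whether a connection that stays open for minutes survives.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call (placeholder host, replace with the real Fleet Server address):
# can_connect("<fleet-server-host>", 8220)
```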

How can I share the diagnostics bundle in a safe way?

Regards
