Hello, I installed Fleet Server in the cluster and then installed several Windows and Linux agents. The status of all is healthy, but every few minutes these agents are displayed as offline. Although they are sending logs and are active, their status is offline. I am using version 8.17.
Hello,
This suggests an issue in the communication between the agents and the Fleet Server. Check their logs.
The health checks are done against the Fleet Server, but the data is sent to Elasticsearch. These are two separate connections, so an agent can appear as offline while it is still sending data.
I also checked the logs. I also searched. Many people said to change the check-in time. Is that correct? Explain more clearly.
Said where? This is not correct. You have no control over the interval at which the Agent checks in with the Fleet Server.
What do you have in the logs for the Agent?
As mentioned, if the agent appears as offline, it has some issue connecting with the Fleet Server.
Let me define the problem one more time:
1. Before we install an agent, we check the system time and timezone, and then the connection between the agent and the Fleet Server. Everything is OK: there is no problem with the connection or with the agent host. Even while an agent is shown as offline, we log in to its host and find no problem. It can connect to the Fleet Server on port 8220 and to Elasticsearch on port 9200, and logs are arriving in the cluster correctly, but the agent is still offline.
2. You said that we cannot change the check-in behavior of the agent, but in the elastic-agent.yml of the Fleet Server there is a section named agent-retry where you can change the connection timeout and retry settings.
3. On other forums, people who are experts in Elasticsearch say they have experienced this problem too.
4. Can other things, like the amount of RAM or CPU, cause this problem?
5. One more thing: even when logs are being received and the agent is online and healthy, there are a lot of errors in the agent logs that say: checkin with fleet server error, retrying
Elastic Agent can be managed in two ways: Fleet-managed, where you have a Fleet Server and manage your policies and agents through the Fleet UI, or standalone, where you manage your agents and policies using yml files.
If you are managing Elastic Agents with Fleet Server, you have Fleet-managed agents, and with Fleet-managed agents you cannot change anything in the elastic-agent.yml file. You also do not have access to all of the available configuration options, so with Fleet-managed agents you have no control over settings like these.
This is what I mentioned in my previous posts: if your Agent is showing as offline in Fleet, it means that it is having issues communicating with the Fleet Server, and this kind of error confirms that.
There is not much else to troubleshoot here. You need to check whether your network is having any issues or whether the server that is running the Fleet Server is having any issues.
What are the specs of your Fleet Server? How many agents do you have?
First, thank you for your answer.
My Fleet Server machine has a good amount of RAM and CPU, but the problem still happens.
We have, for example, 15 or 20 agents, and when we change anything in the agent policies, the agents receive and apply that change without any problem. This shows us that the connection between Fleet and the agents has no problem. Also, when you monitor the Fleet page you can see that agents continuously go online and offline, and this behavior shows us that the connection is OK.
Also, one day I wrote a Linux service to restart the check-in process every two or three minutes. The problem was solved and the agent was always healthy, but this approach created a log entry each time it restarted the process, something like: "check-in retry loop was stopped"
I think this is a bug, because for testing I created a single-node machine that has everything on it, and also installed the Fleet Server on it. The strange thing is that this Fleet Server, which was on the same machine as the cluster with no other machine or networking device between them, went offline many times.
What are the specs? How much CPU and RAM? Please provide more context.
This does not show that the connection has no problems. It shows that the agents can connect to the Fleet Server, but a network issue may be intermittent, so the connection can work at some times and fail at others. Agents continuously going online and offline is a symptom of an intermittent issue.
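To check for an intermittent problem like this, it can help to probe the Fleet Server port repeatedly over a longer window instead of testing it once. Below is a minimal sketch in Python, assuming the Fleet Server listens on port 8220 as mentioned earlier in this thread. The function names are made up for illustration. It only tests TCP reachability, not TLS or the check-in request itself, so a clean run does not rule out every problem.

```python
import socket
import time
from datetime import datetime, timezone


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> tuple[bool, float]:
    """Attempt one TCP connection; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        ok = False
    return ok, time.monotonic() - start


def run_probe(host: str, port: int, attempts: int, interval: float) -> list[bool]:
    """Probe host:port repeatedly, printing one timestamped line per attempt."""
    results = []
    for _ in range(attempts):
        ok, elapsed = probe_tcp(host, port)
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        print(f"{stamp} {host}:{port} {'ok' if ok else 'FAILED'} {elapsed:.3f}s")
        results.append(ok)
        time.sleep(interval)
    return results


def summarize(results: list[bool]) -> dict:
    """Count failed attempts and compute the failure rate."""
    total = len(results)
    failures = results.count(False)
    rate = failures / total if total else 0.0
    return {"total": total, "failures": failures, "failure_rate": rate}
```

For example, calling run_probe("fleet-server.example.internal", 8220, attempts=120, interval=5.0) from an agent host (the host name is a placeholder) probes for roughly ten minutes. If any FAILED lines line up with the times the agent shows as offline in Fleet, that points to the network or the Fleet Server host rather than the agent.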
You are restarting the agent every 2 or 3 minutes? Do not do that. It will solve nothing and can create more problems. The log you mentioned is caused by the constant restart of the service.
Before concluding that this is a bug, you first need to rule out any infrastructure and network problems, which has not been done yet.
You need to provide more context. Please share the specs of your Fleet Server, some logs from your agents from when they appear offline, and logs from your Fleet Server in the same time window. Without this context it is not possible to troubleshoot, but agents going online and offline is a symptom of a communication issue.
Also share a screenshot of your Fleet settings: go into the Fleet app in Kibana, open the Settings tab, and take some screenshots to share.
I didn't understand what you did here and what happened. You created a new single-node cluster with a Fleet Server and then the current Fleet Server went offline? If this happened, there is something wrong with your network or configuration, but as mentioned, you need to provide a lot more context. You didn't share any logs or settings.