Fleet Server loses agents after restart

Hello

I'm a student getting started with Elastic, and I have set up my stack with Elasticsearch and Kibana.

I have an issue: when I reboot my Fleet Server, all of my Elastic Agents go offline (the elastic-agent service is enabled on the server and is running after the reboot).

Is there a way to keep agents connected to the Fleet Server in case of a crash, or must we build it so that it can never go fully down (redundancy)?

Thank you very much!

Hello,

After a Fleet Server restart, if you wait a few minutes, don't the agents come back online? The agents should continue to check in after a reboot; if they don't, there could be a different issue.
Are you using the latest stack version?

There is a guide on setting up monitoring to make sure Fleet Server is healthy: Monitor a self-managed Fleet Server | Fleet and Elastic Agent Guide [master] | Elastic
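To verify locally whether Fleet Server actually came back after the reboot, you can check the agent status and query Fleet Server's status endpoint (this assumes the default port 8220; -k because of the self-signed certificate):

elastic-agent status
curl -sk https://localhost:8220/api/status

If the status endpoint does not answer, the fleet-server component itself is down even though the elastic-agent service is running.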

I can confirm the issue; I'm on an 8.6.1 stack.

After the reboot, the service appears to be running (according to systemctl):

root@SRV-ElasticSearchAgent:~# systemctl status elastic-agent
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-02-14 16:51:33 CET; 5min ago
   Main PID: 526 (elastic-agent)
      Tasks: 63 (limit: 9460)
     Memory: 618.7M
        CPU: 12.576s
     CGroup: /system.slice/elastic-agent.service
             ├─526 /opt/Elastic/Agent/elastic-agent
             ├─600 /opt/Elastic/Agent/data/elastic-agent-b8553c/components/filebeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E manageme>
             ├─602 /opt/Elastic/Agent/data/elastic-agent-b8553c/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E manage>
             ├─605 /opt/Elastic/Agent/data/elastic-agent-b8553c/components/fleet-server --agent-mode -E logging.level=debug -E logging.to_stderr=true -E h>
             ├─606 /opt/Elastic/Agent/data/elastic-agent-b8553c/components/filebeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E manageme>
             ├─613 /opt/Elastic/Agent/data/elastic-agent-b8553c/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E manage>
             └─617 /opt/Elastic/Agent/data/elastic-agent-b8553c/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E manage>

févr. 14 16:51:33 SRV-ElasticSearchAgent systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..

And in the Fleet UI, I don't see my two agents as online and healthy after the reboot.
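For completeness, here is how I can pull the agent's own logs from the systemd journal after the reboot (standard journalctl options, using the unit shown above):

journalctl -u elastic-agent --no-pager -n 200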

I installed the Fleet Server agent following the quick-start instructions; the last command to install it was:

./elastic-agent install \
  --fleet-server-es=https://<elastic-ip>:9200 \
  --fleet-server-service-token=<token-provided> \
  --fleet-server-policy=fleet-server-policy \
  --fleet-server-es-ca-trusted-fingerprint=8fc9cf3da149e5b4d2b6aea8e9e2143cb3d279a0bf9391e50f1b1274db793b07 \
  --insecure

I don't know whether there is a config file I should edit. I don't have TLS set up on the stack for the moment, and I was receiving agent data without issue before the server reboot.
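From what I can see, the enrollment settings the agent uses are stored next to the binary (assuming the default Linux install path; fleet.yml is where the Fleet enrollment details live):

ls /opt/Elastic/Agent/
cat /opt/Elastic/Agent/fleet.yml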

To complete my message: I am still receiving logs from the agents, so I think the issue is only between the agents and the Fleet Server.

Here is the Fleet Server log:

And here are the Windows agent logs:

15:53:09.605 [elastic_agent][info] Unit state changed system/metrics-default (STARTING->HEALTHY): Healthy
16:51:01.939 [elastic_agent][warn] Possible transient error during checkin with fleet-server, retrying
16:52:12.802 [elastic_agent][warn] Unit state changed endpoint-default-e50a5103-f72f-453e-9b84-ab19e42b9462 (HEALTHY->DEGRADED): Applied policy {e50a5103-f72f-453e-9b84-ab19e42b9462}
16:52:12.803 [elastic_agent][warn] Unit state changed endpoint-default (HEALTHY->DEGRADED): Applied policy {e50a5103-f72f-453e-9b84-ab19e42b9462}
16:52:39.331 [elastic_agent][warn] Possible transient error during checkin with fleet-server, retrying
16:52:52.815 [elastic_agent][info] Unit state changed endpoint-default-e50a5103-f72f-453e-9b84-ab19e42b9462 (DEGRADED->HEALTHY): Applied policy {e50a5103-f72f-453e-9b84-ab19e42b9462}
16:52:52.815 [elastic_agent][info] Unit state changed endpoint-default (DEGRADED->HEALTHY): Applied policy {e50a5103-f72f-453e-9b84-ab19e42b9462}
16:55:28.698 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
17:03:10.587 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
17:15:10.081 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
17:23:42.070 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
17:30:41.501 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
17:42:38.325 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
17:49:08.067 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
17:57:51.593 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
18:07:32.657 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
18:16:03.454 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
18:28:48.687 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
18:41:41.563 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
18:51:47.884 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
19:03:58.562 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
19:18:48.143 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
19:24:57.999 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
19:30:46.553 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
19:42:26.619 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
19:50:05.698 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
19:58:55.743 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
20:11:37.533 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
20:19:04.464 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
20:29:08.276 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
20:42:10.242 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
20:51:19.503 [elastic_agent][info] Unit state changed endpoint-default-e50a5103-f72f-453e-9b84-ab19e42b9462 (HEALTHY->CONFIGURING): Applied policy {e50a5103-f72f-453e-9b84-ab19e42b9462}
20:51:19.503 [elastic_agent][info] Unit state changed endpoint-default (HEALTHY->CONFIGURING): Applied policy {e50a5103-f72f-453e-9b84-ab19e42b9462}
20:51:39.515 [elastic_agent][info] Unit state changed endpoint-default-e50a5103-f72f-453e-9b84-ab19e42b9462 (CONFIGURING->HEALTHY): Applied policy {e50a5103-f72f-453e-9b84-ab19e42b9462}
20:51:39.515 [elastic_agent][info] Unit state changed endpoint-default (CONFIGURING->HEALTHY): Applied policy {e50a5103-f72f-453e-9b84-ab19e42b9462}
20:52:26.651 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
20:59:34.003 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
21:13:54.020 [elastic_agent][error] Cannot checkin in with fleet-server, retrying
21:21:18.748 [elastic_agent][error] Cannot checkin in with fleet-server, retrying

Later, the agent connected for a short period of time (I did nothing that would explain it).

I hope this helps :wink:

From the logs, it seems that Fleet Server is not coming back up after the reboot.
Do you see any Fleet Server logs? You can gather a diagnostics bundle with the elastic-agent diagnostics collect command.

I tried to reproduce the issue with Fleet Server on a different server from the Elastic stack. The problem occurs on reboot only when both servers shut down at the same time; if only one of them goes down (either one, but not both), everything works without any issue.
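To rule out a connectivity problem after such a double reboot, a quick check from the Fleet Server host is to curl Elasticsearch directly (same <elastic-ip> as in my install command; -k because there is no trusted TLS yet; an authentication error response still proves the host is reachable):

curl -sk https://<elastic-ip>:9200

If that fails while the elastic-agent service shows as running, the fleet-server component is probably stuck retrying its Elasticsearch connection.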

Here is the output on the server when running that command:

root@SRV-Grafana:~# elastic-agent diagnostics collect
[WARNING] Could not redact state.yaml due to unmarshalling error: yaml: invalid map key: map[interface {}]interface {}{"unitid":"beat/metrics-monitoring", "unittype":1}
Created diagnostics archive "elastic-agent-diagnostics-2023-02-15T12-36-50Z-00.zip"
***** WARNING *****
Created archive may contain plain text credentials.
Ensure that files in archive are redacted before sharing.
*******************
root@SRV-Grafana:~# 

I don't know how to share the zip file.

You can look at the logs in the zip yourself to check whether there are any Fleet Server-specific errors.
This forum isn't really suitable for sharing diagnostic info; you could upload it to Dropbox and share the link after sanitizing credentials.
